Saturday, October 20, 2012

Software-Implemented Fault Tolerance



This post contains ideas from lectures of Prof. Dr. Christof Fetzer (SA INF TU Dresden) and
the main part of the info has been provided by André Schmitt (Silistra).




Hello dear DS readers,

This post is devoted to the promised overview of software-based solutions that allow us to tolerate hardware faults. Searching the internet, you will find acronyms such as SWIFT (Software Implemented Fault Tolerance) or SIHFT (Software-Implemented Hardware Fault Tolerance). Both have the same meaning and describe a group of methods that help to organize, reinforce, protect, and optimize your software in such a way that it tolerates bit flips. In this post I would like to discuss the general SIHFT approaches and give useful links to publications for further details.

Control flow checking

Control flow checking detects hardware errors by recognizing invalid control flow in the executed program: it catches the execution of instruction sequences that are not permitted for the given binary. It cannot detect errors that influence only the processed data. Usually, control flow checking is applied only to control flow between basic blocks, so erroneous control flow within the boundaries of a basic block is not detectable.

Here is a simple example. We decompose the code (at the instruction level) into basic blocks. A basic block is a sequence of instructions between two control flow jumps (branch instructions). We then generate a unique exit signature for each block and instrument the code in such a way that each block saves its signature into a special register at its end, and the next block checks that register at its beginning. This helps to detect bit flips in control flow instructions and to catch wrong control flow jumps.
  • Edson Borin, Cheng Wang, Youfeng Wu, and Guido Araujo. Software-based transparent and comprehensive control-flow error detection. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), Washington, DC, USA, 2006. IEEE Computer Society.
  • Ramtilak Vemu and Jacob A. Abraham. CEDA: Control-flow error detection through assertions. In IOLTS ’06: Proceedings of the 12th IEEE International Symposium on On-Line Testing. IEEE Computer Society, 2006.
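The signature scheme from the example above can be sketched in a few lines of Python. This is only a toy simulation, not the instrumentation used in the cited papers: the three block names, the signature values, and the `execute` helper are all assumptions made for illustration, and an ordinary variable stands in for the special register.

```python
class ControlFlowError(Exception):
    """Raised when a block is entered from a predecessor that is not permitted."""


START = 0x00  # value of the signature register before the first block runs

# Hypothetical program with three basic blocks and their exit signatures.
EXIT_SIGNATURE = {"entry": 0x1A, "loop": 0x2B, "exit": 0x3C}

# Signatures each block accepts on entry, derived from the legal edges
# entry -> loop, loop -> loop, loop -> exit.
ACCEPTED_ON_ENTRY = {
    "entry": {START},
    "loop": {EXIT_SIGNATURE["entry"], EXIT_SIGNATURE["loop"]},
    "exit": {EXIT_SIGNATURE["loop"]},
}


def execute(trace):
    """Run the blocks in `trace`, checking the register at every block entry."""
    register = START
    for block in trace:
        if register not in ACCEPTED_ON_ENTRY[block]:
            raise ControlFlowError(
                f"illegal jump into {block!r} (register={register:#x})"
            )
        # ... the instructions of the basic block would execute here ...
        register = EXIT_SIGNATURE[block]  # save the exit signature
    return register


execute(["entry", "loop", "loop", "exit"])  # legal path, passes silently

try:
    execute(["entry", "exit"])  # a corrupted branch skips the loop
except ControlFlowError as err:
    print("detected:", err)
```

Note that a fault inside a block never touches the register, which is why errors within a basic block stay invisible to this kind of check.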

Invariant-based methods

Algorithm-based fault tolerance and self-checking software use invariants to check the validity of the generated results. These methods require appropriate invariants that provide good generic failure detection capability. However, such invariants are difficult (if not impossible) to find for most applications, or they are limited to specific algorithms and program sections. Generic invariants, such as assertion-based loop invariants, reliably detect only significant deviations in the results and may miss subtle errors or side effects.
  • V.K. Stefanidis and K.G. Margaritis. Algorithm based fault tolerance: Review and experimental study. In International Conference of Numerical Analysis and Applied Mathematics, 2004.
  • Hal Wasserman and Manuel Blum. Software reliability via run-time result-checking. J. ACM, 1997.
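To make the idea of an invariant more concrete, here is a small sketch of a classic algorithm-based fault tolerance example that is not taken from this post: checksum-protected matrix multiplication. The invariant is that the extra checksum row and column of the product must equal the sums of its data entries. The helper names are hypothetical, and integers are used so that the comparison is exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append a row that holds the sum of every column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to every row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksums_hold(c):
    """Invariant: the last entry of each row and column equals the sum of the rest."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c))
    return rows_ok and cols_ok


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the two checksum matrices carries its own checksums.
c = matmul(with_column_checksum(a), with_row_checksum(b))
print(checksums_hold(c))

c[0][1] ^= 1 << 2  # simulate a bit flip in one result element
print(checksums_hold(c))
```

The sketch also shows the limitation mentioned above: the invariant exists only because matrix multiplication happens to preserve checksums, and most programs offer nothing comparable.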


Redundancy

Other software approaches for detecting hardware errors rely on replicated execution and comparison (voting) of the obtained results. The protected software is modified during or before compilation; dynamic binary instrumentation is used only rarely. Replication is applied at different levels of abstraction: some approaches duplicate single instructions within one thread, others execute duplicates of the whole program in several threads. These methods are similar to typical hardware reliability approaches such as dual and triple modular redundancy. Duplicated execution can only detect a bit flip, whereas triple execution can also mask it.
  • George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee. Configurable transient fault detection via dynamic binary translation. In Proceedings of the 2nd Workshop on Architectural Reliability (WAR), 2006.
  • C. Bolchini, A. Miele, M. Rebaudengo, F. Salice, D. Sciuto, L. Sterpone, and M. Violante. Software and hardware techniques for SEU detection in IP processors. J. Electron. Test., 2008.
  • Cheng Wang, Ho-seop Kim, Youfeng Wu, and Victor Ying. Compiler-managed software-based redundant multi-threading for transient fault detection. In International Symposium on Code Generation and Optimization (CGO), 2007.
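The difference between double and triple execution can be sketched as follows. This is a simplified illustration, not the compiler-level transformation from the papers above: `FlakyAdder` is a made-up stand-in for a processor that suffers one transient fault, and `run_duplicated` and `run_triplicated` are hypothetical helper names.

```python
from collections import Counter


class FaultDetected(Exception):
    """Raised when the replicas disagree and no majority exists."""


class FlakyAdder:
    """Adder that flips one result bit on a chosen call (simulated transient fault)."""

    def __init__(self, faulty_call):
        self.calls = 0
        self.faulty_call = faulty_call

    def __call__(self, a, b):
        self.calls += 1
        result = a + b
        if self.calls == self.faulty_call:
            result ^= 1 << 3  # the injected bit flip
        return result


def run_duplicated(op, *args):
    """Execute twice and compare: detects a fault but cannot tell which copy is right."""
    first, second = op(*args), op(*args)
    if first != second:
        raise FaultDetected(f"replicas disagree: {first} != {second}")
    return first


def run_triplicated(op, *args):
    """Execute three times and vote: a single faulty result is outvoted (masked)."""
    results = [op(*args) for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise FaultDetected(f"no majority among {results}")
    return value


print(run_triplicated(FlakyAdder(faulty_call=2), 20, 22))  # fault is masked

try:
    run_duplicated(FlakyAdder(faulty_call=2), 20, 22)
except FaultDetected as err:
    print("detected:", err)
```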
Replicated execution and comparison help against bit flips in the CPU during instruction processing. Bit flips in memory can be tolerated in a similar way: we store two or three copies of each variable and compare them on read. The main drawback of these methods is a very high overhead in both execution time and memory consumption.
  • George A. Reis, Jonathan Chang, and David I. August. Automatic instruction-level software-only recovery. IEEE Micro, 27:36–47, January 2007.
  • George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software implemented fault tolerance. In Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05, pages 243–254, Washington, DC, USA, 2005. IEEE Computer Society.
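The same voting idea applied to memory might look like the sketch below, in which a variable keeps three copies that are compared on every read. The class name `TripleStoredInt` and the repair-on-read behaviour are assumptions chosen for illustration, and the threefold memory cost mentioned above is directly visible in the `copies` list.

```python
class UncorrectableError(Exception):
    """Raised when all three copies differ, so no majority exists."""


class TripleStoredInt:
    """Integer stored three times; reads vote among the copies and repair them."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        a, b, c = self.copies
        if a == b or a == c:
            good = a
        elif b == c:
            good = b
        else:
            raise UncorrectableError(f"all copies differ: {self.copies}")
        self.copies = [good, good, good]  # overwrite a corrupted copy, if any
        return good


x = TripleStoredInt(1000)
x.copies[1] ^= 1 << 5  # simulate a bit flip in one stored copy
print(x.read())        # the two intact copies outvote the corrupted one
```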

Arithmetic codes

Instead of duplication, or in addition to it, arithmetic codes can be used to detect errors. Arithmetic codes are preserved by correct arithmetic operations: a correctly executed operation that takes valid code words as input produces a result that is again a valid code word. Faulty arithmetic operations, on the other hand, do not preserve the code with very high probability, that is, they produce an invalid code word. Furthermore, arithmetic codes protect data from unnoticed modifications during storage or transport on a bus, because such random modifications are very unlikely to yield a valid code word. Thus, arithmetic codes facilitate not only the detection of soft errors but can catch permanent hardware faults, too. AN, ANB, and ANBD encodings are different types of arithmetic code protection. Note that in the ISO standard for the functional safety of road vehicles (ISO 26262), encoding with an ANBD code is one of the two means to reach ASIL-D, the highest safety level.
  • A. Avizienis. Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design. In Transactions on Computers, 1971.
  • Nahmsuk Oh, Subhasish Mitra, and Edward J. McCluskey. ED4I: Error detection by diverse data and duplicated instructions. IEEE Trans. Comput., 51, 2002.
  • Jonathan Chang, George A. Reis, and David I. August. Automatic instruction-level software-only recovery. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Washington, USA, 2006.
  • Ute Schiffel, Martin Süßkraut, and Christof Fetzer. AN-encoding compiler: Building safety-critical systems with commodity hardware. In The 28th International Conference on Computer Safety, Reliability and Security (SafeComp 2009), 2009.
  • Ute Wappler and Christof Fetzer. Hardware failure virtualization via software encoded processing. In 5th IEEE International Conference on Industrial Informatics (INDIN 2007), 2007.
  • Ute Wappler and Christof Fetzer. Software encoded processing: Building dependable systems with commodity hardware. In The 26th International Conference on Computer Safety, Reliability and Security (SafeComp 2007), 2007.
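As a rough illustration of how an arithmetic code is preserved by correct operations, here is a minimal sketch of the plain AN code only, without the additional B and D components. The constant `A` is an arbitrary odd value picked for this example, and the function names are hypothetical rather than taken from any of the cited tools. A value x is stored as A * x, so every valid code word is a multiple of A.

```python
A = 58659  # illustrative odd constant, an assumption of this sketch


class CodeError(Exception):
    """Raised when a value is not a valid code word."""


def encode(x):
    return A * x


def check(word):
    if word % A != 0:
        raise CodeError(f"{word} is not a multiple of {A}")


def decode(word):
    check(word)
    return word // A


def add(v, w):
    # A*x + A*y == A*(x + y): a correct addition yields a valid code word again.
    return v + w


total = add(encode(20), encode(22))
print(decode(total))

corrupted = total ^ (1 << 7)  # simulate a bit flip in storage or on the bus
try:
    decode(corrupted)
except CodeError as err:
    print("detected:", err)
```

A single bit flip changes the word by a power of two, which is never a multiple of an odd A greater than one, so in this sketch such a flip always fails the check. Multiplication is not covered here: the product of two code words carries a factor of A squared and needs a correction step.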