Wednesday, January 30, 2013

Reliability: Methodologies, Standards, and Tools


Hello dear DS readers,

Today I gave a presentation at Fraunhofer about the most widely used reliability methodologies, such as RBD, FTA, and FMEA. Here are the slides from this presentation; they give general descriptions of the classical reliability methods, industrial standards, and software tools.

Also, there is no post about these methodologies in my blog. So, in order to fill this gap, I've recorded six short video podcasts using these slides. I hope they are not too boring and perhaps even helpful.


Part 1: This is the first part, the introduction, of the presentation about reliability methodologies, standards, and tools.

Part 2: This part of the presentation describes practical reliability metrics: MTTF, MTBF, MTTR, and failure rate.


Part 3: This part of the presentation covers reliability block diagrams, fault trees, and event trees.

Part 4: This part of the presentation describes the mathematical models that can be used for reliability analysis: Markov chains and Petri nets.

Part 5: This part of the presentation discusses high-level methodologies for reliability analysis, such as Root Cause Analysis (RCA), Failure Mode and Effects Analysis (FMEA), and Hazard and Operability study (HAZOP).

Part 6: This part gives an overview of software reliability methods, standards, and tools, and concludes the presentation.

Saturday, December 15, 2012

Dual-graph Model for Error Propagation Analysis of Mechatronic Systems

The electronic version of my PhD thesis is available free of charge.

You can also purchase a hard copy at Amazon, on the publisher's web page, or on Google Books ;)

Fast abstract:

Error propagation analysis is an important part of a system development process. This thesis addresses a probabilistic description of the spreading of data errors through a mechatronic system. An error propagation model for these types of systems must use a high level of abstraction that allows the proper mapping of the mutual interaction of heterogeneous system components such as software, hardware, and physical parts.

A literature overview reveals that the most appropriate error propagation model is based on a Markovian representation of the control flow. However, despite its strong probabilistic background, this model has a significant disadvantage: it implies that data errors always propagate through the control flow. This assumption limits the application of the model for systems whose components can be triggered in an arbitrary order and whose data flow is non-sequential.

A motivational example, discussed in this thesis, shows that control and data flows must be considered separately for an accurate description of the error propagation process. For this reason, a new concept of system analysis is introduced. The central idea is a synchronous examination of two directed graphs: a control flow graph and a data flow graph. The structures of these graphs can be derived systematically during system development. Knowledge about the operational profile and the properties of individual system components allows the definition of additional parameters of the error propagation model.

A discrete time Markov chain is applied for the modeling of fault activation, error propagation, and error detection during the operation of the system. The state graph of this Markov chain can be generated automatically from the discussed dual-graph representation. A specific approach to the computation of this Markov chain makes it possible to obtain the probabilities of all erroneous and error-free system execution scenarios. This information plays a valuable role in the development of dependable systems. For instance, it can help to define an effective testing strategy, to perform accurate reliability estimation, and to speed up error detection and fault localization processes.

This thesis contains a comprehensive description of a mathematical framework of the new dual-graph error propagation model, several methods for error propagation analysis, and a case study that demonstrates key features of the application of the presented error propagation model to a typical mechatronic system. A numerical evaluation of the mechatronic system in question proves the applicability of the introduced concept.

Saturday, October 20, 2012

Software-Implemented Fault Tolerance



This post contains ideas from the lectures of Prof. Dr. Christof Fetzer (SA INF TU Dresden), and
the main part of the information has been provided by André Schmitt (Silistra).




Hello dear DS readers,

This post is devoted to the promised overview of software-based solutions that allow us to tolerate hardware faults. On the internet you can google acronyms like SWIFT (Software Implemented Fault Tolerance) or SIHFT (Software-Implemented Hardware Fault Tolerance). Both have the same meaning and describe a group of methods that help to organize, enforce, protect, and optimize your software in such a way that it becomes tolerant of bit flips. In this post I would like to discuss the general SIHFT approaches and give useful links to publications for further details.

Control flow checking

Detection of hardware errors using control flow checking provides the means for recognizing invalid control flow of the executed program. The execution of instruction sequences that are not permitted for the executed binary can be detected. Control flow checking cannot detect errors that only influence the processed data. Usually, control flow checking is only applied to inter-basic-block control flow; erroneous control flow within the boundaries of basic blocks is not detectable. Here is a simple example. We decompose the source code (on the instruction level) into basic blocks. A basic block is a sequence of instructions between two control flow jumps (branch instructions). After that, we generate unique exit signatures for all blocks. We instrument our code in such a way that at the end of each block we save this signature into a special register, and then check it at the beginning of the next block. This helps to detect bit flips in control flow instructions and prevent wrong control flow jumps (see the sketch after the references below).
  • Edson Borin, Cheng Wang, Youfeng Wu, and Guido Araujo. Software-based transparent and comprehensive control-flow error detection. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), Washington, DC, USA, 2006. IEEE Computer Society.
  • Ramtilak Vemu and Jacob A. Abraham. CEDA: Control-flow error detection through assertions. In IOLTS '06: Proceedings of the 12th IEEE International Symposium on On-Line Testing. IEEE Computer Society, 2006.
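
As a rough illustration of the signature idea (not of any specific tool from the references above), here is a minimal Python sketch. The block names, the predecessor table, and the "signature register" are assumptions for illustration only; a real tool would insert equivalent checks at the instruction level.

```python
# Minimal sketch of signature-based control flow checking (CFC).
# Block names, signatures, and the "signature register" are illustrative
# assumptions, not taken from any specific tool or paper.

LEGAL_PREDECESSORS = {          # allowed control flow edges, derived offline
    "B2": {"B1"},
    "B3": {"B1", "B2"},
}

signature_register = None       # models the special register written at block exits


class ControlFlowError(RuntimeError):
    pass


def enter_block(name):
    """Check that the previously executed block is a legal predecessor of `name`."""
    allowed = LEGAL_PREDECESSORS.get(name, set())
    if allowed and signature_register not in allowed:
        raise ControlFlowError(f"illegal jump into {name} from {signature_register}")


def exit_block(name):
    """Store the exit signature of the block that just finished."""
    global signature_register
    signature_register = name


# Instrumented "program": B1 -> B2 -> B3 is a legal path.
exit_block("B1")
enter_block("B2"); exit_block("B2")
enter_block("B3"); exit_block("B3")

# A bit flip that redirects control from B3 back into B2 would be caught:
# enter_block("B2")   # raises ControlFlowError, since B3 is not a legal predecessor
```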

Invariant-based methods

Algorithm-based fault tolerance and self-checking software use invariants to check the validity of the generated results. These methods require the existence of appropriate invariants that provide a good generic failure detection capability. However, such invariants are difficult (if not impossible) to find for most applications, or they are limited to specific algorithms and program sections. Generic invariants, such as assertion-based loop invariants, reliably detect only significant deviations of the results and may miss subtle errors or side effects (see the sketch after the references below).
  • V.K. Stefanidis and K.G. Margaritis. Algorithm based fault tolerance: Review and experimental study. In International Conference of Numerical Analysis and Applied Mathematics, 2004.
  • Hal Wasserman and Manuel Blum. Software reliability via run-time result-checking. J. ACM, 1997.
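
Here is a minimal sketch of run-time result checking in the spirit of the approaches above: the checker validates invariants of the result of a sorting routine instead of re-executing the computation. The function name and the chosen invariants are illustrative assumptions.

```python
# Minimal sketch of run-time result checking for a sorting routine.
# The checker verifies invariants of the *result* (ordering and multiset
# equality) instead of re-executing the computation.

from collections import Counter


def checked_sort(data):
    result = sorted(data)                     # the "untrusted" computation
    # Invariant 1: the output is ordered.
    assert all(a <= b for a, b in zip(result, result[1:])), "ordering violated"
    # Invariant 2: the output is a permutation of the input.
    assert Counter(result) == Counter(data), "elements lost or corrupted"
    return result


print(checked_sort([3, 1, 2]))                # [1, 2, 3]
```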


Redundancy

Other software approaches for detecting hardware errors work with replicated execution and comparison (voting) of the obtained results. The protected software is modified during or before compilation; rarely, dynamic binary instrumentation is used. Replication is applied at different levels of abstraction: a number of approaches duplicate single instructions within one thread, others execute duplicates of the whole program using several threads. These methods are similar to typical hardware reliability approaches such as dual and triple modular redundancy. Duplication of the instructions only helps to detect a bit flip; triple execution can even mask it.
  • George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee. Configurable transient fault detection via dynamic binary translation. In Proceedings of the 2nd Workshop on Architectural Reliability (WAR), 2006.
  • C. Bolchini, A. Miele, M. Rebaudengo, F. Salice, D. Sciuto, L. Sterpone, and M. Violante. Software and hardware techniques for SEU detection in IP processors. J. Electron. Test., 2008.
  • Cheng Wang, Ho-Seop Kim, Youfeng Wu, and Victor Ying. Compiler-managed software-based redundant multi-threading for transient fault detection. In International Symposium on Code Generation and Optimization (CGO), 2007.
Replicated execution and comparison methods help against bit flips in the CPU during instruction processing. In a similar way, we can tolerate bit flips in memory: for this purpose, we have to store two or three copies of our variables and compare them on read. The main drawback of these methods is the very high overhead in execution time and memory consumption (a sketch of both ideas follows the references below).
  • George A. Reis, Jonathan Chang, and David I. August. Automatic instruction-level software-only recovery. IEEE Micro, 27:36–47, January 2007.
  • George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, and David I. August. SWIFT: Software implemented fault tolerance. In Proceedings of the International Symposium on Code Generation and Optimization, CGO '05, pages 243–254, Washington, DC, USA, 2005. IEEE Computer Society.
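
The following minimal sketch illustrates both ideas at a very coarse level: duplicated or triplicated execution with comparison/voting, and duplicated storage of a variable. The helper names are my own assumptions; real SIHFT tools apply these transformations at the instruction level during compilation rather than at the function level.

```python
# Minimal sketch of software redundancy: duplicated execution with comparison
# (error detection) and triple execution with majority voting (error masking).


class SoftErrorDetected(RuntimeError):
    pass


def duplicated(f, *args):
    """Execute f twice and compare: a mismatch signals a transient fault."""
    r1, r2 = f(*args), f(*args)
    if r1 != r2:
        raise SoftErrorDetected("results of the two executions differ")
    return r1


def triplicated(f, *args):
    """Execute f three times and vote: a single corrupted result is masked."""
    r1, r2, r3 = f(*args), f(*args), f(*args)
    if r1 == r2 or r1 == r3:
        return r1
    if r2 == r3:
        return r2
    raise SoftErrorDetected("no majority among the three results")


# The same idea applied to memory: store two copies and compare on read.
def protected_store(value):
    return (value, value)


def protected_read(cell):
    v1, v2 = cell
    if v1 != v2:
        raise SoftErrorDetected("stored copies differ")
    return v1


x = protected_store(42)
print(duplicated(sum, [1, 2, 3]), triplicated(sum, [1, 2, 3]), protected_read(x))
```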

Arithmetic codes

Instead of duplication, or in addition to it, arithmetic codes can be used to detect errors. Arithmetic codes are conserved by correct arithmetic operations; that is, a correctly executed operation taking valid code words as input produces a result that is again a valid code word. Faulty arithmetic operations, on the other hand, do not conserve the code with a very high probability, that is, faulty operations result in a non-valid code word. Furthermore, arithmetic codes protect data from unnoticed modifications during storage or transport on a bus: such random modifications with high probability will not result in a valid code word. Thus, the arithmetic codes facilitate not only the detection of soft errors but can catch permanent hardware faults, too. AN, ANB, and ANBD encodings are different types of arithmetic code protections (a small AN-code sketch follows the references below). Note that in the ISO standard for the functional safety of road vehicles (ISO 26262) encoding with an ANBD-code is one of the two means to reach the ASIL-D level – the highest safety level.
  • A. Avizienis. Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design. In Transactions on Computers, 1971.
  • Nahmsuk Oh, Subhasish Mitra, and Edward J. McCluskey. ED4I: Error detection by diverse data and duplicated instructions. IEEE Trans. Comput., 51, 2002.
  • Jonathan Chang, George A. Reis, and David I. August. Automatic instruction-level software-only recovery. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Washington, USA, 2006.
  • Ute Schiffel, Martin Süßkraut, and Christof Fetzer. AN-encoding compiler: Building safety-critical systems with commodity hardware. In The 28th International Conference on Computer Safety, Reliability and Security (SafeComp 2009), 2009.
  • Ute Wappler and Christof Fetzer. Hardware failure virtualization via software encoded processing. In 5th IEEE International Conference on Industrial Informatics (INDIN 2007), 2007.
  • Ute Wappler and Christof Fetzer. Software encoded processing: Building dependable systems with commodity hardware. In The 26th International Conference on Computer Safety, Reliability and Security (SafeComp 2007), 2007.
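
Here is a minimal sketch of the simplest variant, an AN-code: every value n is stored as A*n, so the valid code words are exactly the multiples of A. The concrete constant A and the helper functions are illustrative assumptions; real encoders choose A based on the desired error detection properties.

```python
# Minimal sketch of an AN-code: a value n is represented by the code word A*n.

A = 58659                     # example constant only; not a recommendation


def encode(n):
    return n * A


def decode(c):
    if c % A != 0:
        raise RuntimeError("invalid code word: a fault has been detected")
    return c // A


# Addition conserves the code: A*x + A*y == A*(x + y).
a, b = encode(20), encode(22)
print(decode(a + b))          # 42

# A random bit flip in a code word is very unlikely to hit another multiple of A.
corrupted = a ^ (1 << 7)
decode(corrupted)             # raises: the corrupted word is not divisible by A
```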

Thursday, May 24, 2012

Flip happens :(

Hello dear DS readers,

Let me start this post with a sad story about a recent Russian space mission Phobos-Grunt.

"16 February 2012—The failure of Russia’s ambitious Phobos-Grunt sample-return probe has been shrouded in confusion and mystery, from the first inklings that something had gone wrong after its 9 November launch all the way to inconsistent reports of where it fell to Earth on 15 January." More detailed info you can find here. The image by Michael Carroll.

According to the official report of Roscosmos, the most likely cause of this failure was an SRAM fault caused by "a local influence of heavy charged particles", aka galactic cosmic rays.
This is a particular case of a well-known hardware fault, the so-called "bit flip".


A negative environmental impact, such as increased heat, lowered voltage, or cosmic radiation (as in the case of Phobos-Grunt), can corrupt a part of the system's memory. This can result in one or several bit flips, as shown in the figure. The bit flips may change the application state, for instance, the value of a critical variable. Later, during the execution of some software function, this erroneous value can be read and propagate further as a system error. Such an error may lead to various unintended consequences. Similar hardware faults can happen not only in memory, but also in the CPU or on a bus.
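
To make the effect tangible, here is a minimal sketch of a single bit flip in a program variable; the variable name and the affected bit position are arbitrary assumptions, chosen only to make the change visible.

```python
# Minimal sketch of what a single bit flip does to a critical variable.

valve_open_pressure = 100              # "critical" application value, in some unit

bit_to_flip = 5                        # suppose a charged particle flips bit 5
corrupted = valve_open_pressure ^ (1 << bit_to_flip)

print(valve_open_pressure, "->", corrupted)   # 100 -> 68
```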

A number of research projects target this problem. Roughly speaking, all of them can be classified into two groups: hardware-based and software-based. Heat- and radiation-protected hardware or memory/CPU/cache redundancy are typical hardware-based solutions. However, these approaches usually have a number of disadvantages, like high cost, limited markets, and extremely low performance.

The second group contains software-based approaches to bit-flip detection and masking. In my opinion, these solutions are much more advanced and interesting, and they fit better into the scope of the DS blog. In the next post I plan to give an overview of existing methods and even tools to cope with bit flips.

Here, as a teaser, I want to share the following fantastic video created by my colleagues from the German R&D company Silistra.


Thursday, February 9, 2012

Markov-based Reliability Models. Part 2.

Hello dear readers,

Let me continue with the Markov-based reliability models. This post discusses how system-level reliability can be estimated with the help of control flow analysis. Roger C. Cheung from Bell Labs published one of the first articles devoted to this idea in 1980: "A User-Oriented Software Reliability Model". In this post I would like to discuss this concept using a simple example.

Figure A

Assume that we have a system of three modules, as shown in Figure A. Cheung used the term "modules"; however, these can be any other executable software elements, such as methods, functions, or services, or even hardware activities. The key point is that we know the control flow structure.

The control flow of this example is rather simple: there is a decision block after Module 1 that defines which of the modules will be executed next. Assume that we know the preliminary operational profile of this system: after the execution of Module 1, in 5 cases out of 10 the control returns to Module 1, in 3 cases it is passed to Module 3, and in 2 cases Module 2 is executed. This information makes it possible to define the control flow of this system as a discrete time Markov chain. The state graph of this chain is shown in Figure B. Each node of this graph represents the execution of the corresponding module. This Markovian representation allows numerical computation of the probabilities of execution of each module. However, even without this computation we can see that Module 1 can be executed several times, Module 2 will be executed only with the probability of 0.2, and Module 3 will be executed exactly once during the execution of the entire system.




Figure B, Figure C

The modules are considered to be faulty, and the faults can be activated during the execution of the modules. It is assumed that a fault activation results in an immediate failure of the system and that we know the fault activation probabilities of each module: f1, f2, and f3, respectively.

The final goal is to compute the probability that the system will not fail during its execution. It is obvious that we have to consider two aspects, the probability of execution and the probability of fault activation for each module.

Cheung proposes to transform the original state graph into the one shown in Figure C. In order to do that, we add two additional states, "OK" and "FAIL". After that, we add an arc from the node "State 3" to the node "OK", weighted with the probability 1-f3. This arc represents the successful completion of system execution. Also, we add three arcs, one from each node of the original graph, to the node "FAIL". These arcs are weighted with the corresponding probabilities of fault activation: f1, f2, and f3. The weights of the remaining outgoing arcs are recomputed so that the original proportions are preserved (see Figure C).

This newly generated graph describes one run of the system: it starts with the execution of Module 1 and ends either with the successful execution of Module 3 (state OK) or with a system failure (state FAIL). The mathematical framework of absorbing discrete time Markov chains enables the computation of the probabilities of the process ending in one of these two states (see this book chapter for details). The probability that the process ends in the state OK is exactly the desired reliability of this system. A small numerical sketch of this computation is shown below.
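
Here is a minimal numerical sketch of this computation for the example above. Since I do not reproduce the exact structure of Figure B here, I assume (for illustration only) that Module 2 always passes control to Module 3, that Module 3 terminates the run, and that the fault activation probabilities are f1 = 0.01, f2 = 0.02, f3 = 0.005; all of these values are assumptions.

```python
# Minimal sketch of Cheung's reliability computation via an absorbing DTMC.

import numpy as np

f = {1: 0.01, 2: 0.02, 3: 0.005}     # assumed fault activation probabilities

# Transient states: Module 1, Module 2, Module 3 (in this order).
# Each original transition probability p is rescaled to p*(1 - f_i); the
# remaining probability mass f_i goes to the absorbing state FAIL.
Q = np.array([
    [0.5 * (1 - f[1]), 0.2 * (1 - f[1]), 0.3 * (1 - f[1])],   # from Module 1
    [0.0,              0.0,              1.0 * (1 - f[2])],   # from Module 2 (assumed: always to Module 3)
    [0.0,              0.0,              0.0],                # from Module 3
])

# Absorbing states: OK (reached from Module 3 with probability 1 - f3) and FAIL.
R = np.array([
    [0.0,      f[1]],
    [0.0,      f[2]],
    [1 - f[3], f[3]],
])

# Fundamental matrix N = (I - Q)^-1; B = N @ R gives the absorption probabilities.
N = np.linalg.inv(np.eye(3) - Q)
B = N @ R

reliability = B[0, 0]   # probability of ending in OK when starting in Module 1
print(f"estimated system reliability: {reliability:.4f}")
```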

In conclusion, I want to say that this model describes the simplest way of reliability estimation with respect to the system control flow and contains strong assumptions and restrictions, such as:

1) System control flow must satisfy the Markov property: Future states of the process depend only upon the present state, not on the sequence of previous events. 
2) Fault activations in each module are considered to be independent events. 
3) A fault activation results in immediate system failure.

and so on.

Nevertheless, keep in mind that it was presented in 1980! Nowadays, analysts use this idea to create more advanced reliability models that consider, e.g., the continuous nature of system execution, conditional probabilities of fault activation, the error propagation phenomenon, and many other important factors.

Tuesday, December 6, 2011

Markov-based Reliability Models. Part 1.

Hello dear readers,

Let me discuss a group of reliability models that have one common feature. These models are based on Markov chains.



Some history. In 1906, Andrey Markov presented the study of an important new type of chance processes, in which the outcome of a given experiment can affect the outcome of the next experiment. The term "chain" was first used by Markov in [A. A. Markov. Rasprostranenie zakona bol'shih chisel na velichiny, zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete, 15(2):135–156, 1906.]. He produced the first results for these processes purely theoretically. A generalization to countably infinite state spaces was given later by A. Kolmogorov in 1936. Markov chains are related to Brownian motion and the ergodic hypothesis, two topics in physics which were important in the early years of the twentieth century, but Markov appears to have pursued this out of a mathematical motivation, namely the extension of the law of large numbers to dependent events.
Nowadays, Markov chains are widely used in the engineering domain for system analysis, modeling, and estimation of various non-functional system properties. Countless books describe different types of Markov chains (discrete, continuous, absorbing, ergodic, regular, etc.) using a bunch of complex probabilistic terms. The most understandable representation of a Markov chain is a directed state graph. The nodes of this graph define the state space of the system; the arcs represent transitions from one state to another and are weighted with transition probabilities. The sum of the transition probabilities of the outgoing arcs of each node equals 1.
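
As a minimal sketch of this representation, the transition matrix below encodes such a state graph; the state names and probabilities are arbitrary illustrative values, not taken from any real system.

```python
# Minimal sketch of a Markov chain as a transition matrix.

import numpy as np

states = ["up", "degraded", "down"]
P = np.array([
    [0.90, 0.08, 0.02],   # outgoing arcs of "up"
    [0.50, 0.40, 0.10],   # outgoing arcs of "degraded"
    [0.00, 0.00, 1.00],   # "down" is an absorbing state
])

# The outgoing arcs of each node must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Probability distribution over the states after 10 steps, starting in "up".
p0 = np.array([1.0, 0.0, 0.0])
print(dict(zip(states, p0 @ np.linalg.matrix_power(P, 10))))
```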

The next example gives an intuition into the application of a Markov chain to simple reliability analysis. Assume we have a faulty system that starts its regular operation. During this operation, a fault can be activated with a known probability P_FA. The fault activation leads to an erroneous system state. However, with the probability P_ED the error can be detected. After that, the error can be corrected with a probability P_EC, which restores the original system state; otherwise, the system stops with an error message (fail stop) in order to prevent a system failure. If the erroneous system state is not identified, the system fails.
This Markov chain describes the behavior of the discussed system. 'Regular system operation' is the initial state of the system. We can see that the system moves to the erroneous state with the probability P_FA and successfully completes its operation with the probability (1-P_FA). Error detection behavior is modeled in the same manner. The final states 'Intended completion', 'Fail stop', and 'System failure' represent the three possible system execution scenarios. If P_FA, P_ED, and P_EC are known, we are able to compute the probabilities of these scenarios. For instance, the probability of a system failure equals

P_SF = P_FA*(1-P_ED) + (P_FA*P_ED*P_EC)*P_FA*(1-P_ED) + (P_FA*P_ED*P_EC)^2*P_FA*(1-P_ED) + ... = P_FA*(1-P_ED) / (1 - P_FA*P_ED*P_EC)

(1-P_SF) represents the probability of failure-free system execution and can be considered a measure of system reliability.
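
As a small sanity check, the sketch below sums the series numerically and compares it with the closed form; the concrete probability values are assumptions picked only for illustration.

```python
# Minimal sketch that checks the closed-form expression for P_SF numerically.

P_FA, P_ED, P_EC = 0.10, 0.90, 0.80   # assumed illustrative probabilities

# Sum the series: k detected-and-corrected cycles, then an undetected activation.
p_sf_series = sum((P_FA * P_ED * P_EC) ** k * P_FA * (1 - P_ED) for k in range(1000))

# Closed form from the post.
p_sf_closed = P_FA * (1 - P_ED) / (1 - P_FA * P_ED * P_EC)

print(p_sf_series, p_sf_closed)        # both ~0.01078
reliability = 1 - p_sf_closed          # probability of failure-free execution
```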

This trivial example demonstrates the very general idea of the application of Markov chains to system reliability analysis. The state space of the Markov chain can be much bigger and can, e.g., distinguish between fault activations in different system components and/or different types of faults. The arcs can also represent a variety of system activities besides fault activations; for example, error propagation or even the control flow between the system components can be taken into account. In the next post I will discuss several more advanced Markov-based reliability models.

Wednesday, October 5, 2011

Keep it Simple, Keep it Reliable

Hello dear readers,

In this post I want to continue the discussion on software complexity and software reliability. The main idea was already formulated by Einstein:

"Everything should be as simple as it is, but not simpler."

Now, let me talk a bit about existing SW reliability models (SRM). The earliest concepts of SW reliability engineering were adapted from the older techniques of HW reliability. However, the application of hardware methods to software has to be done with care, since there are fundamental differences in the nature of hardware and software faults. For the last 20 years, software reliability engineering has been a separate domain. H. Pham gave a classification of current software reliability models in his book "System Software Reliability". According to it, there are the following groups of SRM: error seeding models, failure rate models, curve fitting models, reliability growth models, time-series models, and non-homogeneous Poisson process models. These models are based on software metrics like lines of code, number of operators and operands, cyclomatic complexity, object-oriented metrics, and many others. You can find an overview of SW complexity metrics in "Object-oriented metrics - a survey" and "A survey of software metrics".

All of the SRM defined above are black-box models that consider the software as an indivisible entity. A separate domain, which is more interesting for me, contains the so-called architecture-based SRM, like the ones described in "Architecture-based approach to reliability assessment of software systems" and "An analytical approach to architecture-based software performance and reliability prediction". These types of models consider software as a system of components with given failure rates or fault activation probabilities (which can be evaluated using the black-box models). The reliability of the entire system can be evaluated by processing information about the system architecture, the failure behavior, and the properties of the single components. Most of these models are based on probabilistic mathematical frameworks like various Markov chains, stochastic Petri nets, stochastic process algebras, and probabilistic queueing networks. The architecture-based models help not only to evaluate reliability but also to detect unreliable parts of the system.

Returning to the topic, I want to refer to a very simple principle: the simpler the SW, the more reliable it is. This idea is very transparent. The majority of SW faults are actually bugs that have been introduced during SW design or implementation. Complex SW contains more bugs; hence, the probability that one of these bugs will be activated is higher. This fact can be confirmed by any SRM. To make a reliable SW system, you have to define the function of this system very strictly and clearly, and develop the SW just for this function. This principle may sound too straightforward, but it will help you to obtain a system "as simple as possible, but not simpler".