Considerations Regarding the Design and Reliability Analysis of Safety Critical Systems
VARI-KAKAS Ștefan, POSZET Otto
University of Oradea, Romania,
Department of Computers and Information Technology,
Faculty of Electrical Engineering and Information Technology,
Universității Street, No. 1, 410087, Oradea, Romania,
E-Mail: {vari, poszet}@uoradea.ro
Abstract – In the design of a safety-critical system, more rigorous parameters must be taken into account than for ordinary systems. These refer mainly to the reliability to be achieved by the computerized system. A preferred way to meet these needs is the use of fault-tolerant architectures. The specification of the reliability requirements can be interpreted in different ways during the design, leading to different solutions. We present a case study of a hypothetical fault-tolerant control system for aircraft and discuss the consequences of meeting the imposed specifications.
Keywords: safety-critical; reliability; failure rate; fault tolerance; triple modular redundancy; mission time.
I. INTRODUCTION
Nowadays there are more and more applications in which the desired functions of a system are provided by embedded computers. Some of these embedded systems are used in environments where they could endanger human life or property. Such systems are called safety-critical systems: a failure can lead to injury or death of persons, or to serious property or environmental damage. Typical examples are found in space exploration, avionics, nuclear and chemical plants, healthcare technology and the automotive industry. In the development of safety-critical systems there is a need to achieve a level of safety that corresponds to the acceptable risk [1].
In the design of safety-critical systems, supplementary measures are necessary to mitigate the consequences of failures. The failures that can put the system in an unsafe state and lead to an accident are called critical failures. As a quantitative measure, safety is the probability that no critical failure will occur during system operation over a specified period of time. Note that this definition is similar to the definition of reliability: the probability that the system will operate correctly for a period of time under the specified conditions. So, if we take into account only the critical failures, the reliability assessment is the same as the safety assessment, and this is the approach used in this paper. The safety problem can be treated as a reliability problem for critical failures, and thus safety enhancement is obtainable through reliability enhancement.
A widely used design method for improving the reliability of computer systems is fault tolerance [2]. This is a built-in capability of the system to provide continuous correct behavior in the presence of a limited number of hardware and software faults. For a safety-critical system, only the faults that can lead to a critical failure need to be considered; these can be identified by failure modes and effects analysis [3].
For a safety-critical system it is crucial to formulate clear requirements for the designer, including safety specifications [4]. In the rest of this paper we investigate how these specifications can be met in the design and what the effect is of restricting some of the specifications to the mission time of the system.
II. DESIGN PROBLEMS OF SAFETY-CRITICAL SYSTEMS
The design of a new system should begin with the analysis phase, during which the goals of the system are identified with respect to the interaction with its environment. In this step the specification of the target system will be obtained, which contains both functional and non-functional requirements.
The non-functional requirements refer, among others, to the safety constraints to be achieved. These should be established in accordance with the legislation, particular regulations, standards and other relevant guidelines. An exact specification should be consistent (realistic demands), complete (behavior for all input combinations) and authoritative (without doubts).
For a safety-critical system it is important to investigate the specifications with respect to the needed safety requirements [5]. It is necessary to identify all the critical failures and the special features to be included in the system in order to meet the safety requirements. The usual methods for this examination are fault tree analysis, Markov models, failure mode, effects and criticality analysis (FMECA), and common mode failure analysis.
The next step consists in the hardware and software design of the target system based on the specified functional and non-functional requirements and interface definitions. During this activity principles and techniques for reducing the potential consequences of critical failures should be used. These techniques include mainly some form of redundancy to achieve fault tolerance, but at the other extreme there are cases when a fail-stop operation is sufficient.
Finally, the resulting design documents and code should be validated against the specifications. Besides functionality checks, this validation must include failure analysis, availability and maintainability checks, and integrity and robustness verification; the behavior in case of external threats also has to be considered. Many computer simulation tools can be used in the validation phase. If not all the requirements are confirmed, the design is returned to the previous stage for corrections.
III. AN APPLICATION OF FAULT TOLERANCE FOR SAFETY-CRITICAL SYSTEMS
In fly-by-wire avionics, the pilot’s commands generate electronic signals which are processed by a flight control computer (FCC) and sent to the actuators by wires [6]. The first commercial airliner to fly with this technology was the Airbus A320 in 1987, followed by the Boeing 777 in 1994. The aviation regulations (FAR/JAR, EASA) impose strict safety and availability requirements and give a list of equipment and functions which need to be operative for safe flight and landing [7][8]. The most important specifications require that
– the probability of a critical failure needs to be less than $10^{-9}$ per flight hour, and
– the control system should operate even with one defective component.
Besides these conditions there are real-time requirements, robustness prescriptions and the possibility of “gracefully degrading” some services while still allowing a safe landing. The value of $10^{-9}$ per hour is also expressed as 1 FIT (failure in time).
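As a quick check on the units, the conversion between a per-hour failure rate and FIT can be sketched in Python. The function names below are illustrative assumptions; the only fact used is that 1 FIT corresponds to $10^{-9}$ failures per hour.

```python
# Assumed helper names; 1 FIT = 1e-9 failures per hour.
FIT_PER_UNIT_RATE = 1e9


def per_hour_to_fit(rate_per_hour: float) -> float:
    """Convert a failure rate in failures/hour to FIT."""
    return rate_per_hour * FIT_PER_UNIT_RATE


def fit_to_per_hour(fit: float) -> float:
    """Convert a failure rate in FIT to failures/hour."""
    return fit / FIT_PER_UNIT_RATE


# The regulatory target of 1e-9 per flight hour, and a much larger rate of
# 1e-3 per hour for comparison.
print(f"{per_hour_to_fit(1e-9):.3g} FIT")
print(f"{per_hour_to_fit(1e-3):.3g} FIT")
```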
To satisfy these requirements, aircraft manufacturers use fault-tolerant architectures for the design of the flight control system (FCS). Airbus uses five self-checking FCCs (3 primary, 2 secondary), and each channel has separate hardware and different software (only for critical functionality in secondary computers) [9]. The Boeing solution is based on a triple-triple redundant architecture with three identical primary flight computers (PFCs) [10]. Each PFC is composed of three dissimilar computing lanes. The output is chosen by majority voting.
A possible hypothetical fault-tolerant FCS architecture is presented in Fig. 1.b (Fig. 1.a shows the non-tolerant simplex system for comparison). This architecture is based on the triple modular redundancy (TMR) scheme, arranged on two hierarchical levels. At the first level each TMR group can tolerate the failure of one FCC thanks to majority voting (V). At the upper level the voting on the data received on the redundant buses is performed by a flux summing circuit [11], which makes operation possible with only two TMR groups. Consequently, the minimum number of functional FCCs needed is four (two in each of two groups).
a) the simplex system
b) the fault tolerant system
FIG.1. The block diagram of the FCS
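The claim that at least four functional FCCs are needed can be checked by enumerating all failure patterns of the nine computers. The sketch below is only an illustration under stated assumptions: the voters and the flux summing circuit are treated as fault-free, and the grouping of FCCs into consecutive triples is an arbitrary indexing choice.

```python
from itertools import product


def tmr_ok(group):
    """A TMR group delivers a correct output if at least 2 of its 3 FCCs work."""
    return sum(group) >= 2


def fcs_ok(state):
    """state: 9 booleans (True = working), three consecutive FCCs per TMR group.

    The second level needs at least two correct TMR groups.
    """
    groups = (state[0:3], state[3:6], state[6:9])
    return sum(tmr_ok(g) for g in groups) >= 2


# Enumerate all 2**9 combinations of working/failed FCCs.
operational = [s for s in product((True, False), repeat=9) if fcs_ok(s)]

# Smallest number of working FCCs among all operational states.
min_working = min(sum(s) for s in operational)

print("operational states:", len(operational))
print("minimum number of working FCCs:", min_working)
```

Note that four working FCCs are necessary but not sufficient: they must be distributed as two per group in two of the groups.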
IV. NUMERICAL RESULTS AND DISCUSSION
For simplicity of the calculations we assume that the probability of a critical failure (referred to in Section III) is $10^{-3}$ per hour (that is, $10^{6}$ FIT).
Let R denote the reliability of an FCC; this is also the reliability of the simplex system. For the fault-tolerant case, the reliability of a first-level TMR is given as the sum of the probabilities of the four possible operational states (all three FCCs working, or exactly two of the three working):
$$R_{TMR} = R^3 + 3R^2(1-R) \qquad (1)$$
The reliability of the whole FCS, as the second-level TMR, can be expressed in a similar manner:
$$R_{FCS} = R_{TMR}^3 + 3R_{TMR}^2(1-R_{TMR}) \qquad (2)$$
Substituting (1) into (2) we obtain:
$$R_{FCS} = 27R^4 - 36R^5 - 42R^6 + 108R^7 - 72R^8 + 16R^9 \qquad (3)$$
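The expansion leading to equation (3) is easy to get wrong by hand, so a small check may be useful. The sketch below represents polynomials in R as coefficient lists (index = power of R) and applies the 2-out-of-3 formula twice; the helper names are assumptions made for this illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_add(a, b):
    """Add two polynomials, padding the shorter one with zeros."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def poly_scale(a, k):
    """Multiply every coefficient by the constant k."""
    return [k * x for x in a]


def two_of_three(p):
    """p^3 + 3p^2(1 - p), simplified to 3p^2 - 2p^3, for a polynomial p."""
    p2 = poly_mul(p, p)
    p3 = poly_mul(p2, p)
    return poly_add(poly_scale(p2, 3), poly_scale(p3, -2))


R = [0, 1]                    # the polynomial "R"
r_tmr = two_of_three(R)       # first-level TMR
r_fcs = two_of_three(r_tmr)   # second-level TMR applied to the first level

# Print the non-zero terms of the resulting polynomial.
for power, coeff in enumerate(r_fcs):
    if coeff:
        print(f"{coeff:+d} * R^{power}")
```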
Since an FCC is electronic equipment, its time to failure can be assumed to follow an exponential distribution, so
$$R = R(t) = e^{-\lambda t} \qquad (4)$$
where $\lambda$ = constant is the corresponding failure rate. The mean time to failure (MTTF) is given by the well-known formula:
$$MTTF = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda} \qquad (5)$$
Substituting (4) in (3), the time function for the reliability of the FCS is:
$$R_{FCS}(t) = 27e^{-4\lambda t} - 36e^{-5\lambda t} - 42e^{-6\lambda t} + 108e^{-7\lambda t} - 72e^{-8\lambda t} + 16e^{-9\lambda t} \qquad (6)$$
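The comparative graphs of the reliability functions for the two systems are presented in Fig. 2. It is clear that the fault-tolerant property (obtained by hardware redundancy) does not improve the reliability over the whole interval $t \in [0, +\infty)$, so we can conclude that
– the reliability calculations make sense only in the time interval that covers the mission time of the system, and
– $MTTF_{FCS}$ is not relevant, because it is computed on the entire time axis:
$$MTTF_{FCS} = \int_0^{\infty} R_{FCS}(t)\,dt = \frac{1}{\lambda}\left(\frac{27}{4} - \frac{36}{5} - \frac{42}{6} + \frac{108}{7} - \frac{72}{8} + \frac{16}{9}\right) \approx \frac{0.756}{\lambda} \qquad (7)$$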
The MTTF of the fault-tolerant system is approximately 25% shorter than the MTTF of the simplex solution. This raises one of the questions we want to discuss: if the value of the “probability of a failure per hour” is imposed in the specifications, can we interpret its inverse as the expected lifetime ($MTTF_{FCS}$) of the system? Probably not, because of the previous considerations and the fact that the failure rate of the fault-tolerant system is not constant, as we will see below.
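The MTTF comparison can be reproduced exactly by integrating the reliability polynomial term by term, since the integral of $e^{-k\lambda t}$ over $[0, +\infty)$ is $1/(k\lambda)$. The following sketch uses exact rational arithmetic; the dictionary of coefficients is taken from the polynomial for $R_{FCS}$, and the variable names are assumptions.

```python
from fractions import Fraction

# Coefficient of e^(-k * lambda * t) in R_FCS(t), keyed by k.
COEFFS = {4: 27, 5: -36, 6: -42, 7: 108, 8: -72, 9: 16}

# MTTF_FCS expressed in units of 1/lambda (the simplex MTTF).
mttf_factor = sum(Fraction(c, k) for k, c in COEFFS.items())

# Relative shortening of the MTTF with respect to the simplex system.
shortening = 1 - mttf_factor

print("MTTF_FCS * lambda =", mttf_factor, "=", float(mttf_factor))
print("relative shortening =", float(shortening))
```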
FIG.2. The reliability function
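The point where the two reliability curves cross can be located from the structure of the 2-out-of-3 formula: $3p^2 - 2p^3$ maps 0.5 to 0.5, so the redundant system is better than the simplex one only while the reliability of a single FCC is above 0.5, that is for $\lambda t < \ln 2$. A minimal numerical illustration, with function names chosen here as assumptions:

```python
import math


def two_of_three(p: float) -> float:
    """Probability that at least 2 of 3 independent units work."""
    return 3 * p**2 - 2 * p**3


def r_fcs(r: float) -> float:
    """Two-level TMR reliability as a function of the FCC reliability r."""
    return two_of_three(two_of_three(r))


# Compare the fault-tolerant and simplex reliabilities for a few FCC values.
for r in (0.99, 0.9, 0.6, 0.5, 0.4, 0.1):
    verdict = "better" if r_fcs(r) > r else "not better"
    print(f"R = {r:4.2f}  R_FCS = {r_fcs(r):.6f}  ({verdict} than simplex)")

# Normalized crossover time lambda * t at which R = exp(-lambda * t) = 0.5.
crossover = math.log(2)
print("crossover at lambda * t =", crossover)
```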
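The failure rate of the fault-tolerant FCS can be obtained as:
$$\lambda_{FCS}(t) = -\frac{1}{R_{FCS}(t)}\frac{dR_{FCS}(t)}{dt} = \lambda\,\frac{108e^{-4\lambda t} - 180e^{-5\lambda t} - 252e^{-6\lambda t} + 756e^{-7\lambda t} - 576e^{-8\lambda t} + 144e^{-9\lambda t}}{27e^{-4\lambda t} - 36e^{-5\lambda t} - 42e^{-6\lambda t} + 108e^{-7\lambda t} - 72e^{-8\lambda t} + 16e^{-9\lambda t}} \qquad (8)$$
It is important to notice that $\lambda_{FCS} = \lambda_{FCS}(t)$, so, unlike the simplex case, the failure rate of the fault-tolerant system is not constant but varies with time (Fig. 3). In the beginning the failure rate grows rapidly, and afterwards it tends asymptotically to the value $4\lambda$.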
FIG.3. The normalized failure rate of the FCS
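The behavior of the normalized failure rate can be explored numerically. The sketch below evaluates $\lambda_{FCS}/\lambda$ as a function of the dimensionless time $x = \lambda t$, using the hazard-rate definition $-R'(t)/R(t)$ applied to the reliability polynomial; the names and sample points are assumptions made for illustration.

```python
import math

# Coefficient of e^(-k * x) in R_FCS, keyed by k, with x = lambda * t.
COEFFS = {4: 27, 5: -36, 6: -42, 7: 108, 8: -72, 9: 16}


def r_fcs(x: float) -> float:
    """FCS reliability as a function of x = lambda * t."""
    return sum(c * math.exp(-k * x) for k, c in COEFFS.items())


def normalized_hazard(x: float) -> float:
    """lambda_FCS / lambda = -(dR_FCS/dx) / R_FCS.

    For very large x the exponentials underflow to zero, so this simple
    form should only be used for moderate x.
    """
    minus_derivative = sum(k * c * math.exp(-k * x) for k, c in COEFFS.items())
    return minus_derivative / r_fcs(x)


# Sample the normalized failure rate at a few dimensionless times.
for x in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"lambda*t = {x:5.1f}   lambda_FCS/lambda = {normalized_hazard(x):.4f}")
```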
At this point some important conclusions can be formulated regarding the interpretation of the reliability specifications. Let us see which value of $\lambda$ should be used in the design of the FCCs for different interpretations of the specifications:
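1) If we consider that the specified “probability of a failure per hour” represents a failure rate of $\lambda_{FCS} = 10^{-3}$ h$^{-1}$, which is equivalent to $MTTF_{FCS} = 1000$ h, then according to (7) one FCC should be designed to have a failure rate of
$$\lambda = \frac{0.756}{MTTF_{FCS}} \approx 0.756\cdot 10^{-3}\ \mathrm{h}^{-1} \qquad (9)$$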
2) If we assume that the imposed parameter represents the asymptotic value of $\lambda_{FCS}(t)$, which is $4\lambda$, then the desired value for $\lambda$ will be
$$\lambda = \frac{10^{-3}}{4} = 0.25\cdot 10^{-3}\ \mathrm{h}^{-1} \qquad (10)$$
that is, about three times smaller than in case 1, which requires a more demanding design for the FCCs.
3) If we consider that the specified “probability of a critical failure” represents the probability of a failure (i.e. the unreliability) after 1 hour of flight, so that $R_{FCS}(1\ \mathrm{h}) = 1 - 10^{-3} = 0.999$, then after solving equation (6) for t = 1 h we obtain
$$\lambda \approx 8.4\cdot 10^{-2}\ \mathrm{h}^{-1} \qquad (11)$$
that is, a value more than two orders of magnitude larger than in the previous cases.
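The three interpretations can be compared numerically. The sketch below is a hedged illustration rather than the authors' procedure: the bisection solver, the function names and the search interval are assumptions, and the factor 953/1260 is the term-by-term integral of the reliability polynomial in units of $1/\lambda$.

```python
import math


def two_of_three(p: float) -> float:
    """Probability that at least 2 of 3 independent units work."""
    return 3 * p**2 - 2 * p**3


def r_fcs(lam: float, t: float) -> float:
    """FCS reliability for an FCC failure rate lam (per hour) at time t (hours)."""
    return two_of_three(two_of_three(math.exp(-lam * t)))


def solve_lambda(target: float, t: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Bisection on lambda; r_fcs decreases monotonically as lambda grows."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if r_fcs(mid, t) > target:
            lo = mid   # more reliable than required: lambda may be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


SPEC = 1e-3                 # specified value per hour assumed in Section IV
MTTF_FACTOR = 953 / 1260    # MTTF_FCS * lambda, about 0.756

lam_case1 = MTTF_FACTOR * SPEC              # spec read as 1 / MTTF_FCS
lam_case2 = SPEC / 4                        # spec read as asymptotic rate 4 * lambda
lam_case3 = solve_lambda(1 - SPEC, t=1.0)   # spec read as unreliability after 1 h

for name, lam in (("case 1", lam_case1), ("case 2", lam_case2), ("case 3", lam_case3)):
    print(f"{name}: lambda = {lam:.3e} per hour")
```

The spread between the cases shows how strongly the required FCC quality depends on the reading of a single specified number.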
Concentrating further on the number of flight hours as in case 3, for a simplex system the value of $\lambda$ required for R(1 h) = 0.999 is (from equation (4)) $\lambda = 1000.5\cdot 10^{-6}$ h$^{-1}$. Assuming this failure rate for an FCC, Table 1 gives the comparative unreliability values for the simplex and the fault-tolerant flight control system for different flight hours (mission times). We can observe the substantial decrease in the probability of a critical failure for the fault-tolerant case compared with the simplex case of using only one computer. After one hour this probability is approximately 37 million times larger for the simplex system: for a small FCC unreliability $q = 1 - R$, equations (1) and (2) give $1 - R_{FCS} \approx 27q^4$, so the ratio is about $1/(27q^3)$ with $q = 10^{-3}$. It is true that the difference decreases with time, but the value remains approximately 40000 times larger even after 10 hours. This demonstrates the significant contribution of fault tolerance to reliability and implicitly to safety, and it becomes clear that the computation of the related probability measures should concentrate on the mission time (of the order of 10 hours in this example) and not on the whole time axis.
TABLE 1. The comparative study of unreliability
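The comparison of unreliability values can be recomputed from the formulas of this section. The sketch below is an assumption-laden illustration: the mission times are sample points chosen here, and the unreliability is computed directly from the FCC unreliability q, using the identity $1 - f(p) = f(1 - p)$ for $f(p) = 3p^2 - 2p^3$, which avoids subtracting nearly equal numbers.

```python
import math


def two_of_three(p: float) -> float:
    """3p^2 - 2p^3; satisfies 1 - f(p) = f(1 - p), so it also maps the
    unit unreliability to the unreliability of a 2-out-of-3 group."""
    return 3 * p**2 - 2 * p**3


LAMBDA = -math.log(0.999)   # per hour, from R(1 h) = 0.999 for one FCC


def simplex_unreliability(t: float) -> float:
    """1 - exp(-lambda * t), computed accurately for small arguments."""
    return -math.expm1(-LAMBDA * t)


def fcs_unreliability(t: float) -> float:
    """Unreliability of the two-level TMR system at time t (hours)."""
    return two_of_three(two_of_three(simplex_unreliability(t)))


MISSION_TIMES_H = (1, 2, 5, 10)   # assumed sample mission times, in hours

rows = [(t, simplex_unreliability(t), fcs_unreliability(t)) for t in MISSION_TIMES_H]

print(f"{'t [h]':>6} {'simplex':>12} {'fault tolerant':>16} {'ratio':>12}")
for t, q_simplex, q_fcs in rows:
    print(f"{t:>6} {q_simplex:>12.4e} {q_fcs:>16.4e} {q_simplex / q_fcs:>12.3e}")
```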
V. CONCLUSIONS
In this paper we gave a short description of design for reliability in safety-critical systems. When a reliability parameter is imposed in the specifications, it should be interpreted unambiguously and taken into account during the design. We investigated a hypothetical case study and showed that different interpretations can lead to widely differing results. We compared the fault-tolerant case with the non-tolerant one and concluded that such a solution brings important benefits for the reliability, and implicitly the safety, of the system.
REFERENCES
[1] Kopetz H., Twelve Principles for the Design of Safety-Critical Real-Time Systems (lecture notes), TU Vienna, 2004.
[2] Siewiorek D., Narasimhan P., Fault Tolerant Architectures for Space and Avionics, 49th IFIP Meeting, 2006.
[3] Yang C., Zou Y., Lai P., Jiang N., Data mining-based methods for fault isolation with validated FMEA model ranking, Applied Intelligence, 2015.
[4] Forsberg K. et al., Elaboration of Safety Requirements, 32nd Avionics Systems Conference, 2013.
[5] Dunn W. R., Practical Design of Safety-Critical Computer Systems, Reliability Press, 2002.
[6] Sghairi M. et al., Challenges in Building Fault-Tolerant Flight Control System for a Civil Aircraft, IJCS, 2008.
[7] Balas G. J., Flight Control Law Design. An Industry Perspective, European Journal of Control, Vol. 9, 2003, pp. 207-226.
[8] Aras A., European Aviation safety regulatory framework and Turkey: A critical analysis, University of Turkish Aeronautical Association, Faculty of Business Working Papers, October 2011.
[9] Brière D., Favre C., Traverse P., A family of fault-tolerant systems: electrical flight controls, from Airbus A320/330/340 to future military transport aircraft, Microprocessors and Microsystems, 1995.
[10] Yeh Y. C., Design Considerations in Boeing 777 Fly-By-Wire Computers, IEEE HASES, 1998.
[11] Johnson B. W., Design and analysis of fault-tolerant digital systems, Addison-Wesley, 1989.
