Dependability in Embedded Systems: A Survey of Fault Tolerance Methods and Software-Based Mitigation Techniques (2404.10509v1)

Published 16 Apr 2024 in eess.SY and cs.SY

Abstract: Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in embedded systems. The focus is on real-time embedded systems, considering their resource constraints and the increasing interconnectivity of computing systems in commercial and industrial applications. The survey covers various fault-tolerance methods, including hardware, software, and hybrid redundancy. Particular emphasis is given to software faults, acknowledging their significance as a leading cause of system failures. Moreover, the paper explores the challenges posed by soft errors in modern computing systems. The survey concludes by emphasizing the need for continued research and development in fault-tolerance methods, specifically in the context of real-time embedded systems, and highlights the potential for extending fault-tolerance approaches to diverse computing environments.
