
FastFlip: Compositional Error Injection Analysis (2403.13989v2)

Published 20 Mar 2024 in cs.SE

Abstract: Instruction-level error injection analyses aim to find instructions where errors often lead to unacceptable outcomes like Silent Data Corruptions (SDCs). These analyses require significant time, which is especially problematic if developers wish to regularly analyze software that evolves over time. We present FastFlip, a combination of empirical error injection and symbolic SDC propagation analyses that enables fast, compositional error injection analysis of evolving programs. FastFlip calculates how SDCs propagate across program sections and correctly accounts for unexpected side effects that can occur due to errors. Using FastFlip, we analyze five benchmarks, plus two modified versions of each benchmark. FastFlip speeds up the analysis of incrementally modified programs by $3.2\times$ (geomean). FastFlip selects a set of instructions to protect against SDCs that minimizes the runtime cost of protection while protecting against a developer-specified target fraction of all SDC-causing errors.
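The selection goal stated in the last sentence of the abstract (minimize the runtime cost of protection while covering a target fraction of SDC-causing errors) can be illustrated with a small knapsack-style sketch. This is not FastFlip's implementation. The `Instr` record, its `cost` and `sdc_errors` fields, the function `select_protection`, and the sample numbers are all hypothetical and chosen only for illustration. The sketch also assumes integer costs and that protecting an instruction covers all of its SDC-causing errors.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    cost: int        # hypothetical runtime cost of protecting this instruction
    sdc_errors: int  # hypothetical count of SDC-causing errors at this instruction


def select_protection(instrs, target):
    """Return (total_cost, names) for a cheapest set of instructions whose
    protection covers at least `target` (0..1) of all SDC-causing errors."""
    needed = target * sum(i.sdc_errors for i in instrs)
    max_cost = sum(i.cost for i in instrs)
    # best[c] = (errors covered, chosen names) for the best subset with cost <= c
    best = [(0, ())] * (max_cost + 1)
    for ins in instrs:
        # Iterate budgets downward so each instruction is used at most once.
        for c in range(max_cost, ins.cost - 1, -1):
            covered = best[c - ins.cost][0] + ins.sdc_errors
            if covered > best[c][0]:
                best[c] = (covered, best[c - ins.cost][1] + (ins.name,))
    # The smallest budget that reaches the target is the minimum cost.
    for c, (covered, names) in enumerate(best):
        if covered >= needed - 1e-9:
            return c, sorted(names)
    raise ValueError("target must be between 0 and 1")


# Illustrative data only; not taken from the paper.
example = [
    Instr("a", cost=4, sdc_errors=50),
    Instr("b", cost=1, sdc_errors=20),
    Instr("c", cost=2, sdc_errors=20),
    Instr("d", cost=5, sdc_errors=10),
]
print(select_protection(example, 0.9))
```

With this made-up data, a 90% target is met by protecting `a`, `b`, and `c`, while the expensive, low-value instruction `d` is left unprotected.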

