A Survey of Deep Learning Library Testing Methods (arXiv:2404.17871v2)
Abstract: In recent years, software systems powered by deep learning (DL) techniques have facilitated many aspects of people's lives. As the backbone of these DL systems, various DL libraries perform the underlying optimization and computation. However, like traditional software, DL libraries are not immune to bugs, which can pose serious threats to users' property and safety. Studying the characteristics of DL libraries, their associated bugs, and the corresponding testing methods is crucial for enhancing the security of DL systems and advancing the widespread application of DL technology. This paper provides an overview of testing research on various DL libraries, discusses the strengths and weaknesses of existing methods, and offers guidance and reference for applying DL libraries. It first introduces the workflow of DL underlying libraries and the characteristics of the three kinds of DL libraries involved: DL frameworks, DL compilers, and DL hardware libraries. It then defines DL underlying library bugs and testing. It further summarizes the existing testing methods and tools tailored to each kind of DL library and analyzes their effectiveness and limitations. Finally, it discusses the open challenges of DL library testing and outlines potential directions for future research.
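A recurring theme in the surveyed testing methods is metamorphic testing, which sidesteps the test-oracle problem by checking relations that must hold between related inputs rather than checking outputs against a ground truth. As a minimal illustrative sketch (not taken from the paper), the snippet below tests a toy softmax operator against its shift-invariance relation: softmax(x + c) must equal softmax(x) for any constant c. The operator and helper names here are hypothetical, chosen only for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)                      # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def check_shift_invariance(xs, c, tol=1e-9):
    """Metamorphic relation: softmax(x + c) == softmax(x), up to tolerance."""
    a = softmax(xs)
    b = softmax([x + c for x in xs])
    return all(abs(u - v) <= tol for u, v in zip(a, b))

# A violation of this relation would flag a precision or logic bug in the
# operator under test, without needing a reference ("correct") output.
print(check_shift_invariance([0.5, 1.0, -2.0], 10.0))  # → True
```

Differential testing, another technique prominent in the surveyed work, follows the same oracle-free idea but compares two independent implementations (e.g., the same model run on different backends) instead of two related inputs.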
Authors: Xiaoyu Zhang, Weipeng Jiang, Chao Shen, Qi Li, Qian Wang, Chenhao Lin, Xiaohong Guan