Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study (2401.03069v4)
Abstract: Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, an essential step toward their resolution. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research. Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that can improve the reproducibility of deep learning bugs. Method: First, we constructed a dataset of 668 deep learning bugs from Stack Overflow and GitHub, spanning three frameworks and 22 architectures. Second, we selected 165 of these 668 bugs using stratified sampling and attempted to reproduce them, identifying along the way the edit actions and useful information involved in their reproduction. Third, we applied the Apriori algorithm to determine the useful information and edit actions required to reproduce specific types of bugs. Finally, we conducted a user study with 22 developers to assess the effectiveness of our findings in real-life settings. Results: We successfully reproduced 148 of the 165 bugs attempted. We identified ten edit actions and five types of component information that help reproduce deep learning bugs. Guided by our findings, developers reproduced 22.92% more bugs and reduced their reproduction time by 24.35%. Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve the reproducibility of deep learning bugs.
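The Method paragraph mentions applying the Apriori algorithm to connect bug types with the edit actions and information needed to reproduce them. The paper does not include code here, so the sketch below is only a minimal, self-contained illustration of Apriori-style frequent-itemset mining over hypothetical (bug type, edit action) transactions; every label, transaction, and the `MIN_SUPPORT` threshold are invented placeholders, not data or values from the study.

```python
# Minimal sketch of Apriori-style mining linking bug types to edit actions.
# All bug types, edit actions, and thresholds are hypothetical placeholders.
from collections import defaultdict

# Each "transaction" pairs a bug's type with the edit actions used to reproduce it.
transactions = [
    {"bug:training", "edit:set_random_seed", "edit:add_hyperparameters"},
    {"bug:training", "edit:set_random_seed", "edit:downscale_data"},
    {"bug:api", "edit:pin_library_version", "edit:add_imports"},
    {"bug:api", "edit:pin_library_version"},
    {"bug:model", "edit:add_hyperparameters", "edit:downscale_data"},
]

MIN_SUPPORT = 0.4  # fraction of transactions an itemset must appear in (assumed)

def frequent_itemsets(transactions, min_support):
    """Return all itemsets whose support meets min_support (classic Apriori loop)."""
    n = len(transactions)
    # Level 1 candidates: every individual item as a singleton set.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = defaultdict(int)
        for itemset in current:
            for t in transactions:
                if itemset <= t:
                    counts[itemset] += 1
        # Keep only candidates that meet the support threshold.
        survivors = {s: c / n for s, c in counts.items() if c / n >= min_support}
        frequent.update(survivors)
        # Join step: combine surviving k-itemsets into (k+1)-itemset candidates.
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == len(a) + 1}
    return frequent

for itemset, support in sorted(frequent_itemsets(transactions, MIN_SUPPORT).items(),
                               key=lambda kv: -kv[1]):
    print(f"support={support:.2f}  {set(itemset)}")
```

In practice, association rules (e.g., bug:training ⇒ edit:set_random_seed) would then be derived from these frequent itemsets by filtering on a confidence threshold, which is how the study's mapping from bug types to recommended edit actions could be obtained.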