Isolating Compiler Bugs by Generating Effective Witness Programs with Large Language Models (2307.00593v3)
Abstract: Compiler bugs pose a significant threat to safety-critical applications, and isolating these bugs promptly and effectively is crucial for assuring the quality of compilers. However, the limited debugging information available for reported bugs complicates the compiler bug isolation task. Existing compiler bug isolation approaches convert the problem into a test-program mutation problem, but they remain limited by ineffective mutation strategies or high human-effort requirements. Drawing inspiration from the recent progress of pre-trained Large Language Models (LLMs), such as ChatGPT, in code generation, we propose a new approach named LLM4CBI that uses LLMs to generate effective test programs for compiler bug isolation. Using LLMs directly for test-program mutation, however, may not yield the desired results because of the difficulty of formulating precise prompts and selecting specialized ones. To overcome these challenges, LLM4CBI introduces three new components. First, a program complexity-guided prompt production component leverages data- and control-flow analysis to identify the most valuable variables and locations in a program for mutation. Second, a memorized prompt selection component adopts reinforcement learning to continuously select specialized prompts for mutating test programs. Third, a test program validation component selects specialized feedback prompts so that the same mistakes are not repeated during the mutation process. In an evaluation on 120 real bugs from GCC and LLVM, LLM4CBI outperforms the state-of-the-art approaches DiWi and RecBi, isolating 69.70%/21.74% and 24.44%/8.92% more bugs, respectively, within the Top-1/Top-5 ranked results. We also show that the LLM component used in LLM4CBI can be easily replaced while still achieving reasonable results.
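The core loop the abstract describes can be pictured as a sequential decision problem: prompts that previously produced valid, still-bug-triggering mutants should be favored in later rounds, while occasional exploration keeps less-tried prompts in play. The sketch below illustrates that idea as a simple epsilon-greedy scheme in Python. It is a minimal sketch, not the authors' implementation: the abstract only states that reinforcement learning is used, so the bandit-style value update, the reward definition, the prompt pool, and all function names (`mutate_with_llm`, `is_valid`, `still_triggers_bug`) are hypothetical stand-ins.

```python
import random

# Hypothetical pool of specialized mutation prompts; LLM4CBI derives its
# prompts from program complexity analysis (data and control flow), which
# is stubbed out here.
PROMPTS = [
    "Increase the complexity of the hottest loop's bound expression.",
    "Introduce a pointer alias for the most data-dependent variable.",
    "Guard the suspicious statement with an opaque conditional.",
]

def mutate_with_llm(prompt: str, program: str) -> str:
    """Stand-in for an LLM call (e.g., ChatGPT) that rewrites the program."""
    return program + f"\n/* mutated via: {prompt} */"

def is_valid(program: str) -> bool:
    """Stand-in for validation: compiles cleanly, free of undefined behavior."""
    return random.random() > 0.3

def still_triggers_bug(program: str) -> bool:
    """Stand-in for the witness check: buggy and fixed compilers disagree."""
    return random.random() > 0.5

def select_prompt(q: dict, eps: float = 0.2) -> str:
    """Epsilon-greedy choice: explore sometimes, otherwise exploit memory."""
    if random.random() < eps:
        return random.choice(list(q))
    return max(q, key=q.get)

def generate_witnesses(seed_program: str, rounds: int = 50, alpha: float = 0.5):
    q = {p: 0.0 for p in PROMPTS}  # memorized value of each prompt
    witnesses = []
    for _ in range(rounds):
        prompt = select_prompt(q)
        mutant = mutate_with_llm(prompt, seed_program)
        # Reward prompts whose mutants pass validation and still expose the bug.
        reward = 1.0 if is_valid(mutant) and still_triggers_bug(mutant) else 0.0
        q[prompt] += alpha * (reward - q[prompt])  # incremental value update
        if reward:
            witnesses.append(mutant)
    return witnesses, q

if __name__ == "__main__":
    found, memory = generate_witnesses("int main() { return 0; }")
    print(f"{len(found)} witness programs; prompt values: {memory}")
```

In this toy version, the "memory" is just a running value estimate per prompt; the paper's third component would additionally feed failure information back into the next prompt, which is omitted here for brevity.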
Authors: Haoxin Tu, Zhide Zhou, He Jiang, Imam Nur Bani Yusuf, Yuxian Li, Lingxiao Jiang