LPR: Large Language Models-Aided Program Reduction (2312.13064v3)
Abstract: Program reduction is a prevalent technique for facilitating compiler debugging by automatically minimizing bug-triggering programs. Existing program reduction techniques are either generic across languages (e.g., Perses and Vulcan) or customized for a single language by exploiting language-specific features (e.g., C-Reduce). However, how to balance generality across multiple programming languages against specificity to individual languages in program reduction has yet to be explored. This paper proposes LPR, the first technique that uses LLMs to perform language-specific program reduction for multiple languages. The core insight is to combine language-generic, syntax-level program reduction (e.g., Perses) with language-specific, semantic-level program transformations learned by LLMs. LPR alternates between the two: language-generic program reducers efficiently reduce programs to 1-tree-minimality, which makes them small enough for LLMs to handle; LLMs then transform the programs using the learned semantics to expose new reduction opportunities for the language-generic reducers to exploit. Our extensive evaluation on 50 benchmarks across three languages (C, Rust, and JavaScript) highlights LPR's practicality and its superiority over Vulcan, the state-of-the-art language-generic program reducer. For effectiveness, LPR surpasses Vulcan by producing 24.93%, 4.47%, and 11.71% smaller programs on benchmarks in C, Rust, and JavaScript, respectively. Moreover, LPR and Vulcan have demonstrated their potential to complement each other: by running Vulcan on LPR's output for C programs, we achieve program sizes comparable to those reduced by C-Reduce. For efficiency, LPR takes 10.77%, 34.88%, and 36.96% less time than Vulcan to finish all benchmarks in C, Rust, and JavaScript, respectively.
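The alternation described in the abstract can be illustrated with a toy sketch. This is not LPR's implementation: `generic_reduce` is a naive line-deletion loop standing in for a language-generic reducer such as Perses, `propose_inlining` is a hand-written rewrite standing in for an LLM-proposed transformation, and program size is measured in lines purely for simplicity. All names here are hypothetical.

```python
from typing import Callable, Iterable, List

Program = List[str]                 # a program as a list of source lines
Oracle = Callable[[Program], bool]  # True if the program still triggers the bug


def still_fails(program: Program) -> bool:
    """Toy oracle: the 'bug' is a ZeroDivisionError raised when the program runs."""
    try:
        exec("\n".join(program), {})
    except ZeroDivisionError:
        return True
    except Exception:
        return False  # broken in some other way (e.g. NameError): not our bug
    return False


def generic_reduce(program: Program, oracle: Oracle) -> Program:
    """Stand-in for a language-generic reducer: delete single lines until no
    single-line deletion preserves the bug (a line-level form of 1-minimality)."""
    changed = True
    while changed:
        changed = False
        for i in range(len(program)):
            candidate = program[:i] + program[i + 1:]
            if oracle(candidate):
                program, changed = candidate, True
                break
    return program


def propose_inlining(program: Program) -> Iterable[Program]:
    """Stand-in for LLM proposals (assumption: a real system would query a model).
    Here we hard-code one semantic rewrite: inline the call f() as its value 1."""
    yield [line if line.startswith("def ") else line.replace("f()", "1")
           for line in program]


def alternate(program: Program, oracle: Oracle,
              propose: Callable[[Program], Iterable[Program]]) -> Program:
    """Alternate generic reduction and transformations until neither shrinks the program."""
    best = generic_reduce(program, oracle)
    progress = True
    while progress:
        progress = False
        for candidate in propose(best):
            if not oracle(candidate):
                continue  # the transformation lost the bug; discard it
            reduced = generic_reduce(candidate, oracle)
            if len(reduced) < len(best):
                best, progress = reduced, True
                break
    return best


original = ["def f(): return 1", "x = 0", "print(f() / 0)"]
result = alternate(original, still_fails, propose_inlining)
print(result)  # ['print(1 / 0)']
```

In this toy run, line deletion alone stalls at two lines because the definition of `f` cannot be removed while it is still called; the inlining rewrite removes that dependency, and the next deletion pass drops the definition.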
- 2023a. OpenAI API. Retrieved 2023-11-20 from https://platform.openai.com/docs/overview
- 2023b. OpenAI API: N. Retrieved 2023-11-20 from https://platform.openai.com/docs/api-reference/chat/create#chat-create-n
- 2023c. OpenAI API: Temperature. Retrieved 2023-11-20 from https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature
- Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association. 77–86.
- Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis. 423–435.
- Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. arXiv preprint arXiv:2304.02014 (2023).
- Alastair Donaldson and David MacIver. 2021. Test Case Reduction: Beyond Bugs. Retrieved May 29, 2023 from https://blog.sigplan.org/2021/05/25/test-case-reduction-beyond-bugs
- Test-case reduction and deduplication almost for free with transformation-based compiler testing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 1017–1032. https://doi.org/10.1145/3453483.3454092
- Automated Repair of Programs from Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1469–1481. https://doi.org/10.1109/ICSE48619.2023.00128
- An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211 (2013).
- Qiuhan Gu. 2023. LLM-Based Code Generation Method for Golang Compiler Testing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.). ACM, 2201–2203. https://doi.org/10.1145/3611643.3617850
- An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE Computer Society, 1162–1174.
- Jigsaw: Large Language Models meet Program Synthesis. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 1219–1231. https://doi.org/10.1145/3510003.3510203
- Christian Gram Kalhauge and Jens Palsberg. 2019. Binary reduction of dependency graphs. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 556–566. https://doi.org/10.1145/3338906.3338956
- Christian Gram Kalhauge and Jens Palsberg. 2021. Logical bytecode reduction. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 1003–1016. https://doi.org/10.1145/3453483.3454091
- Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
- Compiler validation via equivalence modulo inputs. ACM Sigplan Notices 49, 6 (2014), 216–226. https://doi.org/10.1145/2594291.2594334
- Program Reconditioning: Avoiding Undefined Behaviour When Finding and Reducing Compiler Bugs. Proc. ACM Program. Lang. 7, PLDI, Article 180 (jun 2023), 25 pages. https://doi.org/10.1145/3591294
- Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023. IEEE, 14–26. https://doi.org/10.1109/ASE56229.2023.00089
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210 (2023).
- Random testing for C and C++ compilers with YARPGen. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–25.
- LLVM. 2000. LibTooling. https://clang.llvm.org/docs/LibTooling.html Accessed: 2023-04-30.
- Ghassan Misherghi and Zhendong Su. 2006. HDD: hierarchical delta debugging. In Proceedings of the 28th International Conference on Software Engineering. 142–151. https://doi.org/10.1145/1134285.1134307
- Aina Niemetz and Armin Biere. 2013. ddSMT: a delta debugger for the SMT-LIB v2 format. In Proceedings of the 11th International Workshop on Satisfiability Modulo Theories, SMT. 8–9.
- Test-case reduction for C compiler bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation. 335–346. https://doi.org/10.1145/2254064.2254104
- John Regehr et al. 2012. C-Reduce. Retrieved 2023-11-26 from https://github.com/csmith-project/creduce
- Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning. PMLR, 31210–31227.
- Finding compiler bugs via live code mutation. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications. 849–863. https://doi.org/10.1145/2983990.2984038
- Toward understanding compiler bugs in GCC and LLVM. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 294–305. https://doi.org/10.1145/2931037.2931074
- Perses: Syntax-guided program reduction. In Proceedings of the 40th International Conference on Software Engineering. 361–371. https://doi.org/10.1145/3180155.3180236
- SMT Solver Validation Empowered by Large Pre-Trained Language Models. In 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023. IEEE, 1288–1300. https://doi.org/10.1109/ASE56229.2023.00180
- Is ChatGPT the Ultimate Programming Assistant–How far is it? arXiv preprint arXiv:2304.11938 (2023).
- On the Caching Schemes to Speed Up Program Reduction. ACM Trans. Softw. Eng. Methodol. 33, 1, Article 17 (nov 2023), 30 pages. https://doi.org/10.1145/3617172
- Probabilistic Delta debugging. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 881–892. https://doi.org/10.1145/3468264.3468625
- FuzzJIT: Oracle-Enhanced Fuzzing for JavaScript Engine JIT Compiler. In USENIX Security Symposium. USENIX.
- Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.). ACM, 172–184. https://doi.org/10.1145/3611643.3616271
- How Effective Are Neural Networks for Fixing Security Vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (Seattle, WA, USA) (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 1282–1294. https://doi.org/10.1145/3597926.3598135
- Revisiting the Plastic Surgery Hypothesis via Large Language Models. arXiv preprint arXiv:2303.10494 (2023).
- Automated program repair in the era of large pre-trained language models. In Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery.
- Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. arXiv preprint arXiv:2304.00385 (2023).
- Pushing the Limit of 1-Minimality of Language-Agnostic Program Reduction. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 636–664. https://doi.org/10.1145/3586049
- Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 283–294.
- Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering 28, 2 (2002), 183–200. https://doi.org/10.1109/32.988498
- PPR: Pairwise Program Reduction. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 338–349.
- Li Zhong and Zilong Wang. 2023. A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.10335 (2023).