Syntactic Robustness for LLM-based Code Generation (2404.01535v1)
Abstract: Rapid advances in the field of Large Language Models (LLMs) have made LLM-based code generation an important area of investigation. An LLM-based code generator takes a prompt as input and produces code that implements the requirements specified in the prompt. Many software requirements include mathematical formulas that specify the expected behavior of the code to be generated. Given a code generation prompt that includes a mathematical formula, a reasonable expectation is that, if the formula is modified syntactically without changing its semantics, the code generated for the modified prompt should be semantically equivalent to the code generated for the original prompt. We formalize this concept as syntactic robustness and investigate the syntactic robustness of GPT-3.5-Turbo and GPT-4 as code generators. To test syntactic robustness, we generate syntactically different but semantically equivalent versions of prompts using a set of mutators that modify only the mathematical formulas in the prompts. In this paper, we focus on prompts that ask for code that solves an equation for its variables, given the equation's coefficients as input. Our experimental evaluation demonstrates that GPT-3.5-Turbo and GPT-4 are not syntactically robust for this type of prompt. To improve syntactic robustness, we define a set of reductions that transform formulas into a simplified form and apply these reductions as a pre-processing step. Our experimental results indicate that the syntactic robustness of LLM-based code generation can be improved using our approach.
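The pipeline sketched in the abstract, mutate a prompt's formula syntactically, confirm the mutation preserves semantics, and reduce the formula to a simplified form before prompting, can be illustrated with a small SymPy example. This is a minimal sketch of the idea, not the authors' tool: the specific mutator, reduction, equivalence check, and `solve(a, b, c)` prompt wording below are illustrative assumptions for a single quadratic equation.

```python
import sympy as sp

# Symbols: x is the unknown; a, b, c are the coefficients given as input.
x, a, b, c = sp.symbols("x a b c")

# Formula as it might appear in the original prompt:
#   "write code that, given a, b, c, returns x such that a*x**2 + b*x + c = 0"
original = sp.Eq(a * x**2 + b * x + c, 0)

# One possible syntactic mutation: factor part of the polynomial and move a
# term across the equality. The equation looks different but is equivalent.
mutated = sp.Eq(x * (a * x + b), -c)

def rearranged_form(eq: sp.Eq) -> sp.Expr:
    """Move everything to the left-hand side and expand; used here as a
    stand-in for the paper's reductions to a simplified form."""
    return sp.expand(eq.lhs - eq.rhs)

def semantically_equivalent(eq1: sp.Eq, eq2: sp.Eq) -> bool:
    """Sufficient check: both equations rearrange to the same polynomial."""
    return sp.simplify(rearranged_form(eq1) - rearranged_form(eq2)) == 0

assert semantically_equivalent(original, mutated)

# Pre-processing step before prompting: replace the mutated formula with its
# reduced form, then embed it in the code-generation prompt.
reduced = sp.Eq(rearranged_form(mutated), 0)
prompt = (
    "Write a Python function solve(a, b, c) that returns the real values of x "
    f"satisfying {sp.sstr(reduced.lhs)} = 0."
)
print(prompt)
```

In this sketch, syntactic robustness would mean that prompting the model with the `original` and `mutated` formulas yields semantically equivalent code; the reduction normalizes both to the same simplified form so the model always sees one canonical presentation.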
Authors: Laboni Sarker, Mara Downing, Achintya Desai, Tevfik Bultan