Exploring the Robustness of Large Language Models for Solving Programming Problems (2306.14583v1)

Published 26 Jun 2023 in cs.CL, cs.AI, and cs.SE

Abstract: Using LLMs for source code has recently gained attention. LLMs such as Transformer-based models like Codex and ChatGPT have been shown to be highly capable of solving a wide range of programming problems. However, the extent to which LLMs actually understand problem descriptions and generate programs accordingly, rather than merely retrieving source code for the most similar problem in their training data based on superficial cues, remains unclear. To explore this research question, we conduct experiments on the robustness of several popular LLMs capable of code generation, the CodeGen and GPT-3.5 series models, on introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, which significantly affect code generation performance. Furthermore, we observe that Codex relies on variable names, as randomizing variables decreases the solved rate significantly. However, state-of-the-art (SOTA) models such as InstructGPT and ChatGPT show higher robustness to superficial modifications and an outstanding capability for solving programming problems. This highlights the fact that slight modifications to the prompts given to LLMs can greatly affect code generation performance, so careful formatting of prompts is essential for high-quality code generation, while the SOTA models are becoming more robust to perturbations.

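The kind of superficial perturbation studied here, for example randomizing variable names in a problem description while leaving the task itself unchanged, is simple to reproduce. Below is a minimal sketch in Python of such a renaming perturbation; the randomize_variables helper, the v_ prefix, and the example problem text are illustrative assumptions, not the authors' exact procedure.

```python
import random
import re
import string


def randomize_variables(description: str, variables: list[str], seed: int = 0) -> str:
    """Replace each named variable in a problem description with a random
    identifier, keeping everything else intact (a superficial perturbation)."""
    rng = random.Random(seed)
    perturbed = description
    for name in variables:
        # e.g. "a" -> "v_kqzx": same task semantics, different surface form.
        new_name = "v_" + "".join(rng.choices(string.ascii_lowercase, k=4))
        perturbed = re.sub(rf"\b{re.escape(name)}\b", new_name, perturbed)
    return perturbed


if __name__ == "__main__":
    original = "Read two integers a and b, then print the sum of a and b."
    print(randomize_variables(original, ["a", "b"]))
    # One possible output:
    # Read two integers v_xxxx and v_yyyy, then print the sum of v_xxxx and v_yyyy.
```

A robustness check along the lines of the paper would then compare a model's solved rate on the original and perturbed problem descriptions.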
Authors (7)
  1. Atsushi Shirafuji (6 papers)
  2. Yutaka Watanobe (10 papers)
  3. Takumi Ito (25 papers)
  4. Makoto Morishita (20 papers)
  5. Yuki Nakamura (16 papers)
  6. Yusuke Oda (15 papers)
  7. Jun Suzuki (86 papers)
Citations (12)