Training LLMs to Better Self-Debug and Explain Code (2405.18649v1)
Abstract: In the domain of code generation, self-debugging is crucial: it allows LLMs to refine their generated code based on execution feedback. This matters because producing a correct solution in a single attempt is difficult for complex tasks. Prior work on self-debugging mostly focuses on prompting methods that supply LLMs with few-shot examples, which perform poorly on small open-source LLMs. In this work, we propose a training framework that significantly improves the self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations of the wrong code, followed by code refinement, helps LLMs better analyze the wrong code and refine it. We therefore propose an automated pipeline that collects a high-quality dataset for code explanation and refinement by generating a number of explanation and refinement trajectories and filtering them via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories, with a novel reward design that accounts for code explanation and refinement quality. SFT improves pass@1 by up to 15.92% and pass@10 by up to 9.30% across four benchmarks. RL training brings additional improvements of up to 3.54% on pass@1 and 2.55% on pass@10. The trained LLMs show iterative refinement ability and can keep refining code continuously. Finally, our human evaluation shows that LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.
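The data-collection step described in the abstract, sampling many explanation-plus-refinement trajectories and keeping only those that pass execution verification, can be sketched compactly. The following is a minimal Python illustration under assumed names, not the authors' implementation: `sample`, `run_tests`, and `collect_refinement_data` are hypothetical, and the LLM call is left as a placeholder parameter.

```python
import os
import subprocess
import tempfile

def run_tests(code: str, tests: str, timeout: float = 5.0) -> bool:
    """Run candidate code together with its unit tests in a subprocess;
    a zero exit status means every assertion passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def collect_refinement_data(problem, wrong_code, feedback, tests, sample, n=20):
    """Sample n (explanation, refined code) trajectories and keep only those
    whose refinement passes the unit tests, i.e. execution verification.
    `sample` stands in for an LLM call returning an (explanation, code) pair."""
    verified = []
    for _ in range(n):
        explanation, refined = sample(problem, wrong_code, feedback)
        if run_tests(refined, tests):
            verified.append({
                "input": (problem, wrong_code, feedback),
                "explanation": explanation,
                "refinement": refined,
            })
    return verified  # verified trajectories become SFT targets
```

In this reading, the execution check acts as an automatic quality filter, so no human labeling is needed to build the explanation-and-refinement training set.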
Authors: Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya Binta Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, Anoop Deoras