CYCLE: Learning to Self-Refine the Code Generation (2403.18746v1)
Abstract: Pre-trained code language models (code LMs) have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their capacity for self-refinement is typically overlooked by existing evaluations of code LMs, which focus only on the accuracy of one-time prediction. When code LMs fail to implement a correct program, developers find it hard to debug and fix the faulty prediction, since it was not written by the developers themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the CYCLE framework, which learns to self-refine faulty generations according to available feedback, such as execution results reported by test suites. We evaluate CYCLE on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that CYCLE successfully maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters, and the experiments show that CYCLE consistently boosts code generation performance, by up to 63.5%, across benchmarks and model sizes. We also observe that CYCLE outperforms code LMs with 3$\times$ more parameters in self-refinement.
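The refinement loop the abstract refers to can be pictured as follows. This is a minimal sketch under assumed interfaces (a hypothetical `generate` wrapper around a code LM and a hypothetical `run_tests` harness), not CYCLE's actual training or inference code.

```python
# Minimal sketch of an execution-feedback self-refinement loop.
# `generate` and `run_tests` are hypothetical stand-ins, not CYCLE's API:
# `generate` wraps a code LM, and `run_tests` executes a candidate program
# against the test suite, returning (passed, feedback) where feedback is the
# execution report (e.g., failing assertions or tracebacks).

def self_refine(problem: str, generate, run_tests, max_rounds: int = 3) -> str:
    """Generate code once, then iteratively refine it using execution feedback."""
    code = generate(prompt=problem)                 # one-time prediction
    for _ in range(max_rounds):
        passed, feedback = run_tests(code)          # run the test suite
        if passed:
            return code                             # all tests pass; stop refining
        # Re-prompt the model with the problem, its faulty attempt, and the
        # execution feedback -- the signal a refinement-trained model exploits.
        code = generate(prompt=problem, prior_attempt=code, feedback=feedback)
    return code                                     # best effort after max_rounds
```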
Authors: Yangruibo Ding, Marcus J. Min, Gail Kaiser, Baishakhi Ray