Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions (2312.12450v6)

Published 11 Dec 2023 in cs.SE, cs.AI, cs.LG, and cs.PL

Abstract: A significant amount of research is focused on developing and evaluating LLMs for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.

Evaluating the Code Editing Capabilities of LLMs

The paper "Can It Edit? Evaluating the Ability of LLMs to Follow Code Editing Instructions" addresses a niche yet increasingly important area of study within AI-driven software engineering: code editing tasks performed by LLMs. Prior work has focused primarily on code synthesis, ranging from generating code from natural language to producing tests and code explanations. Recognizing a gap in the literature, this research pivots to investigate how well LLMs handle the specific directives involved in modifying pre-existing code.

This research introduces a custom benchmark, CanItEdit, designed for instructional code editing and used to rigorously evaluate the performance of a range of LLMs. In each task, a model receives a code segment paired with an instruction to add a feature, identify and fix a bug, or apply an alternative solution. Through this lens, the authors expose disparities between major LLMs, notably showing that the proprietary GPT-3.5-Turbo surpasses the leading publicly accessible models at code editing.
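To make the task format concrete, the sketch below shows one way such an instructional editing problem can be posed to a model: the original code and a natural-language instruction are combined into a prompt, and the model is expected to produce the edited code, which is then checked against hidden tests. The function, instruction, and prompt delimiters here are illustrative assumptions, not items taken from the CanItEdit benchmark.

```python
# Hypothetical instructional code-editing task (not drawn from the benchmark).
original_code = """\
def mean(values):
    return sum(values) / len(values)
"""

instruction = (
    "Handle the empty-list case by returning 0.0 instead of "
    "raising ZeroDivisionError."
)

expected_edit = """\
def mean(values):
    if not values:
        return 0.0
    return sum(values) / len(values)
"""

# One plausible prompt layout; the exact delimiters a given model expects may differ.
prompt = (
    f"## Code Before:\n{original_code}\n"
    f"## Instruction:\n{instruction}\n"
    f"## Code After:\n"
)
# The model under evaluation completes `prompt`; its output is accepted if it
# behaves like `expected_edit` on the task's test suite.
```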

The paper makes significant contributions through its establishment of the CanItEdit benchmark, consisting of 105 curated code editing problems spanning a variety of computer science domains, and a new metric, ExcessCode, which quantifies the surplus code that LLMs generate while producing otherwise correct edits. By also curating a permissively licensed dataset for training LLMs on these tasks, the authors narrow the performance gap between open and closed models. The fine-tuned models, referred to as EditCoder, demonstrate markedly enhanced code editing capabilities after targeted fine-tuning on this dataset.
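The paper's exact definition of ExcessCode is not reproduced here, but the intuition, penalizing a correct edit for adding more code than necessary, can be illustrated with a simple diff-based estimate. The sketch below is a rough assumption rather than the paper's formula: it compares the lines a model's edit adds against those added by a minimal reference edit.

```python
import difflib

def added_lines(before: str, after: str) -> int:
    """Count lines added when transforming `before` into `after`."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return sum(1 for line in diff if line.startswith("+") and not line.startswith("+++"))

def excess_code(original: str, model_edit: str, reference_edit: str) -> int:
    """Illustrative surplus measure: extra added lines relative to a minimal
    reference edit, floored at zero. Only meaningful when `model_edit` already
    passes the task's tests (i.e., the edit is functionally correct)."""
    return max(0, added_lines(original, model_edit) - added_lines(original, reference_edit))
```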

The empirical evaluation finds that closed models such as GPT-4 still lead overall, surpassing the newly fine-tuned open models in accuracy on the more complex edits in the CanItEdit benchmark. However, models fine-tuned on task-specific data, such as EditCoder, emerge as strong contenders, especially for structural code edits. These results suggest that targeted data and training methodology are pivotal in bolstering the code editing abilities of open code LLMs.

The implications of this paper are multifaceted:

  1. Benchmarking: It provides a replicable standard for future assessments of LLMs in code editing, allowing for more direct comparisons of model capabilities.
  2. Training Paradigms: Insights from the dataset and fine-tuning strategies suggest that models originally trained for code synthesis can attain substantial improvements in code editing through dedicated datasets and careful instruction tuning (a minimal sketch of such a training record follows this list).
  3. Open vs. Closed Models: The findings indicate that improvements in open models could eventually reduce reliance on proprietary systems, with broad implications for accessibility and further innovation in AI-driven coding tools.
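To illustrate the kind of training record such instruction tuning relies on, the snippet below sketches how a (code-before, instruction, code-after) triple could be packed into a prompt/completion pair for supervised fine-tuning. The field names and delimiters are assumptions for illustration; the paper's dataset format may differ, and in practice the training loss is typically computed only on the completion tokens.

```python
def to_training_example(before: str, instruction: str, after: str) -> dict:
    """Format one code-editing record as a prompt/completion pair for
    supervised instruction tuning. Delimiters and field names are illustrative."""
    prompt = (
        f"## Code Before:\n{before}\n"
        f"## Instruction:\n{instruction}\n"
        f"## Code After:\n"
    )
    return {"prompt": prompt, "completion": after}

# Example usage with the hypothetical record from earlier in this summary:
example = to_training_example(
    before="def mean(values):\n    return sum(values) / len(values)\n",
    instruction="Handle the empty-list case by returning 0.0.",
    after="def mean(values):\n    if not values:\n        return 0.0\n    return sum(values) / len(values)\n",
)
```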

As LLMs evolve, these findings motivate continued exploration of code editing with carefully curated training data, providing a springboard for more accessible, AI-assisted software development. The work highlights the potential of LLMs not only to create new code but also to enhance and refactor existing code bases, thereby contributing meaningfully to software development cycles across various domains. Future advancements could see such models integrated into IDEs, further streamlining the coding process and improving developer productivity.

Authors (11)
  1. Federico Cassano (16 papers)
  2. Luisa Li (2 papers)
  3. Akul Sethi (1 paper)
  4. Noah Shinn (4 papers)
  5. Abby Brennan-Jones (1 paper)
  6. Anton Lozhkov (7 papers)
  7. Carolyn Jane Anderson (15 papers)
  8. Arjun Guha (44 papers)
  9. Jacob Ginesin (6 papers)
  10. Edward Berman (10 papers)
  11. George Chakhnashvili (1 paper)
Citations (11)