Exploring the Capabilities of LLMs for Code Change Related Tasks (2407.02824v1)

Published 3 Jul 2024 in cs.SE

Abstract: Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code models and code-change-oriented models have been adapted to help developers with such tasks. Recently, LLMs have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using LLMs with more than 1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA perform comparably to state-of-the-art small pre-trained models. Larger models are not always better, but the Llama 2 and Code Llama families consistently perform best. The best LLMs outperform small pre-trained models on code changes that only modify comments and perform comparably on other code changes. We suggest that future work focus more on guiding LLMs to learn knowledge specific to code changes, rather than comment changes, for code-change-related tasks.
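As a rough illustration of the PEFT setup the study evaluates, the sketch below fine-tunes a causal code LLM with LoRA for commit message generation using the Hugging Face transformers and peft libraries. It is a minimal sketch, not the authors' exact pipeline: the checkpoint name, LoRA hyperparameters, and the diff-to-message prompt format are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumed setup, not the paper's exact configuration) of
# LoRA-based parameter-efficient fine-tuning for commit message generation.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "codellama/CodeLlama-7b-hf"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA freezes the base weights and learns low-rank updates to the attention
# projections; r, alpha, dropout, and target modules are illustrative values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train

# Hypothetical training example pairing a code diff with its commit message;
# training minimizes the standard causal-LM loss over such sequences.
example = (
    "Diff:\n- return a + b\n+ return a * b\n"
    "Commit message: Fix operator in multiply()"
)
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```

In a full run, many such diff/message pairs would be fed through a standard training loop, and only the LoRA adapter weights would be updated while the base model stays frozen.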

Authors (6)
  1. Lishui Fan (3 papers)
  2. Jiakun Liu (43 papers)
  3. Zhongxin Liu (23 papers)
  4. David Lo (229 papers)
  5. Xin Xia (171 papers)
  6. Shanping Li (17 papers)
Citations (4)