Code Needs Comments: Enhancing Code LLMs with Comment Augmentation (2402.13013v1)

Published 20 Feb 2024 in cs.CL

Abstract: Programming skill is a crucial ability for LLMs, requiring a deep understanding of programming languages (PLs) and how they correlate with natural languages (NLs). We examine the impact of pre-training data on the performance of code-focused LLMs, using comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a data augmentation method that generates comments for existing code, coupled with a filtering strategy that discards code data poorly correlated with natural language. Experiments on three code-focused LLMs show consistent performance improvements on two widely used programming-skill benchmarks. Notably, the model trained on the augmented data outperforms both the model used to generate the comments and the model further trained on the data without augmentation.
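The filtering strategy rests on scoring each code sample by how much natural-language commentary it carries. The abstract does not give the exact metric or cutoff, so the sketch below is a hypothetical, minimal version for Python sources: it tokenizes each sample, computes the fraction of comment tokens, and keeps samples above an assumed threshold. The names comment_density and filter_by_density and the 0.05 cutoff are illustrative assumptions, not taken from the paper.

# Hypothetical sketch: comment-density scoring and filtering for Python code.
# Comment density is used here as a proxy for PL-NL alignment; the exact
# metric and the 0.05 threshold are assumptions, not taken from the paper.
import io
import tokenize

def comment_density(source: str) -> float:
    """Return the fraction of non-structural tokens that are comments."""
    comment_tokens = 0
    total_tokens = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            # Skip purely structural tokens so the ratio reflects real content.
            if tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                            tokenize.DEDENT, tokenize.ENDMARKER):
                continue
            total_tokens += 1
            if tok.type == tokenize.COMMENT:
                comment_tokens += 1
    except (tokenize.TokenError, IndentationError):
        return 0.0  # unparseable snippets get the lowest score
    return comment_tokens / total_tokens if total_tokens else 0.0

def filter_by_density(samples, threshold=0.05):
    """Keep only samples whose comment density clears the assumed threshold."""
    return [s for s in samples if comment_density(s) >= threshold]

if __name__ == "__main__":
    commented = "# add two numbers\ndef add(a, b):\n    return a + b\n"
    bare = "def add(a, b):\n    return a + b\n"
    print(comment_density(commented))                 # roughly 0.08
    print(len(filter_by_density([commented, bare])))  # 1: the bare sample is dropped

In the same spirit, the augmentation step would run an existing code LLM over low-density samples to generate comments before re-training; the heuristic above only illustrates how poorly aligned code could be identified in the first place.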

Authors (11)
  1. Demin Song (11 papers)
  2. Honglin Guo (8 papers)
  3. Yunhua Zhou (27 papers)
  4. Shuhao Xing (3 papers)
  5. Yudong Wang (28 papers)
  6. Zifan Song (5 papers)
  7. Wenwei Zhang (77 papers)
  8. Qipeng Guo (72 papers)
  9. Hang Yan (86 papers)
  10. Xipeng Qiu (257 papers)
  11. Dahua Lin (336 papers)
Citations (2)