Code Needs Comments: Enhancing Code LLMs with Comment Augmentation (2402.13013v1)
Abstract: Programming skill is a crucial ability for LLMs, requiring a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on the performance of code-focused LLMs, using comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that removes code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent performance improvements on two widely used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating the comments and the model further trained on the data without augmentation.
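To make the filtering criterion concrete, below is a minimal sketch of comment-density-based filtering for Python code. It assumes "comment density" means the fraction of characters that belong to comments or docstrings; the density definition, the threshold value, and the helper names are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Illustrative sketch: estimate how much of a Python source sample is
# natural-language comment text, then keep only samples above a threshold.
# The density definition and the 0.02 threshold are assumptions for this
# example, not values taken from the paper.

COMMENT_RE = re.compile(r"#.*$", re.MULTILINE)
DOCSTRING_RE = re.compile(r'("""|\'\'\')(.*?)\1', re.DOTALL)


def comment_density(source: str) -> float:
    """Fraction of characters belonging to comments or docstrings."""
    if not source:
        return 0.0
    comment_chars = sum(len(m.group(0)) for m in COMMENT_RE.finditer(source))
    comment_chars += sum(len(m.group(0)) for m in DOCSTRING_RE.finditer(source))
    return comment_chars / len(source)


def filter_corpus(samples, threshold=0.02):
    """Drop code samples whose comment density falls below the threshold."""
    return [s for s in samples if comment_density(s) >= threshold]


if __name__ == "__main__":
    corpus = [
        "def add(a, b):\n    # return the sum of two numbers\n    return a + b\n",
        "def add(a, b):\n    return a + b\n",
    ]
    kept = filter_corpus(corpus)
    print(f"kept {len(kept)} of {len(corpus)} samples")
```

The comment-augmentation step described in the abstract would then target the filtered-out, comment-poor samples: an existing LLM generates comments for that code, and the augmented data is used for further training.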
Authors: Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin