WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning (2312.14187v5)

Published 20 Dec 2023 in cs.CL, cs.AI, and cs.SE

Abstract: Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs mainly focus on the traditional code generation task, resulting in poor performance in complex multi-task scenarios. In this paper, we concentrate on multiple code-related tasks and present WaveCoder, a series of Code LLMs trained with Widespread And Versatile Enhanced instruction data. To enable the models to tackle complex code-related tasks, we propose a method to stably generate diverse, high-quality instruction data from open source code datasets in multi-task scenarios and obtain CodeSeaXDataset, a dataset comprising 19,915 instruction instances across 4 code-related tasks, which is aimed at improving the generalization ability of Code LLMs. Our experiments demonstrate that WaveCoder models significantly outperform other open-source models in terms of generalization ability across different code-related tasks. Moreover, WaveCoder-Ultra-6.7B presents state-of-the-art generalization abilities on a wide range of code-related tasks.

An Insightful Overview of WaveCoder: Enhanced Instruction Tuning with Refined Data Generation

The recent publication "WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning" presents a significant contribution to the domain of Code LLMs. Authored by researchers from Microsoft, the paper targets two perennial issues in instruction tuning: data duplication and insufficient data quality control. By introducing an LLM-based Generator-Discriminator framework, the authors propose a methodology for generating high-quality, diverse instruction data specifically designed for code-related tasks. This work is crystallized in the development of the WaveCoder models and the CodeSeaXDataset.

Methodological Innovations

The authors position their approach as an extension of existing methodologies, implemented as a multi-step data generation process. The instruction data is first categorized into four high-level code-related tasks: Code Summarization, Code Generation, Code Translation, and Code Repair. This classification ensures coverage of diverse programming scenarios while maintaining the quality and specificity of the data.
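
As a concrete (if simplified) illustration of this taxonomy, one instruction instance per task might look like the following; the field names and examples are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of one instruction instance per task category.
# Field names ("task", "instruction", "input", "output") are assumptions,
# not the exact schema used in CodeSeaXDataset.
example_instances = [
    {
        "task": "code_summarization",
        "instruction": "Summarize what the following function does.",
        "input": "def add(a, b):\n    return a + b",
        "output": "Adds two numbers and returns the result.",
    },
    {
        "task": "code_generation",
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "output": "def reverse(s):\n    return s[::-1]",
    },
    {
        "task": "code_translation",
        "instruction": "Translate this Python function to JavaScript.",
        "input": "def add(a, b):\n    return a + b",
        "output": "function add(a, b) { return a + b; }",
    },
    {
        "task": "code_repair",
        "instruction": "Fix the bug in this function.",
        "input": "def add(a, b):\n    return a - b",
        "output": "def add(a, b):\n    return a + b",
    },
]
```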

Central to their innovation is the Generator-Discriminator framework. Here, an LLM-based generator produces instruction data, which is then scrutinized by a discriminator. This discriminator is not a conventional binary classifier; rather, it performs a step-by-step, rule-based examination to filter out low-quality data and retain high-quality data. This dual-faceted approach enhances data quality without relying too heavily on the capabilities of the teacher LLM.
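
A minimal sketch of this Generator-Discriminator loop is shown below, assuming access to a teacher LLM behind a generic `call_llm` function; the prompts, rule set, and pass/fail convention are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the Generator-Discriminator idea described above.
# `call_llm` stands in for any chat-completion API; the prompts and rules
# are illustrative assumptions, not the paper's exact setup.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a teacher LLM (e.g. GPT-4)."""
    raise NotImplementedError

def generate_instance(raw_code: str, task: str) -> str:
    prompt = (
        f"Task: {task}\n"
        f"Source code:\n{raw_code}\n"
        "Write one instruction, its input, and the expected output as JSON."
    )
    return call_llm(prompt)

def discriminate(instance: str) -> bool:
    """Step-by-step, rule-based check performed by a second LLM pass."""
    rules = [
        "Is the instruction consistent with the task definition?",
        "Is the output correct for the given input?",
        "Is the example free of irrelevant or duplicated content?",
    ]
    prompt = (
        "Check the candidate instance against each rule and answer "
        "PASS or FAIL overall.\n"
        + "\n".join(f"- {r}" for r in rules)
        + f"\nCandidate:\n{instance}"
    )
    return call_llm(prompt).strip().upper().startswith("PASS")

def build_dataset(raw_snippets, tasks):
    kept = []
    for code, task in zip(raw_snippets, tasks):
        candidate = generate_instance(code, task)
        if discriminate(candidate):   # keep only instances that pass all rules
            kept.append(candidate)
    return kept
```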

Dataset and Instruction Data

The paper introduces CodeSeaXDataset (referred to as CodeOcean in earlier versions of the paper), a dataset comprising 19,915 refined instruction instances spanning the aforementioned four tasks. The dataset is derived from the CodeSearchNet corpus using a series of manual filtering rules and the KCenterGreedy algorithm to maximize diversity. This selection and curation process results in a diverse, high-quality set of instruction data, which is crucial for effective instruction tuning.
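
The KCenterGreedy step can be understood as greedy farthest-point (core-set) selection over embeddings of the raw code. The sketch below shows the standard form of that algorithm; the embedding model and selection budget are left as assumptions.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedy core-set selection: repeatedly pick the point farthest from the
    already-selected set, maximizing coverage/diversity of the chosen subset.
    `embeddings` is an (n, d) array of code embeddings."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]              # random first center
    # distance of every point to its nearest selected center
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(np.argmax(dists))                # farthest point so far
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)           # update nearest-center distances
    return selected

# Example: pick 1,000 maximally diverse samples out of 100,000 embeddings.
# chosen = k_center_greedy(np.random.rand(100_000, 768), budget=1_000)
```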

Experimental Results

Experiments were performed using various base models, including StarCoder, CodeLlama, and DeepSeek-Coder, each fine-tuned on the CodeSeaXDataset. The resulting WaveCoder variants demonstrated superior generalization across multiple code-related tasks.
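
For orientation, a bare-bones supervised fine-tuning loop over such instruction data might look like the following; the base checkpoint, prompt template, and hyperparameters are placeholders and do not reflect WaveCoder's actual training recipe.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base checkpoint and toy data; WaveCoder's actual recipe
# (base models, prompt template, hyperparameters) is not reproduced here.
model_name = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.cuda().train()

train_data = [  # stand-in for the ~20k-instance instruction dataset
    {"instruction": "Write a Python function that adds two numbers.",
     "input": "",
     "output": "def add(a, b):\n    return a + b"},
]

def format_example(ex):
    # Simple instruction-following template (an assumption, not the paper's).
    return (f"### Instruction:\n{ex['instruction']}\n{ex['input']}\n"
            f"### Response:\n{ex['output']}")

def collate(batch):
    enc = tokenizer([format_example(ex) for ex in batch], padding=True,
                    truncation=True, max_length=2048, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_data, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for batch in loader:                            # one epoch of causal-LM fine-tuning
    batch = {k: v.cuda() for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```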

Key numerical results from the paper include:

  • WaveCoder models achieved strong pass@1 scores on the HumanEval code generation benchmark, significantly outperforming other open-source code models.
  • On the HumanEvalFix benchmark, which evaluates code repair, WaveCoder-SC-15B achieved an average pass@1 of 33.0, while WaveCoder-DS-6.7B closely approached GPT-4's performance.
  • For code summarization, evaluated on the HumanEvalExplain benchmark, WaveCoder models consistently outperformed other models such as WizardCoder and OctoCoder.

These results highlight the effectiveness of the refined instruction data in enhancing model performance across multiple dimensions.
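
For reference, pass@1 on HumanEval-style benchmarks is usually reported with the unbiased pass@k estimator of Chen et al. (2021); a minimal implementation is sketched below (the sampling setup and evaluation harness used in the paper are not reproduced).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = total generated samples per problem, c = samples that pass the tests.
    pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark score = mean of pass@k over all problems.
# Example: 200 samples for one problem, 37 pass -> pass_at_k(200, 37, 1) == 0.185
```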

Implications and Future Work

The practical implications of this research are substantial. By refining the data generation process and employing a multi-task framework, the resulting models exhibit strong generalization capabilities without a significant tradeoff in performance on specific tasks. This positions WaveCoder as a versatile tool for a wide range of code-related applications, from automated code documentation to bug fixing and code translation.

Theoretically, this research underscores the importance of data quality and diversity in instruction tuning. The use of a discriminator to iteratively refine data provides a new avenue for researchers aiming to optimize training datasets for specialized LLM applications.

Future developments could explore enhancing the interplay among different tasks, potentially increasing the dataset size and further improving the efficiency of instruction tuning. There's also scope for investigating the integration of more sophisticated filtering criteria and the automated adaptation of the discriminator to different types of datasets or tasks.

Conclusion

In summary, the research on WaveCoder and the CodeSeaXDataset provides new insights and tools for improving the efficacy and generalization of Code LLMs. The proposed LLM-based Generator-Discriminator framework marks a notable methodological advance in generating high-quality instruction data, directly benefiting the instruction tuning process. This work lays an important foundation for future research aimed at further enhancing the capabilities and applications of LLMs in coding and beyond.

Authors (8)
  1. Zhaojian Yu (5 papers)
  2. Xin Zhang (904 papers)
  3. Ning Shang (8 papers)
  4. Yangyu Huang (21 papers)
  5. Can Xu (98 papers)
  6. Yishujie Zhao (1 paper)
  7. Wenxiang Hu (10 papers)
  8. Qiufeng Yin (4 papers)