WizardCoder: Empowering Code Large Language Models with Evol-Instruct (2306.08568v1)

Published 14 Jun 2023 in cs.CL and cs.AI

Abstract: Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

Introduction

The landscape of Code Large Language Models (Code LLMs) has evolved rapidly with the introduction of pre-trained models that demonstrate proficiency in coding tasks. Open-source options like StarCoder have received significant acclaim. Yet most of these models have been trained on raw code data alone, without the benefits of instruction fine-tuning. Building on recent developments in general-domain instruction tuning and the Evol-Instruct method introduced by WizardLM, this paper presents WizardCoder, an enhancement of StarCoder that integrates complex instruction fine-tuning specific to coding tasks.

Related Work

In contextualizing WizardCoder, this research builds on two primary foundations: open-source Code LLMs pre-trained on extensive code datasets, and the methodology of instruction fine-tuning explored largely in general NLP tasks. Earlier work, such as OpenAI's InstructGPT, demonstrated the value of instructions provided by human annotators. Recent contributions like Alpaca and Vicuna further explored the potential of instruction fine-tuning, albeit in the general domain. WizardLM's Evol-Instruct method distinguished itself by evolving existing instruction data, signaling its potential for application in the code domain and leading to the inception of WizardCoder.

Approach

WizardCoder employs an adapted Evol-Instruct method designed to evolve code instructions from the Code Alpaca dataset, enabling fine-tuning of StarCoder on an evolved set of code instruction-following training data. The researchers introduce evolution operations unique to the programming domain, such as requiring code debugging and imposing time-space complexity constraints, so that each evolutionary prompt increases the difficulty of the programming task. The empirical success of WizardCoder across benchmarks is attributed to this code-specific approach to instruction fine-tuning; a sketch of the evolution loop appears below.
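
To make the data-evolution idea concrete, the following Python sketch shows how a seed instruction could be rewritten into harder variants over several rounds. It is an illustration of the procedure described above, not the paper's released code: the heuristic wordings are paraphrased from the operations the paper mentions (added constraints, debugging of erroneous code, complexity requirements), and generate stands in for whatever LLM completion call performs the rewrite.

    import random

    # Illustrative heuristics paraphrasing the code-specific evolution
    # operations described in the paper; the exact prompt wording is assumed.
    CODE_EVOLUTION_HEURISTICS = [
        "Add new constraints or requirements to the original problem.",
        "Require an explicit time or space complexity for the solution.",
        "Provide a piece of erroneous code as a reference and ask for debugging.",
        "Replace a common requirement with a less common, more specific one.",
    ]

    def evolve_instruction(instruction, generate):
        """Rewrite one code instruction into a harder variant using a random heuristic."""
        heuristic = random.choice(CODE_EVOLUTION_HEURISTICS)
        prompt = (
            "Please increase the difficulty of the given programming question "
            f"using the following method: {heuristic}\n\n"
            f"Original question:\n{instruction}\n\nRewritten question:"
        )
        return generate(prompt)  # `generate` is any LLM completion function (assumed)

    def evolve_dataset(seed_instructions, generate, rounds=3):
        """Run several evolution rounds over the seed set (Code Alpaca in the paper),
        pooling each round's outputs with the originals for fine-tuning."""
        pool = list(seed_instructions)
        current = list(seed_instructions)
        for _ in range(rounds):
            current = [evolve_instruction(x, generate) for x in current]
            pool.extend(current)
        return pool

In the paper, the evolved instruction data is then used to fine-tune StarCoder; the number of evolution rounds is the quantity examined in the ablation discussed below.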

Experimentation and Results

Experiments were conducted on four code generation benchmarks: HumanEval, HumanEval+, MBPP, and DS-1000. WizardCoder outperforms all other open-source Code LLMs on these benchmarks, including its base model, StarCoder. Notably, on HumanEval and HumanEval+ it surpasses even the largest closed-source LLMs, Anthropic's Claude and Google's Bard, a remarkable result for an open-source model of its size. The paper provides a detailed comparative analysis placing WizardCoder in the upper echelons of Code LLM performance, and an ablation study on the number of data evolution rounds offers further insight into the fine-tuning methodology. Scores on these benchmarks are reported as pass@1; a sketch of the standard estimator appears below.
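
For reference, HumanEval-style benchmarks are typically scored with the unbiased pass@k estimator introduced alongside HumanEval (pass@1 in this paper). A minimal sketch, assuming n samples are generated per problem and c of them pass the unit tests:

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased estimate of the probability that at least one of k samples,
        drawn from n generations of which c are correct, passes the unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # The benchmark score is the mean of pass_at_k over all problems; pass@1
    # with a single greedy sample reduces to the fraction of problems solved.
    print(pass_at_k(n=20, c=3, k=1))  # ~0.15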

Conclusion and Implications

The paper concludes with WizardCoder positioned as a state-of-the-art open-source model that advances code generation through instruction fine-tuning, successfully applying the Evol-Instruct method, previously proven in the general domain, to the specific challenges of coding tasks. Looking ahead, the researchers point to potential enhancements to WizardCoder and the need for continued improvement to meet and exceed the bar set by models like GPT-4. Reflecting on broader impact, the authors acknowledge ethical considerations paralleling those of other LLMs and emphasize the need for research toward responsible use and deployment.

References (39)
  1. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  2. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
  3. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
  4. Palm 2 technical report. CoRR, abs/2305.10403, 2023.
  5. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.
  6. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021.
  7. GLM-130B: an open bilingual pre-trained model. CoRR, abs/2210.02414, 2022.
  8. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
  9. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
  10. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
  11. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  12. Competition-level code generation with alphacode. CoRR, abs/2203.07814, 2022.
  13. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023.
  14. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. CoRR, abs/2303.17568, 2023.
  15. Incoder: A generative model for code infilling and synthesis. CoRR, abs/2204.05999, 2022.
  16. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
  17. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 8696–8708. Association for Computational Linguistics, 2021.
  18. Codet5+: Open code large language models for code understanding and generation. CoRR, abs/2305.07922, 2023.
  19. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
  20. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  21. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
  22. Ext5: Towards extreme multi-task scaling for transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  23. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  24. Zeroprompt: Scaling prompt-based pretraining to 1, 000 tasks improves zero-shot generalization. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 4235–4252. Association for Computational Linguistics, 2022.
  25. Unifiedqa: Crossing format boundaries with a single QA system. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1896–1907. Association for Computational Linguistics, 2020.
  26. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  27. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  28. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  29. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  30. Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
  31. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. CoRR, abs/2305.01210, 2023.
  32. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.
  33. DS-1000: A natural and reliable benchmark for data science code generation. CoRR, abs/2211.11501, 2022.
  34. Gpt-neox-20b: An open-source autoregressive language model. CoRR, abs/2204.06745, 2022.
  35. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
  36. Unifying language learning paradigms. CoRR, abs/2205.05131, 2022.
  37. Microsoft. Azure openai service models. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models, 2023.
  38. Llm humaneval benchmarks. https://github.com/my-other-github-account/llm-humaneval-benchmarks, 2023.
  39. Lamda: Language models for dialog applications. CoRR, abs/2201.08239, 2022.
Authors (10)
  1. Ziyang Luo (35 papers)
  2. Can Xu (98 papers)
  3. Pu Zhao (82 papers)
  4. Qingfeng Sun (40 papers)
  5. Xiubo Geng (36 papers)
  6. Wenxiang Hu (10 papers)
  7. Chongyang Tao (61 papers)
  8. Jing Ma (136 papers)
  9. Qingwei Lin (81 papers)
  10. Daxin Jiang (138 papers)
Citations (497)