- The paper presents Qwen2.5-Coder, a two-model series (1.5B and 7B parameters) specialized for code generation and code reasoning.
- It employs a three-stage training pipeline (file-level pretraining, repo-level pretraining, and instruction tuning) that delivers robust performance across key benchmarks.
- Results show that Qwen2.5-Coder outperforms similarly sized open-source models in code completion, code reasoning, and mathematical tasks.
Technical Summary of "Qwen2.5-Coder Technical Report"
The technical report introduces Qwen2.5-Coder, a major advancement in code-specific large language modeling aimed at addressing the growing demands of code generation and related tasks. The series is built on the Qwen2.5 architecture and comprises two models, Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B, both pretrained on an extensive corpus of over 5.5 trillion tokens. Careful data preparation, including rigorous cleaning and scalable synthetic data generation, underpins Qwen2.5-Coder's strong performance.
Model Architecture
The Qwen2.5-Coder models leverage the robust architecture of Qwen2.5, ensuring advanced capabilities in both code generation and general-purpose language tasks.
- 1.5B Model: This model comprises 28 layers with a hidden size of 1,536, utilizing 12 query heads and 2 key-value heads.
- 7B Model: With the same number of layers, this larger model features a hidden size of 3,584, utilizing 28 query heads and 4 key-value heads.
Both models share a vocabulary of 151,646 tokens and add special tokens for Fill-in-the-Middle (FIM) training, improving code understanding and generation.
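For quick reference, the reported hyperparameters can be restated as a small configuration sketch. The dictionary below only collects the figures quoted above; the key names are illustrative and do not mirror the official configuration files.

```python
# Architecture hyperparameters as reported for the two Qwen2.5-Coder models.
# Key names are illustrative and do not mirror the official config.json fields.
QWEN25_CODER_CONFIGS = {
    "1.5B": {
        "num_layers": 28,
        "hidden_size": 1536,
        "num_query_heads": 12,
        "num_key_value_heads": 2,   # grouped-query attention
        "vocab_size": 151646,
    },
    "7B": {
        "num_layers": 28,
        "hidden_size": 3584,
        "num_query_heads": 28,
        "num_key_value_heads": 4,   # grouped-query attention
        "vocab_size": 151646,
    },
}
# In both cases the per-head dimension works out to 128
# (hidden_size / num_query_heads).
```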
Data Preparation
The pretraining data for the Qwen2.5-Coder series includes a broad collection of source code, text-code grounding data, synthetic data, mathematical data, and general text data. This comprehensive approach ensures a well-rounded dataset, crucial for robust model training. The data cleaning process is marked by the use of weak classifiers and scorers, which meticulously filter out low-quality content, maintaining the dataset’s integrity.
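The report does not publish its filtering code, so the following is only a minimal sketch of score-based filtering with a weak quality classifier; `score_quality` and the threshold are placeholders, not the authors' actual scorer.

```python
# Minimal sketch of score-based corpus filtering with a weak quality classifier.
# `score_quality` stands in for whatever scorer is actually used; the threshold
# value is arbitrary and for illustration only.
from typing import Iterable, Iterator

def score_quality(document: str) -> float:
    """Placeholder quality scorer returning a value in [0, 1]."""
    # A real scorer might be a small classifier trained on labeled
    # high-/low-quality code; here we use a trivial length heuristic.
    return min(len(document.strip()) / 1000.0, 1.0)

def filter_corpus(documents: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Keep only documents whose quality score clears the threshold."""
    for doc in documents:
        if score_quality(doc) >= threshold:
            yield doc

kept = list(filter_corpus(["def add(a, b):\n    return a + b\n" * 50, "asdf"]))
```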
Training Pipeline
A three-stage training pipeline is adopted for Qwen2.5-Coder:
- File-Level Pretraining: Focuses on individual code files with sequence lengths up to 8,192 tokens.
- Repo-Level Pretraining: Extends the context length to 32,768 tokens by raising the RoPE base frequency, with YaRN applied to support extrapolation to even longer sequences.
- Instruction Tuning: Uses a diverse set of coding problems and solutions for fine-tuning, transitioning the models into effective coding assistants.
This pipeline ensures comprehensive coverage and adaptation to various code-related scenarios.
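To make the repo-level stage concrete, the sketch below assembles a single training sample by concatenating files from one repository with special separator tokens. The `<|repo_name|>` and `<|file_sep|>` token names follow the Qwen2.5-Coder materials, but the exact formatting details here are an assumption.

```python
# Sketch of repo-level sample construction: files from one repository are
# concatenated with special separator tokens so the model can learn
# cross-file dependencies. Exact whitespace/ordering may differ from the
# actual pipeline.
def build_repo_sample(repo_name: str, files: dict[str, str]) -> str:
    parts = [f"<|repo_name|>{repo_name}"]
    for path, content in files.items():
        parts.append(f"<|file_sep|>{path}\n{content}")
    return "\n".join(parts)

sample = build_repo_sample(
    "example/math-utils",
    {
        "utils/add.py": "def add(a, b):\n    return a + b\n",
        "main.py": "from utils.add import add\n\nprint(add(2, 3))\n",
    },
)
```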
Evaluation
Code Generation
Qwen2.5-Coder exhibits state-of-the-art performance across several benchmarks (a brief pass@k scoring sketch follows the list):
- HumanEval and MBPP: Both models outperform similarly sized open-source models in code generation tasks. The 7B model surpasses even larger models like the DS-Coder-33B.
- BigCodeBench-Complete: Demonstrates strong performance, further validating its generalization capabilities.
- Multi-Programming Language: Achieves impressive results on the MultiPL-E benchmark across various programming languages.
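Benchmarks such as HumanEval and MBPP are conventionally scored with the pass@k metric. The helper below implements the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021); the report's own evaluation harness is not reproduced here, so treat this as a reference implementation of the metric rather than the authors' exact setup.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a running product for stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 200 samples per problem, 37 correct -> pass@1 estimate of 0.185
print(pass_at_k(n=200, c=37, k=1))
```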
Code Completion
In tasks such as HumanEval Infilling, the Qwen2.5-Coder models outperform other models in their size category, reflecting the benefit of the FIM training strategy for code completion.
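As a usage illustration, FIM completion is driven by wrapping the surrounding code in prefix/suffix/middle marker tokens. The token names and the assumed Hugging Face model id below follow the publicly released Qwen2.5-Coder checkpoints, but the exact prompt layout should be checked against the official documentation; this is a sketch, not the report's evaluation code.

```python
# Sketch of fill-in-the-middle (FIM) prompting with a Qwen2.5-Coder checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prefix = "def fibonacci(n):\n    "
suffix = "\n    return result\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)  # the model's proposed middle section
```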
Code Reasoning
On benchmarks such as CRUXEval, which test reasoning about code execution, Qwen2.5-Coder performs strongly, surpassing comparable open-source models.
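CRUXEval items give a short Python function plus either an input (predict the output) or an output (recover an input). The toy probe below illustrates the flavor of such a task; it is not drawn from the benchmark itself.

```python
# Illustrative CRUXEval-style probe (not an actual benchmark item):
# the model is shown the function and the call, and must predict the output.
def f(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * x)
    return out

# Expected prediction for f([1, 2, 3, 4]): [4, 16]
assert f([1, 2, 3, 4]) == [4, 16]
```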
Mathematical Reasoning
Qwen2.5-Coder also excels in mathematical reasoning tasks, benefiting from the integration of mathematical data in pretraining, which bolsters its proficiency in both mathematical and coding domains.
General Natural Language Understanding
Qwen2.5-Coder maintains strong performance in general natural language understanding, as evidenced by its scores on MMLU, ARC-Challenge, TruthfulQA, WinoGrande, and HellaSwag. The balanced data mixture ensures the model retains general-purpose language capabilities while excelling in code-related tasks.
Long-Context Evaluation
The model's long-context capabilities, evaluated through tasks such as Needle in the Code, demonstrate its ability to retrieve and reason over information spread across large codebases, an important property for practical code assistants.
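The Needle in the Code setup inserts a small snippet into a long, repository-scale context and checks whether the model can recall it. The helper below is only an illustrative reconstruction of how such a synthetic probe might be assembled; the report's exact construction is not specified here.

```python
import random

def build_needle_probe(code_files: list[str], needle: str, seed: int = 0) -> str:
    """Insert a 'needle' snippet at a random position inside a long code context.

    Illustrative reconstruction of a needle-in-the-code style probe; the actual
    construction used in the report may differ.
    """
    rng = random.Random(seed)
    position = rng.randrange(len(code_files) + 1)
    context = code_files[:position] + [needle] + code_files[position:]
    return "\n\n".join(context)

needle = "def secret_checksum(data):\n    return sum(data) % 97\n"
haystack = [f"def util_{i}(x):\n    return x + {i}\n" for i in range(1000)]
prompt = build_needle_probe(haystack, needle) + \
    "\n\n# Question: reproduce the body of secret_checksum.\n"
```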
Instruction Models
The Qwen2.5-Coder-Instruct models further enhance these capabilities. They achieve superior performance on benchmarks such as HumanEval+ and LiveCodeBench, outperforming both similarly sized and larger models in code generation, reasoning, and editing tasks.
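For completeness, instruction-tuned checkpoints are typically driven through a chat template. The snippet below assumes the Hugging Face model id Qwen/Qwen2.5-Coder-7B-Instruct and the standard transformers chat API; it is a usage sketch, not part of the report.

```python
# Usage sketch for an instruction-tuned checkpoint via a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user",
     "content": "Write a Python function that checks whether a string is a palindrome."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```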
Future Implications
The advancements introduced by Qwen2.5-Coder suggest noteworthy implications for both theoretical research and practical applications. The open-source nature of these models encourages widespread adoption and further innovation in code intelligence research. Anticipated future developments include exploring the scalability of models concerning data and parameter sizes and enhancing their reasoning capabilities.
Conclusion
The introduction of Qwen2.5-Coder marks a significant milestone in the domain of code-specific LLMs. Through careful, large-scale data curation and rigorous evaluation, these models set a new standard for similarly sized open models across programming and reasoning tasks. The future trajectory of such models promises further gains in both the scale and the practical application of AI for code generation and related fields.