DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) LLM designed for code intelligence, aiming to close the performance gap with state-of-the-art closed-source models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro. The model is built upon the DeepSeek-V2 architecture and is further pre-trained on an additional 6 trillion tokens, bringing the total to 10.2 trillion tokens. This continued pre-training significantly boosts its capabilities in coding and mathematical reasoning while preserving general language performance.
The pre-training dataset for DeepSeek-Coder-V2 is a multi-source corpus composed of 60% source code, 10% math, and 30% natural language. The source code corpus comprises 1,170 billion tokens collected from GitHub and CommonCrawl, expanding coverage from 86 to 338 programming languages relative to the previous DeepSeek-Coder model (Guo et al., 25 Jan 2024). The math corpus includes 221 billion tokens from CommonCrawl. The natural language data is sampled from the DeepSeek-V2 training corpus (DeepSeek-AI, 7 May 2024).
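As a minimal sketch of how such a mixture can be realized, the snippet below samples training documents according to the stated 60/10/30 weights. The corpus names and sampler interface are illustrative, not the paper's actual data pipeline.

```python
import random

# Hypothetical sketch of weighted corpus sampling under the stated 60/10/30 mix.
MIX_WEIGHTS = {"source_code": 0.60, "math": 0.10, "natural_language": 0.30}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    corpora, weights = zip(*MIX_WEIGHTS.items())
    return rng.choices(corpora, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX_WEIGHTS}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
print(counts)  # roughly 6000 / 1000 / 3000 draws
```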
DeepSeek-Coder-V2 comes in two sizes built on the DeepSeekMoE framework (Dai et al., 11 Jan 2024): a Lite version with 16 billion total and 2.4 billion active parameters, and a larger version with 236 billion total and 21 billion active parameters. The MoE architecture enables efficient inference by activating only a subset of the parameters for each token.
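The following sketch illustrates the top-k expert routing that makes this sparse activation possible. It is not DeepSeekMoE's exact design (which additionally uses shared experts and fine-grained expert segmentation); the dimensions and the number of experts are purely illustrative.

```python
import numpy as np

# Minimal top-k MoE routing: only the selected experts' weights touch each token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

gate_w = rng.standard_normal((d_model, n_experts))                 # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token embedding through its top-k experts only."""
    logits = x @ gate_w
    chosen = np.argsort(logits)[-top_k:]                            # selected expert indices
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()   # renormalized gate weights
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)   # only 2 of the 8 expert matrices were used for this token
```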
The training strategy uses two objectives for the 16B Lite model: Next-Token-Prediction and Fill-In-the-Middle (FIM) in PSM (Prefix, Suffix, Middle) mode at a rate of 0.5; the 236B model uses only the Next-Token-Prediction objective. Training uses the AdamW optimizer (Loshchilov et al., 2017) with cosine learning-rate decay and warm-up steps. The context length is extended from 16K to 128K tokens using the YaRN method (Peng et al., 2023), via a two-stage training process with increasing sequence lengths. Evaluations with the Needle In A Haystack (NIAH) test confirm strong performance across the extended context window.
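A minimal sketch of PSM-style FIM preprocessing at a 0.5 rate is shown below. The sentinel strings are placeholders; the model's actual special tokens and splitting heuristics are assumptions here, not taken from the paper.

```python
import random

# Placeholder FIM sentinels; the real tokenizer uses its own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def maybe_fim(document: str, rng: random.Random, rate: float = 0.5) -> str:
    """With probability `rate`, rewrite a document into prefix-suffix-middle order."""
    if rng.random() >= rate or len(document) < 3:
        return document                       # plain next-token-prediction sample
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM: the model is conditioned on prefix and suffix, and learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(maybe_fim("def add(a, b):\n    return a + b\n", random.Random(42)))
```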
Alignment to human preferences and instruction following is achieved through a two-phase process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The SFT dataset combines code, math, and general instruction data. For RL, the Group Relative Policy Optimization (GRPO) algorithm (Shao et al., 5 Feb 2024; Dai et al., 11 Jan 2024) is employed. Preference data for RL includes compiler feedback and test cases for code, ground-truth labels for math, and general instruction data. A reward model is trained on the compiler feedback data to provide a more robust training signal than raw compiler output.
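The core idea of GRPO is to normalize rewards within a group of sampled completions instead of learning a value function. The sketch below shows only that group-relative advantage computation, with made-up reward values; the full clipped policy-gradient objective is omitted.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize per-completion rewards against their own group's statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. rewards for 4 completions of one coding prompt, scored by a reward model
rewards = np.array([0.2, 0.9, 0.4, 0.7])
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for above-average completions, negative otherwise
# These advantages then weight a PPO-style clipped policy update.
```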
DeepSeek-Coder-V2 demonstrates competitive performance across various benchmarks:
- Code Generation: The 236B Instruct model achieves 90.2% on HumanEval (Chen et al., 2021) and 76.2% on MBPP+ [evalplus], positioning it competitively with top closed-source models and setting a new state of the art among open-source models on MBPP+ (the pass@1 metric behind these scores is sketched after this list). It also performs strongly on multilingual HumanEval, LiveCodeBench (tying GPT-4o's overall score of 43.4%), and USACO. The 16B Lite Instruct model also performs well, often surpassing larger open-source counterparts.
- Code Completion: Evaluated on the December subset of RepoBench v1.1 (Liu et al., 2023) and on Single-Line Infilling tasks. The 16B Lite Base model, despite having only 2.4B active parameters, shows code-completion capability comparable to much larger models such as DeepSeek-Coder-Base 33B on Python and DeepSeek-Coder-Base 7B on Java. Its FIM training contributes to a high mean score (86.4%) on Single-Line Infilling, comparable to or better than other, larger models.
- Code Fixing: Tested on the Defects4J, SWE-Bench (Jimenez et al., 2023), and Aider benchmarks. The 236B Instruct model shows strong results, achieving 21.0% on Defects4J, 12.7% on SWE-Bench, and an impressive 73.7% on Aider, surpassing all other models tested on Aider.
- Code Understanding and Reasoning: Assessed using CRUXEval (Gu et al., 5 Jan 2024). The 236B Instruct model is the top open-source performer but shows a performance gap compared to the best closed-source models, potentially linked to its lower number of active parameters.
- Mathematical Reasoning: Evaluated on GSM8K [gsm8k], MATH (Hendrycks et al., 2021), AIME 2024 [AIME], and Math Odyssey [netmindmath] using zero-shot chain-of-thought prompting. The 236B Instruct model achieves 75.7% on MATH and 53.7% on Math Odyssey, comparable to GPT-4o, and solves more AIME 2024 problems than other tested models, highlighting strong mathematical capabilities.
- General Natural Language: Maintains strong general language performance, often outperforming DeepSeek-V2 on reasoning-heavy benchmarks like BBH (Suzgun et al., 2022) and Arena-Hard [arenahard2024], although it may trail slightly on knowledge-intensive tasks due to corpus differences.
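For reference, the HumanEval and MBPP+ figures above are pass@1 scores. The snippet below implements the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); the sample counts in the usage line are illustrative, not the paper's evaluation settings.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n total samples passes,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=1))   # 0.185, i.e. 18.5% pass@1
```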
DeepSeek-Coder-V2 is released publicly under a permissive license, supporting research and unrestricted commercial use. While the model achieves performance comparable to top closed-source models on many benchmarks, the paper notes a remaining gap in instruction-following on complex real-world programming tasks such as SWE-Bench, identifying this as a key area for future improvement.