DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (2406.11931v1)

Published 17 Jun 2024 in cs.SE, cs.AI, and cs.LG

Abstract: We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code LLM that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

DeepSeek-Coder-V2 is an open-source Mixture-of-Experts (MoE) LLM designed for code intelligence, aiming to bridge the performance gap with state-of-the-art closed-source models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro. The model is built upon the DeepSeek-V2 architecture and undergoes further pre-training on an additional 6 trillion tokens, totaling 10.2 trillion tokens. This continued pre-training significantly boosts its capabilities in coding and mathematical reasoning while preserving general language performance.

The pre-training dataset for DeepSeek-Coder-V2 is a multi-source corpus composed of 60% source code, 10% math, and 30% natural language. The source code corpus comprises 1,170 billion tokens collected from GitHub and CommonCrawl, expanding language coverage from 86 to 338 programming languages compared to the previous DeepSeek-Coder model (Guo et al., 25 Jan 2024). The math corpus includes 221 billion tokens from CommonCrawl. The natural language data is sampled from the DeepSeek-V2 training corpus (DeepSeek-AI, 7 May 2024).
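
The 60/10/30 split can be read as a sampling policy over three token streams. Below is a minimal, hypothetical sketch of how such a mixture might be enforced when drawing training sequences; only the ratios come from the paper, while the source names and the sampling loop are illustrative.

```python
import random

# Mixture ratios reported for the DeepSeek-Coder-V2 pre-training corpus:
# 60% source code, 10% math, 30% natural language.
MIXTURE = {"code": 0.60, "math": 0.10, "natural_language": 0.30}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training sequence is drawn from."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return name
    return "natural_language"  # guard against floating-point rounding

# Sanity check: empirical frequencies should track the target mixture.
rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print({name: round(c / 100_000, 3) for name, c in counts.items()})
```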

DeepSeek-Coder-V2 comes in two sizes based on the DeepSeekMoE framework (Dai et al., 11 Jan 2024): a 16 billion total-parameter version (Lite) with 2.4 billion active parameters and a 236 billion total-parameter version with 21 billion active parameters. The MoE architecture allows for efficient inference by activating only a subset of parameters per token.
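
The efficiency argument rests on sparse activation: a router selects a small number of experts per token, so only a fraction of the total parameters participate in each forward pass. The snippet below is a schematic top-k routing step under generic assumptions (toy dimensions, a plain softmax router, dense NumPy experts); it illustrates the idea rather than the DeepSeekMoE implementation.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Schematic MoE layer: route each token to its top_k experts and
    combine their outputs with renormalized gate weights.

    x:       (tokens, d_model) token representations
    gate_w:  (d_model, n_experts) router weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_w                                   # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]               # chosen experts
        weights = probs[t, top] / probs[t, top].sum()
        for w, e in zip(weights, top):
            out[t] += w * experts[e](x[t])                # only top_k experts run
    return out

# Toy usage: 4 tokens, 8-dim hidden state, 4 linear "experts".
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)) * 0.1: v @ W for _ in range(n_experts)]
x = rng.normal(size=(4, d))
gate_w = rng.normal(size=(d, n_experts))
print(moe_forward(x, gate_w, experts).shape)  # (4, 8)
```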

The training strategy involves two objectives for the 16B Lite model: Next-Token-Prediction and Fill-In-the-Middle (FIM) using the PSM (Prefix, Suffix, Middle) mode at a rate of 0.5. The 236B model uses only the Next-Token-Prediction objective. Training uses the AdamW optimizer (Loshchilov et al., 2017) with cosine learning-rate decay and warm-up steps. The context length is extended from 16K to 128K tokens with the YaRN method (Peng et al., 2023), via a two-stage training process with increasing sequence lengths. Evaluations using the Needle In A Haystack (NIAH) test confirm strong retrieval performance across the extended context window.
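
In PSM-mode FIM, a document is split into prefix, middle, and suffix spans and re-serialized as prefix, then suffix, then middle, so that predicting the final span teaches the model to infill code. A minimal sketch is shown below, assuming character-level cut points and placeholder sentinel strings; the real special tokens and the exact span-selection policy belong to the model's tokenizer and data pipeline.

```python
import random

# Placeholder sentinels; the actual tokenizer defines its own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def maybe_to_psm_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate (0.5 in the paper), rewrite a document into
    PSM order so the loss on the trailing middle span teaches infilling."""
    if rng.random() >= fim_rate:
        return doc  # left as a plain next-token-prediction example
    # Two random cut points split the document into prefix / middle / suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(maybe_to_psm_fim("def add(a, b):\n    return a + b\n", rng))
```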

Alignment to human preferences and instruction following is achieved through a two-phase process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The SFT dataset combines code, math, and general instruction data. For RL, the Group Relative Policy Optimization (GRPO) algorithm (Shao et al., 5 Feb 2024; Dai et al., 11 Jan 2024) is employed. Preference data for RL includes compiler feedback and test cases for code, ground-truth labels for math, and general instruction data. A reward model is trained on the compiler feedback data to provide a more robust training signal than raw compiler output.
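
GRPO dispenses with a learned value function by scoring a group of sampled completions for the same prompt and normalizing each reward against the group's statistics. The sketch below shows only that group-relative advantage step, assuming scalar rewards (e.g. from the reward model or test-case outcomes); the clipped policy-gradient update and KL regularization that sit on top are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each completion's reward is centered on its
    group's mean and scaled by the group's standard deviation, so no
    separate critic network is needed as a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions for one prompt, rewarded 1.0 if the unit tests pass.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # positive for passing samples, negative otherwise
```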

DeepSeek-Coder-V2 demonstrates competitive performance across various benchmarks:

  • Code Generation: The 236B Instruct model achieves a 90.2% score on HumanEval (Chen et al., 2021) and 76.2% on MBPP+ [evalplus] (pass@1-style rates; see the pass@k sketch after this list), positioning it competitively with top closed-source models and setting a new open-source state of the art on MBPP+. It also performs strongly on multilingual HumanEval, LiveCodeBench (matching GPT-4o's overall score of 43.4%), and USACO. The 16B Lite Instruct model also performs well, often surpassing larger open-source counterparts.
  • Code Completion: Evaluated on the December subset of RepoBench v1.1 (Liu et al., 2023) and Single-Line Infilling tasks. The 16B Lite Base model, despite having only 2.4B active parameters, shows code completion capabilities comparable to much larger models such as DeepSeek-Coder-Base 33B on Python and DeepSeek-Coder-Base 7B on Java. Its FIM training contributes to a high mean score (86.4%) on Single-Line Infilling, comparable to or better than other, larger models.
  • Code Fixing: Tested on the Defects4J, SWE-Bench (Jimenez et al., 2023), and Aider benchmarks. The 236B Instruct model shows strong results, achieving 21.0% on Defects4J, 12.7% on SWE-Bench, and 73.7% on Aider, surpassing all other models tested on Aider.
  • Code Understanding and Reasoning: Assessed using CRUXEval (Gu et al., 5 Jan 2024). The 236B Instruct model is the top open-source performer but shows a performance gap relative to the best closed-source models, potentially linked to its smaller number of active parameters.
  • Mathematical Reasoning: Evaluated on GSM8K [gsm8k], MATH (Hendrycks et al., 2021), AIME 2024 [AIME], and Math Odyssey [netmindmath] using zero-shot chain-of-thought prompting. The 236B Instruct model achieves 75.7% on MATH and 53.7% on Math Odyssey, comparable to GPT-4o, and solves more AIME 2024 problems than the other tested models, highlighting strong mathematical capabilities.
  • General Natural Language: Maintains strong general language performance, often outperforming DeepSeek-V2 on reasoning-heavy benchmarks such as BBH (Suzgun et al., 2022) and Arena-Hard [arenahard2024], although it may trail slightly on knowledge-intensive tasks due to corpus differences.
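
Most of the code-generation numbers above are pass@1-style rates. For reference, a small sketch of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021) is given below, where n completions are sampled per problem and c of them pass the unit tests; this illustrates how such scores are usually computed and is not taken from the paper's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), i.e. the
    probability that at least one of k samples drawn without replacement
    from n generations (c of which are correct) passes the tests."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 150 passing -> pass@1 estimate of 0.75.
print(round(pass_at_k(n=200, c=150, k=1), 3))
```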

DeepSeek-Coder-V2 is released publicly under a permissive license, supporting research and unrestricted commercial use. While achieving performance comparable to top closed-source models on many benchmarks, the paper notes a remaining gap in instruction-following for complex real-world programming tasks like SWE-Bench, identifying this as a key area for future improvement.

Authors (40)
  1. DeepSeek-AI
  2. Qihao Zhu
  3. Daya Guo
  4. Zhihong Shao
  5. Dejian Yang
  6. Peiyi Wang
  7. Runxin Xu
  8. Y. Wu
  9. Yukun Li
  10. Huazuo Gao
  11. Shirong Ma
  12. Wangding Zeng
  13. Xiao Bi
  14. Zihui Gu
  15. Hanwei Xu
  16. Damai Dai
  17. Kai Dong
  18. Liyue Zhang
  19. Yishi Piao
  20. Zhibin Gou
Citations (83)