
DeepSeek-Coder-6.7B: Open-Source Code Generation Model

Updated 20 December 2025
  • DeepSeek-Coder-6.7B is an open-source, decoder-only Transformer model built for code generation and intelligence, trained on a curated 2 trillion-token corpus across 87 programming languages.
  • It employs innovative data selection via instruction-following difficulty with k-means clustering and dynamic pack tokenization to enhance training efficiency and performance.
  • Evaluations on benchmarks like HumanEval and MBPP show competitive results, supported by a scalable architecture featuring 32 layers, 4096 hidden dimensions, and extended context handling.

DeepSeek-Coder-6.7B is an open-source, decoder-only Transformer model within the DeepSeek-Coder series, purpose-built for code generation and code intelligence. Developed and released under a permissive license by the DeepSeek-AI group, it is trained from scratch on a project-level curated corpus consisting of 2 trillion tokens sampled from over 600 million files across 87 programming languages. As of its release, DeepSeek-Coder-6.7B demonstrates state-of-the-art performance among open-source models in code synthesis, infilling, and code understanding tasks, while incorporating data- and compute-efficient fine-tuning techniques designed to close the gap with leading proprietary systems (Guo et al., 2024, Lv et al., 17 Apr 2025).

1. Model Architecture

DeepSeek-Coder-6.7B implements a decoder-only Transformer backbone (GPT-style) with the following core architectural parameters:

  • Number of layers: 32 decoder blocks
  • Hidden state dimension: 4096
  • Feed-forward (FFN) dimension: 11,008 at pretraining; 16,384 for base fine-tuned variant
  • Number of attention heads: 32, each with 128-dimensional projections
  • Activation: SwiGLU nonlinearity
  • Positional encoding: Rotary Position Embedding (RoPE), with base frequency extended for 16K-token context adaptation
  • Tokenization: 32,000-vocabulary SentencePiece/BPE
  • Context window: 4,096 tokens (fine-tuning), 16,384 tokens (pretraining/adaptation)
  • Attention implementation: FlashAttention v2 acceleration

No architectural modifications are introduced beyond standard Transformer blocks at the 6.7B scale. Grouped-query attention (GQA) is not utilized at this scale (present only in the 33B variant). The pretrained model is capable of both causal next-token prediction and fill-in-the-middle (FIM) code infilling (Guo et al., 2024, Lv et al., 17 Apr 2025).
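
The listed parameters can be checked directly against the released checkpoint. The following is a minimal sketch, assuming the Hugging Face model id deepseek-ai/deepseek-coder-6.7b-base and the transformers library; the printed field values are expected to match the list above, though the published config may differ in minor details (e.g., the exact vocabulary size).

```python
# Minimal sketch: inspect the published configuration and run greedy code
# completion. Assumes the Hugging Face checkpoint id
# "deepseek-ai/deepseek-coder-6.7b-base" and the `transformers` library.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-base"

cfg = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
print(cfg.num_hidden_layers)    # expected 32 decoder blocks
print(cfg.hidden_size)          # expected 4096
print(cfg.num_attention_heads)  # expected 32
print(cfg.vocab_size)           # roughly the 32,000-entry BPE vocabulary noted above

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = "# Python: return True if n is prime\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```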

2. Pretraining Data and Corpus Construction

The DeepSeek-Coder-6.7B training corpus comprises approximately 2 trillion tokens (≈800 GB post-filtering) spanning 87 programming languages, with the following composition:

  • 87%: source code, project-level deduplicated to treat repositories as atomic units
  • 10%: code-related English text (e.g., GitHub markdown, StackExchange)
  • 3%: Chinese general-domain text

Preprocessing steps include explicit dependency extraction (via “import”/“include”/“using”), repository-level n-gram deduplication, rule-based filtering (to exclude excessively long or malformed files), and strict test set decontamination (removal of samples sharing 10-gram overlap with HumanEval, MBPP, GSM8k, MATH). Tokenization is performed with a 32,000-level BPE model derived from a corpus subset. File paths are injected as comments at file boundaries to enhance model navigation across project structure (Guo et al., 2024).
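
As an illustration of the 10-gram decontamination rule, the sketch below drops a training sample if any of its 10-grams also appears in a benchmark reference solution. This is not the authors' released pipeline; the whitespace word tokenization and the toy data are simplifying assumptions.

```python
# Illustrative 10-gram decontamination check (simplified: whitespace word
# tokens; a production filter may normalize code differently).
def ngrams(text: str, n: int = 10) -> set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark_texts: list[str], n: int = 10) -> bool:
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(b, n) for b in benchmark_texts)

# Toy usage: the second corpus entry repeats a benchmark solution and is removed.
benchmark_solutions = [
    "def add ( a , b ) : result = a + b ; return result # reference solution text"
]
corpus = ["print('hello world')", benchmark_solutions[0]]
clean_corpus = [s for s in corpus if not is_contaminated(s, benchmark_solutions)]
print(len(corpus), "->", len(clean_corpus))
```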

3. Training Objectives and Optimization

The model is trained with a mixture of autoregressive objectives:

  • Next-Token Prediction (NTP): Standard left-to-right token prediction, using cross-entropy loss.
  • Fill-in-the-Middle (FIM): Applied at a 50% rate during pretraining; affected samples are randomly split into prefix, middle, and suffix segments and presented as "<fim_start>prefix<fim_hole>suffix<fim_end>", with the middle segment as the target (a construction sketch follows this list).
  • Long-context adaptation: Rotary frequency scaling and >1,000 further steps on 16K-token segments are used to ensure robust performance for extended contexts.
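
A minimal sketch of FIM sample construction is shown below. The <fim_*> sentinel strings mirror the notation above; the released tokenizer's actual special tokens and the paper's exact splitting policy are assumptions for illustration.

```python
# Sketch of fill-in-the-middle (FIM) sample construction under the assumptions
# stated above (sentinel strings and splitting policy are illustrative).
import random

def make_training_sample(code: str, fim_rate: float = 0.5) -> str:
    if random.random() >= fim_rate:
        return code  # plain next-token-prediction sample
    # choose two cut points splitting the document into prefix / middle / suffix
    i, j = sorted(random.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # the model sees prefix and suffix, and is trained to generate the middle
    return f"<fim_start>{prefix}<fim_hole>{suffix}<fim_end>{middle}"

print(make_training_sample("def area(r):\n    return 3.14159 * r * r\n"))
```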

For the data-efficient fine-tuning experiments, optimization uses the AdamW optimizer with a learning rate of 5e-5, weight decay of 0.01, cosine learning-rate decay, a global batch size of 256, and three epochs of fine-tuning; training runs on 4× NVIDIA A100-80 GB GPUs using PyTorch FSDP (Lv et al., 17 Apr 2025).
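
Expressed as Hugging Face TrainingArguments, the reported recipe might look like the sketch below. The per-device batch size and gradient-accumulation split are assumptions chosen to reach the stated global batch size of 256 on 4 GPUs, and the FSDP wrapping string is illustrative rather than the authors' launcher configuration.

```python
# Hedged sketch of the reported fine-tuning hyperparameters as Hugging Face
# TrainingArguments. The 8 x 4 GPUs x 8 accumulation split and the FSDP policy
# string are assumptions, not the authors' exact setup.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="deepseek-coder-6.7b-finetune",
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    per_device_train_batch_size=8,   # 8 per GPU x 4 GPUs x 8 accumulation = 256 global
    gradient_accumulation_steps=8,
    bf16=True,
    fsdp="full_shard auto_wrap",     # PyTorch FSDP sharding across the 4 A100s
    logging_steps=10,
)
```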

4. Data-Efficient Fine-Tuning Strategies

“Data-efficient LLM Fine-tuning for Code Generation” (Lv et al., 17 Apr 2025) introduces two key methodologies—data selection and dynamic pack tokenization—yielding improved performance and resource efficiency.

4.1 Data Selection via Instruction-Following Difficulty (IFD) and Clustering

To identify the most informative training examples, each instruction-code pair $x_i = (I_i, C_i)$ is scored by:

  • Model perplexity of the code alone: $\mathrm{PPL}(C_i)$
  • Model perplexity of the code given the instruction: $\mathrm{PPL}(C_i \mid I_i)$
  • Instruction-Following Difficulty: $\mathrm{IFD}(x_i) = \mathrm{PPL}(C_i \mid I_i) / \mathrm{PPL}(C_i)$

Instructions are embedded using Sentence-BERT and clustered with $k$-means ($k = 10$), enforcing proportional sampling within each cluster to preserve the dataset distribution. The top $m\%$ of high-IFD samples are selected in each cluster, where $m$ is a predefined sampling fraction (optimal at $m = 40\%$). This dual strategy ensures both high-complexity and distributional coverage in the fine-tuning subset (Lv et al., 17 Apr 2025).
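
The sketch below shows one way to compute IFD scores and cluster instructions. It is not the authors' released code; the all-MiniLM-L6-v2 encoder stands in for the paper's Sentence-BERT model, and scikit-learn's KMeans is used for clustering.

```python
# IFD scoring + k-means clustering sketch (assumptions: base checkpoint id,
# "all-MiniLM-L6-v2" as the sentence encoder, scikit-learn KMeans).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
lm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

@torch.no_grad()
def completion_ppl(instruction: str, code: str) -> float:
    """Perplexity of the code tokens, optionally conditioned on the instruction."""
    prompt_ids = tok(instruction, return_tensors="pt").input_ids if instruction else None
    code_ids = tok(code, return_tensors="pt", add_special_tokens=False).input_ids
    ids = code_ids if prompt_ids is None else torch.cat([prompt_ids, code_ids], dim=1)
    labels = ids.clone()
    if prompt_ids is not None:
        labels[:, : prompt_ids.shape[1]] = -100  # only code tokens contribute to the loss
    loss = lm(ids.to(lm.device), labels=labels.to(lm.device)).loss
    return math.exp(loss.item())

def ifd(instruction: str, code: str) -> float:
    # IFD(x_i) = PPL(C_i | I_i) / PPL(C_i)
    return completion_ppl(instruction, code) / completion_ppl("", code)

# Cluster instruction embeddings, then keep the top-m% IFD samples per cluster.
pairs = [("Reverse a string.", "def rev(s):\n    return s[::-1]\n"),
         ("Check whether a number is prime.", "def is_prime(n):\n    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))\n")]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode([i for i, _ in pairs])
clusters = KMeans(n_clusters=min(10, len(pairs)), n_init="auto").fit_predict(emb)
scores = [ifd(i, c) for i, c in pairs]
print(list(zip(clusters.tolist(), scores)))
```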

4.2 Dynamic Pack Tokenization

Instead of static padding to the longest sequence per batch, dynamic pack tokenization greedily concatenates short samples up to the context window $L_{\mathrm{max}}$, then pads only minimally within each "pack." Empirically, this reduces the padding-token fraction from 36.5% (static) to 15.2% (dynamic), a 58% relative reduction, conferring accelerated training ($1.38\times$ speedup per epoch) and a $1.44\times$ reduction in peak memory use (Lv et al., 17 Apr 2025).
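
A greedy packing sketch follows; it illustrates the idea rather than the paper's exact packer, and it assumes pre-tokenized id lists, pad id 0, and truncation of over-long samples. Packers of this kind typically also emit per-sample attention masks so that concatenated samples do not attend across pack boundaries.

```python
# Greedy dynamic-pack sketch: concatenate short samples until the context
# window L_max would overflow, then pad only the tail of each pack.
# Assumptions: pre-tokenized id lists, pad_id = 0, over-long samples truncated.
def dynamic_pack(samples: list[list[int]], l_max: int, pad_id: int = 0) -> list[list[int]]:
    packs: list[list[int]] = []
    current: list[int] = []
    for ids in sorted(samples, key=len, reverse=True):
        ids = ids[:l_max]                    # truncate anything longer than the window
        if len(current) + len(ids) > l_max:  # would overflow: close the open pack
            packs.append(current + [pad_id] * (l_max - len(current)))
            current = []
        current += ids
    if current:
        packs.append(current + [pad_id] * (l_max - len(current)))
    return packs

# Toy usage: three short samples packed into 8-token windows.
print(dynamic_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], l_max=8))
```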

5. Evaluation and Empirical Performance

DeepSeek-Coder-6.7B is benchmarked against open- and closed-source code models. Key results include the following (a pass@k estimator sketch follows the list):

  • HumanEval (zero-shot, pass@1): 44.7% (Base), 66.1% (Instruct; Alpaca-style fine-tuning)
  • MBPP (few-shot, pass@1): 60.6% (Base), 65.4% (Instruct)
  • Single-line infilling (Python/Java/JavaScript, mean): 78.1%
  • DS-1000 data science workflows (pass@1): 30.5%
  • Cross-file code completion (Python EM): 9.5% (Base), up to 16.1% (+retrieval)
  • LeetCode Contest (pass@1, with chain-of-thought prompt): 21.1%
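
Pass@1 figures of this kind are conventionally computed with the standard unbiased pass@k estimator, sketched below; the exact sampling settings behind the numbers above are those of the cited papers and are not reproduced here.

```python
# Standard unbiased pass@k estimator used for HumanEval/MBPP-style reporting:
# given n generated samples per problem with c of them passing the tests,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy usage: 20 samples per problem, 9 correct -> pass@1 = 0.45
print(pass_at_k(n=20, c=9, k=1))
```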

Ablation studies on OSS-Instruct (75K samples) reveal that targeted fine-tuning (40% subset, IFD+KMeans selection) yields superior average performance (66.9%) versus full dataset (66.1%), as well as speed and memory improvements. Random and single-metric selection (IFD-only, KMeans-only) are consistently outperformed by the combined approach (Guo et al., 2024, Lv et al., 17 Apr 2025).

Data Selection Strategy     HumanEval   MBPP
Random                      62.2%       75.7%
IFD-only                    62.8%       74.7%
KMeans-only                 62.2%       72.9%
IFD+KMeans (best)           68.3%       75.9%

Optimal performance is reached at a 40% sampling rate; inclusion of lower-IFD (noisier) data beyond this threshold does not improve results.

6. Theoretical Rationale and Generalization

High-IFD samples maximize the instruction-conditioned perplexity gap, providing greater gradient signal per token and concentrated learning on algorithmically challenging or nuanced prompts. Cluster-wise proportional sampling guards against distributional shift and overfitting to subdomains. The dynamic packing approach increases the number of effective (nonpad) tokens per forward pass, lowering both computational load (FLOPs) and GPU memory consumption. These techniques generalize to any decoder-only Transformer in the 6–7B parameter regime and similar code LLMs such as CodeLlama-7B and StarCoder-6.7B (Lv et al., 17 Apr 2025).

7. Licensing and Community Access

DeepSeek-Coder-6.7B is distributed under a permissive Apache 2.0/MIT-style license, explicitly permitting both academic and commercial use, redistribution, and derivative works without restrictive obligations. Model weights and code are available at https://github.com/deepseek-ai/DeepSeek-Coder (Guo et al., 2024).


DeepSeek-Coder-6.7B synthesizes advances in scalable Transformer pretraining, code-specific dataset curation, and principled fine-tuning, achieving competitive results in code intelligence while optimizing resource efficiency through data selection and tokenization strategies (Guo et al., 2024, Lv et al., 17 Apr 2025).

References

  • Guo et al. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence.
  • Lv et al. (17 Apr 2025). Data-efficient LLM Fine-tuning for Code Generation.
