DeepSeek-V2: Sparse MoE Language Model

Updated 1 January 2026
  • DeepSeek-V2 is a large-scale sparse Mixture-of-Experts language model featuring 236B total parameters with only 21B activated per token for efficiency.
  • It employs novel DeepSeekMoE layers and Multi-Head Latent Attention to drastically reduce memory footprint and training cost while maintaining state-of-the-art accuracy.
  • Empirical results show superior benchmark performance, support for contexts of up to 128k tokens, and significant gains in inference throughput and resource efficiency.

DeepSeek-V2 is a large-scale sparse Mixture-of-Experts (MoE) LLM engineered for state-of-the-art natural language understanding and generation, with an emphasis on computational efficiency, flexible context capacity, and cost-effective training and inference. With a total capacity of 236B parameters and only 21B activated per token, DeepSeek-V2 leverages two major architectural advances—DeepSeekMoE sparse Feed-Forward layers and Multi-Head Latent Attention (MLA)—to surpass contemporaneous open-source LLMs in accuracy and throughput, while drastically reducing training cost and memory footprint (DeepSeek-AI et al., 2024).

1. Model Scale, Sparsity, and Architectural Overview

DeepSeek-V2 comprises 236B total model parameters, of which approximately 9% (21B) are active per token. This high-sparsity configuration is realized by integrating MoE FFN sublayers (DeepSeekMoE) and deploying MLA for scalable and memory-efficient self-attention. Each transformer block combines:

  • Sparse Feed-Forward (DeepSeekMoE): Replaces dense FFNs with a dual-tier expert structure consisting of a small set of shared experts and a larger pool of routed experts. For each token at layer $\ell$, the sublayer output is

$$h_t' = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\cdot \mathrm{FFN}_i^{(r)}(u_t)$$

where the routing weights $\{g_{i,t}\}$ are determined by a softmax over token-expert affinities (a code sketch of this sublayer follows the overview below).

  • Latent Attention (MLA): MLA compresses standard multi-head attention's Key-Value (KV) cache from $2\,n_h\,d_h\,L$ elements per token (where $n_h$ is the number of heads, $d_h$ the head dimension, and $L$ the number of layers) to $(d_c + d_h^R)\,L$, with $d_c = 4\,d_h$ and $d_h^R = d_h/2$, resulting in a 93.3% KV cache reduction.

Overall, DeepSeek-V2 achieves efficient scaling in both model capacity and context length, supporting up to 128k context tokens via long-context adaptation.
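The DeepSeekMoE sublayer above can be written as a minimal PyTorch-style sketch. This is a naive per-expert dispatch for illustration, not the paper's implementation: module names are ours, and the plain two-layer SiLU experts stand in for whatever gated MLP the model actually uses.

```python
import torch
import torch.nn as nn

class DeepSeekMoELayer(nn.Module):
    """Sketch of the sparse FFN sublayer: N_s always-active shared experts plus
    top-K_r gated routed experts, implementing the h_t' equation above."""
    def __init__(self, d_model, d_expert, n_shared=2, n_routed=160, k_r=6):
        super().__init__()
        def make_expert():  # simplified two-layer SiLU MLP standing in for the real expert FFN
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                 nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model) * 0.02)  # expert centroids e_i
        self.k_r = k_r

    def forward(self, u):                                       # u: [n_tokens, d_model]
        y = u + sum(expert(u) for expert in self.shared)        # residual + shared experts
        scores = torch.softmax(u @ self.centroids.T, dim=-1)    # affinities s_{i,t}
        gates, idx = torch.topk(scores, self.k_r, dim=-1)       # g_{i,t} of the top-K_r experts
        routed_out = torch.zeros_like(u)
        for i, expert in enumerate(self.routed):                # naive per-expert dispatch
            tok, slot = (idx == i).nonzero(as_tuple=True)       # tokens that selected expert i
            if tok.numel() > 0:
                routed_out[tok] += gates[tok, slot].unsqueeze(-1) * expert(u[tok])
        return y + routed_out
```

Only the $K_r$ selected routed experts (plus the always-on shared experts) run for each token, which is what keeps the activated parameter count at roughly 21B of the 236B total.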

2. DeepSeekMoE and Sparse Computation Mechanisms

DeepSeekMoE implements a two-level expert system within each MoE layer: $N_s$ shared experts, which are always applied, and $N_r$ routed experts, of which $K_r$ are selected per token by a router gating mechanism. The router computes the token-to-expert affinity

$$s_{i,t} = \mathrm{Softmax}_i(u_t^\top e_i)$$

and activates the top-$K_r$ experts for input $u_t$.

To maintain balanced expert participation and prevent router collapse, three load-balancing losses are added:

  • Expert-level balance: $\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i \cdot P_i$
  • Device-level balance: $\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' \cdot P_i'$
  • Communication balance: $\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' \cdot P_i''$

This expert selection paradigm yields strong specialization and task-adaptive expressivity, with only a fraction of parameters actively processed per token.
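Assuming the usual DeepSeekMoE-style definitions, where $f_i$ is the rescaled fraction of tokens routed to expert $i$ and $P_i$ its mean affinity, the expert-level term can be sketched as follows; the coefficient value and function name are illustrative, and the device-level and communication terms follow the same pattern with per-device aggregation.

```python
import torch

def expert_balance_loss(scores, idx, n_routed, k_r, alpha_1=0.003):
    """L_ExpBal = alpha_1 * sum_i f_i * P_i  (alpha_1 here is illustrative).
    scores: softmax affinities s_{i,t}, shape [T, n_routed]
    idx:    indices of the K_r experts chosen per token, shape [T, k_r]"""
    T = scores.shape[0]
    selected = torch.zeros_like(scores).scatter_(1, idx, 1.0)  # 1 if expert i was picked for token t
    f = selected.sum(dim=0) * n_routed / (k_r * T)             # rescaled routing frequency f_i
    P = scores.mean(dim=0)                                     # mean affinity P_i
    return alpha_1 * (f * P).sum()
```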

3. Multi-Head Latent Attention and Memory Compression

MLA replaces standard multi-head attention's KV cache with joint latent projections:

  • Keys and values of attended tokens are jointly compressed into a latent vector $c^{KV}_t = W^{DKV} h_t$ of reduced dimension $d_c$.
  • At inference, only this compressed latent and a compact decoupled RoPE key are cached. The per-token cache is therefore $(d_c + d_h^R)\,L = 4.5\,d_h\,L$ elements, equivalent to GQA with only 2.25 groups, versus $2\,n_h\,d_h\,L$ for full MHA; for the 60-layer configuration the cached element count is as low as 4% of the original.

Consequently, DeepSeek-V2 realizes a 93.3% memory reduction in KV-cache usage over a dense baseline (DeepSeek-67B), alleviating bandwidth and latency bottlenecks during large-context inference and facilitating deployment on memory-constrained hardware.
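A simplified sketch of the latent KV path is given below. The decoupled RoPE key branch and query-side compression are omitted, and the layer names merely mirror the $W^{DKV}$ notation above.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of MLA-style KV compression: only c_t^{KV} (dimension d_c = 4*d_h)
    is cached per token and layer, instead of 2*n_h*d_h full key/value entries."""
    def __init__(self, d_model=5120, n_h=128, d_h=128):
        super().__init__()
        self.n_h, self.d_h = n_h, d_h
        d_c = 4 * d_h                                      # latent dimension (512 here)
        self.W_dkv = nn.Linear(d_model, d_c, bias=False)   # down-projection W^{DKV}
        self.W_uk = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection to per-head keys
        self.W_uv = nn.Linear(d_c, n_h * d_h, bias=False)  # up-projection to per-head values

    def forward(self, h):                                  # h: [batch, seq, d_model]
        c_kv = self.W_dkv(h)                               # [batch, seq, d_c] -- the only cached KV state
        k = self.W_uk(c_kv).reshape(*c_kv.shape[:-1], self.n_h, self.d_h)
        v = self.W_uv(c_kv).reshape(*c_kv.shape[:-1], self.n_h, self.d_h)
        return c_kv, k, v                                  # keys/values are reconstructed on the fly
```

The paper notes that the key/value up-projections can be absorbed into the query and output projections, so the reconstructed keys and values need not be materialized per cached token; only the $d_c + d_h^R$ numbers per token and layer are stored.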

4. Training Regimen and Alignment Pipeline

Pre-training uses a bilingual (English/Chinese), multi-source corpus of 8.1T tokens spanning web text, books, and code, with quality and safety filtering and a curated share of mathematical content. Key hyperparameters and settings include:

  • 60 transformer layers, hidden size 5120, 128 attention heads (head dimension 128), FFN width $4\times5120$.
  • MoE layers: 2 shared + 160 routed experts per layer, expert width 1536, with $K_r = 6$ active routed experts per token.
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1), peak learning rate $2.4\times10^{-4}$, progressive batch-size ramp-up, parallelized via pipeline, expert, and ZeRO-1 data parallelism (collected in the configuration sketch below).
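As a configuration sketch (field names are ours; only values stated above are included):

```python
from dataclasses import dataclass

@dataclass
class DeepSeekV2Config:              # field names are illustrative
    n_layers: int = 60
    d_model: int = 5120
    n_heads: int = 128
    d_head: int = 128
    n_shared_experts: int = 2
    n_routed_experts: int = 160
    d_expert: int = 1536
    k_r: int = 6                     # active routed experts per token
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    weight_decay: float = 0.1
    peak_lr: float = 2.4e-4
```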

Long-context extension to 128k tokens is realized through YaRN scaling of the RoPE matrices (settings: $s = 40$, $\alpha = 1$, $\beta = 32$) and brief fine-tuning at 32k context, which subsequently generalizes to the full 128k context as validated on "Needle in a Haystack" tasks.

Supervised fine-tuning (SFT) uses 1.5M high-quality human-generated conversations (1.2M for helpfulness, 0.3M for safety) together with instruction tuning for dialogue consistency. Reinforcement-learning alignment employs Group Relative Policy Optimization (GRPO) with a two-stage reward scheme: first reasoning reward models (code/math), then multi-reward alignment (helpfulness, safety, rule-based scoring), explicitly without a critic network.
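GRPO's defining ingredient is a critic-free, group-relative advantage: several responses are sampled per prompt and each reward is standardized against its group's mean and standard deviation. A minimal sketch of that step is shown below (function name ours; the clipped policy-gradient objective and KL regularization that consume these advantages are omitted).

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: [n_prompts, group_size] scalar rewards for the sampled responses.
    Returns advantages standardized within each prompt's group, which is what
    GRPO uses in place of a learned critic/value baseline."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```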

5. Empirical Results and Efficiency Gains

DeepSeek-V2 demonstrates competitive or superior performance to contemporary open-source models on language understanding and reasoning benchmarks. Key findings include:

  • Accuracy: On MMLU (5-shot), with only 21B activated parameters, DeepSeek-V2 achieves 78.5%, outperforming many dense models of greater active size (e.g., DeepSeek-67B: 71.3%, Qwen1.5-72B: 77.2%, Mixtral-8x22B: 77.6%) and approaching LLaMA3-70B (78.9%).
  • Code and math: HumanEval: 48.8% (DeepSeek-V2), GSM8K (8-shot): 79.2%.
  • Training cost: 42.5% reduction per trillion tokens compared to dense DeepSeek-67B (172.8k GPU-hr/T vs. 300.6k GPU-hr/T).
  • Inference throughput: On 8×H800, DeepSeek-V2 achieves >50,000 tokens/s (5.76× DeepSeek-67B).
  • KV-cache compression: MLA reduces per-token KV from ~860k elements (MHA) to ~35k (4% of MHA baseline).

These results confirm that sparse MoE architectures plus MLA attention yield significant FLOP, memory, and bandwidth savings without compromising SOTA language modeling accuracy (DeepSeek-AI et al., 2024).
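As a quick sanity check, the headline efficiency figures above are mutually consistent; the numbers below are taken from the list, and the ~8,700 tokens/s baseline is derived rather than quoted.

```python
# Training-cost saving: 1 - 172.8k / 300.6k GPU-hours per trillion tokens
print(1 - 172.8 / 300.6)    # ~0.425 -> the reported 42.5% reduction

# Dense-baseline throughput implied by the 5.76x speedup over DeepSeek-67B
print(50_000 / 5.76)        # ~8,681 tokens/s (derived, not quoted)

# KV-cache compression ratio from the per-token element counts
print(35_000 / 860_000)     # ~0.041 -> roughly 4% of the MHA baseline
```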

6. Context Length and Inference Scaling

DeepSeek-V2 natively supports a 4k-token window; following YaRN adaptation, the model operates on contexts of up to 128k tokens. MLA keeps per-token attention memory nearly independent of context length, and optimized attention kernels (e.g., FlashAttention-2) keep inference latency manageable even at large context sizes. This profile is especially advantageous for document-level understanding, retrieval-augmented generation, and settings where extended context and parallel user load are critical, such as online services or multi-tenant deployments.
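To make this concrete, here is a back-of-the-envelope estimate of the KV-cache footprint for one full 128k-token sequence, assuming 16-bit cache entries (our assumption) and the per-token element count implied by $(d_c + d_h^R)\,L$ for the configuration above:

```python
# Rough KV-cache footprint for a single 128k-token sequence, 16-bit entries assumed.
tokens = 128 * 1024
elements_per_token = (512 + 64) * 60        # (d_c + d_h^R) * L = 34,560, i.e. the ~35k cited above
bytes_per_element = 2                       # bf16 / fp16
total_gib = tokens * elements_per_token * bytes_per_element / 1024**3
print(round(total_gib, 2))                  # ~8.44 GiB for the full 128k context
```

By the ~4% ratio cited earlier, a full-MHA cache at the same settings would be roughly 25× larger, which is precisely the bottleneck MLA removes for multi-tenant, long-context serving.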

7. Applications, Trade-Offs, and Practical Considerations

DeepSeek-V2 offers robust few-shot and zero-shot performance in English and Chinese, enabling diverse downstream use cases. Its strength lies in:

  • Ultra-economical training and inference via highly sparse expert selection and MLA memory savings.
  • Context scaling to 128k tokens with only a modest additional burden on inference hardware.
  • Flexible parameter footprint: fine control over active parameter budget per token.

The model introduces added engineering complexity (e.g., MoE routing, balance losses, parallelism strategies). There is also a minor post-RL alignment trade-off on certain benchmark tasks, reflecting the usual tension between alignment and raw task performance. Recommended usage scenarios include cost-sensitive inference, retrieval-augmented LLM tasks, and applications requiring persistent long-context support.

DeepSeek-V2 demonstrates that large-scale sparse MoE LLMs with joint low-rank attention compression achieve practical, scalable, and state-of-the-art performance while sharply reducing resource requirements, thus charting a direction for economical deployment of LLMs at ultra-large scale (DeepSeek-AI et al., 2024).

References

  • DeepSeek-AI et al. (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434.
