DeepSeek-V2: Sparse MoE Language Model
- DeepSeek-V2 is a large-scale sparse Mixture-of-Experts language model featuring 236B total parameters with only 21B activated per token for efficiency.
- It employs novel DeepSeekMoE layers and Multi-Head Latent Attention to drastically reduce memory footprint and training cost while maintaining state-of-the-art accuracy.
- Empirical results show strong benchmark performance with context support up to 128k tokens, together with significant gains in inference throughput and resource efficiency.
DeepSeek-V2 is a large-scale sparse Mixture-of-Experts (MoE) LLM engineered for state-of-the-art natural language understanding and generation, with an emphasis on computational efficiency, flexible context capacity, and cost-effective training and inference. With a total capacity of 236B parameters and only 21B activated per token, DeepSeek-V2 leverages two major architectural advances—DeepSeekMoE sparse Feed-Forward layers and Multi-Head Latent Attention (MLA)—to surpass contemporaneous open-source LLMs in accuracy and throughput, while drastically reducing training cost and memory footprint (DeepSeek-AI et al., 2024).
1. Model Scale, Sparsity, and Architectural Overview
DeepSeek-V2 comprises 236B total model parameters, of which approximately 9% (21B) are active per token. This high-sparsity configuration is realized by integrating MoE FFN sublayers (DeepSeekMoE) and deploying MLA for scalable and memory-efficient self-attention. Each transformer block combines:
- Sparse Feed-Forward (DeepSeekMoE): Replaces dense FFNs with a dual-tier expert structure, consisting of a small set of shared experts and a larger pool of routed experts. For each token representation $u_t$ entering an MoE layer, the output is
$$h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(u_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}^{(r)}_i(u_t),$$
where $N_s$ and $N_r$ are the numbers of shared and routed experts, and the gate values $g_{i,t}$ are determined by a softmax over token-expert affinities followed by top-$K_r$ selection.
- Sparse Attention (MLA): MLA compresses standard multi-head attention's Key-Value (KV) cache from $2 n_h d_h l$ elements per token (where $n_h$ is the number of heads, $d_h$ the head dimension, and $l$ the number of layers) to $(d_c + d^R_h)\,l$, with $d_c = 4 d_h$ and $d^R_h = d_h / 2$, resulting in a 93.3% KV cache reduction.
Overall, DeepSeek-V2 achieves efficient scaling in both model capacity and context length, supporting up to 128k context tokens via long-context adaptation.
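As a quick check of what these formulas imply, the following sketch computes per-token KV-cache element counts from the dimensions reported for DeepSeek-V2; the comparison is against a hypothetical full-MHA cache at the same width, and actual byte counts further depend on the cache's numeric precision.

```python
# Per-token KV-cache element counts implied by the formulas above.
n_heads = 128          # n_h
d_head = 128           # d_h
n_layers = 60          # l
d_c = 4 * d_head       # latent KV dimension (512)
d_rope = d_head // 2   # decoupled RoPE key dimension (64)

mha_elems = 2 * n_heads * d_head * n_layers   # keys + values, every head, every layer
mla_elems = (d_c + d_rope) * n_layers         # one latent + one RoPE key per layer

print(f"full MHA : {mha_elems:>9,d} elements/token")   # 1,966,080
print(f"MLA      : {mla_elems:>9,d} elements/token")   # 34,560
print(f"ratio    : {mla_elems / mha_elems:.1%}")       # ~1.8%
```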
2. DeepSeekMoE and Sparse Computation Mechanisms
DeepSeekMoE implements a two-level expert system within each MoE layer: $N_s$ shared experts, always applied, and $N_r$ routed experts, of which $K_r$ are selected per token by a router gating mechanism (in DeepSeek-V2, $N_s = 2$, $N_r = 160$, $K_r = 6$). The router computes the token-to-expert affinity
$$s_{i,t} = \mathrm{Softmax}_i\!\left(u_t^{\top} e_i\right),$$
where $e_i$ is the learned centroid of routed expert $i$, and activates the top-$K_r$ experts for input $u_t$.
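Below is a minimal, illustrative PyTorch sketch of this routing pattern (shared experts always applied, routed experts gated by their softmax affinities after top-$K_r$ selection); the layer sizes, two-layer expert MLPs, and per-token dispatch loop are simplifications, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeepSeekMoELayer(nn.Module):
    """Illustrative shared + routed expert FFN layer (not the official implementation)."""
    def __init__(self, d_model=64, d_expert=32, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                         nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))   # always applied
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))   # gated per token
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model))      # expert centroids e_i
        self.top_k = top_k

    def forward(self, u):                                   # u: (n_tokens, d_model)
        out = u + sum(ffn(u) for ffn in self.shared)        # residual + shared experts
        s = F.softmax(u @ self.centroids.t(), dim=-1)       # affinities s_{i,t}
        gates, idx = torch.topk(s, self.top_k, dim=-1)      # keep only the top-K_r affinities
        routed = torch.stack([                              # naive per-token dispatch (clarity over speed)
            sum(g * self.routed[int(i)](u[t]) for g, i in zip(gates[t], idx[t]))
            for t in range(u.size(0))
        ])
        return out + routed

tokens = torch.randn(4, 64)
print(ToyDeepSeekMoELayer()(tokens).shape)                  # torch.Size([4, 64])
```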
To maintain balanced expert participation and prevent router collapse, three load-balancing losses are added (the expert-level term is sketched in code after this list):
- Expert-level balance: $\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i$, where $f_i$ is the normalized fraction of tokens that select expert $i$ and $P_i = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}$ is its mean affinity over the batch.
- Device-level balance: $\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f'_i P'_i$, the analogous term aggregated over the $D$ device groups that host the routed experts.
- Communication balance: $\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f''_i P''_i$, which balances the volume of tokens dispatched to and received by each device.
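A small sketch of the expert-level term referenced above, assuming the standard frequency-times-affinity formulation ($f_i \cdot P_i$); the coefficient value and batch statistics here are illustrative.

```python
import torch

def expert_level_balance_loss(affinities: torch.Tensor, topk_idx: torch.Tensor,
                              alpha: float = 0.003) -> torch.Tensor:
    """L_ExpBal = alpha * sum_i f_i * P_i, computed from batch-level statistics.

    affinities: (T, N_r) softmax affinities s_{i,t} for T tokens and N_r routed experts
    topk_idx:   (T, K_r) indices of the experts actually selected for each token
    """
    T, n_routed = affinities.shape
    k = topk_idx.shape[1]
    # f_i: fraction of tokens selecting expert i, normalized so a uniform router gives f_i = 1
    counts = torch.zeros(n_routed).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(T * k))
    f = counts * n_routed / (k * T)
    # P_i: mean affinity assigned to expert i over the batch
    P = affinities.mean(dim=0)
    return alpha * (f * P).sum()

aff = torch.softmax(torch.randn(16, 8), dim=-1)      # 16 tokens, 8 routed experts
idx = torch.topk(aff, k=2, dim=-1).indices
print(expert_level_balance_loss(aff, idx))
```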
This expert selection paradigm yields strong specialization and task-adaptive expressivity, with only a fraction of parameters actively processed per token.
3. Multi-Head Latent Attention and Memory Compression
MLA replaces standard multi-head attention's KV cache with joint latent projections:
- Each token's keys and values are jointly compressed into a latent vector of reduced dimension $d_c$ ($d_c = 4 d_h = 512$ in DeepSeek-V2), from which per-head keys and values are recovered by learned up-projections.
- At inference, only this compressed latent and a compact decoupled RoPE key of dimension $d^R_h = 64$ are cached, yielding $(d_c + d^R_h)\,l \approx 34.6\text{k}$ cached elements per token for the 60-layer configuration, compared with $2 n_h d_h l$ for full MHA, i.e., only a few percent of the full-MHA count.
Consequently, DeepSeek-V2 realizes a 93.3% memory reduction in KV-cache usage over a dense baseline (DeepSeek-67B), alleviating bandwidth and latency bottlenecks during large-context inference and facilitating deployment on memory-constrained hardware.
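The caching pattern can be sketched as follows: compress each token's hidden state once, cache only the small latent, and re-expand per-head keys and values when attention is computed. The sketch below omits the decoupled RoPE path and DeepSeek's exact projection layout, so the names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class ToyLatentKVCache(nn.Module):
    """Cache a low-rank latent per token instead of full per-head keys/values."""
    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head values
        self.n_heads, self.d_head = n_heads, d_head
        self.cache = []                                                # list of (d_latent,) latents

    def append(self, h_t: torch.Tensor):
        """Compress the new token's hidden state and store only the latent."""
        self.cache.append(self.down(h_t))

    def keys_values(self):
        """Re-expand all cached latents into per-head keys and values."""
        c = torch.stack(self.cache)                                    # (T, d_latent)
        k = self.up_k(c).view(-1, self.n_heads, self.d_head)           # (T, n_heads, d_head)
        v = self.up_v(c).view(-1, self.n_heads, self.d_head)
        return k, v

kv = ToyLatentKVCache()
for _ in range(5):
    kv.append(torch.randn(512))
k, v = kv.keys_values()
print(k.shape, v.shape)   # torch.Size([5, 8, 64]) torch.Size([5, 8, 64])
```

As the paper notes, the key and value up-projections can be absorbed into the query and output projections, so at inference the cached latents never need to be materialized as full per-head keys and values.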
4. Training Regimen and Alignment Pipeline
Pre-training utilizes a bilingual (English/Chinese), multi-source corpus of 8.1T tokens covering web text, books, and code, with safety filtering and an enriched share of mathematical content. Key hyperparameters and settings include:
- 60 transformer layers, hidden size 5120, 128 attention heads (head dimension 128), dense FFN intermediate width 12288.
- MoE layers: 2 shared + 160 routed experts per layer, expert intermediate width 1536, with 6 routed experts activated per token.
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1), peak learning rate $2.4 \times 10^{-4}$, progressive batch-size ramp-up, parallelized via pipeline, expert, and ZeRO-1 data parallelism.
Long-context extension to 128k tokens is realized through YaRN scaling of the RoPE frequencies (scale $s = 40$, $\alpha = 1$, $\beta = 32$) and brief fine-tuning at 32k context, subsequently generalizing to the full 128k context as validated on "Needle in a Haystack" tasks.
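A compact sketch of YaRN-style frequency remapping under these settings, assuming the published YaRN "NTK-by-parts" formulation (a ramp between $\alpha$ and $\beta$ over per-dimension rotation counts) rather than DeepSeek's exact kernel; the accompanying attention-temperature adjustment is omitted, and the 64-dim head matches the decoupled RoPE dimension noted earlier.

```python
import math

def yarn_scaled_freqs(d_head=64, base=10000.0, orig_ctx=4096,
                      scale=40, alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies per dimension (YaRN-style)."""
    freqs = []
    for i in range(d_head // 2):
        theta = base ** (-2 * i / d_head)              # original RoPE frequency
        rotations = orig_ctx * theta / (2 * math.pi)   # periods completed within the original context
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # ramp = 0 -> pure position interpolation (theta / scale); ramp = 1 -> keep original theta
        freqs.append(theta * (ramp + (1.0 - ramp) / scale))
    return freqs

scaled = yarn_scaled_freqs()
print(len(scaled), scaled[0], scaled[-1])   # high-frequency dims are kept; low-frequency dims are interpolated
```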
Supervised fine-tuning (SFT) uses approximately 1.5M curated conversational instances (1.2M for helpfulness, 0.3M for safety) with instruction tuning for dialogue consistency. Reinforcement learning alignment employs Group Relative Policy Optimization (GRPO) in two reward stages: first reasoning-oriented RL with code/math reward models, then multi-reward alignment (helpfulness, safety, rule-based scoring); GRPO estimates advantages from groups of sampled responses rather than from a critic network.
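A minimal sketch of the group-relative core of GRPO: sample several responses per prompt, standardize their rewards within the group, and plug the resulting advantages into a clipped policy-gradient objective. The clipping constant, rewards, and log-probabilities below are placeholders.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each prompt's group."""
    # rewards: (n_prompts, group_size), one scalar reward per sampled response
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)            # no learned critic needed

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective applied with the group-relative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()  # a KL-to-reference penalty is added in practice

rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4]])        # 1 prompt, 4 sampled responses
adv = grpo_advantages(rewards)
loss = clipped_surrogate(torch.randn(1, 4), torch.randn(1, 4), adv)
print(adv, loss)
```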
5. Empirical Results and Efficiency Gains
DeepSeek-V2 demonstrates competitive or superior performance to contemporary open-source models on language understanding and reasoning benchmarks. Key findings include:
- Accuracy: On MMLU (5-shot), with only 21B activated parameters, DeepSeek-V2 achieves 78.5%, outperforming several open models with larger activated parameter counts (e.g., dense DeepSeek-67B: 71.3%, Qwen1.5-72B: 77.2%, and the sparse Mixtral-8x22B: 77.6%) and approaching LLaMA3-70B (78.9%).
- Code and math: HumanEval: 48.8% (DeepSeek-V2), GSM8K (8-shot): 79.2%.
- Training cost: 42.5% reduction in GPU-hours per trillion training tokens compared to dense DeepSeek-67B (172.8k vs. 300.6k GPU-hours/T; see the quick check after this list).
- Inference throughput: On 8×H800, DeepSeek-V2 achieves >50,000 tokens/s (5.76× DeepSeek-67B).
- KV-cache compression: MLA caches roughly 35k elements per token for the 60-layer configuration, versus roughly 1.97M for full MHA at the same width (under 2%), and 93.3% less than DeepSeek-67B's KV cache.
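As referenced in the training-cost bullet, the headline percentages follow directly from the quoted figures:

```python
# Quick check of the derived efficiency figures quoted in this section.
print(f"activated parameters : {21 / 236:.1%} of total")             # ~8.9%
gpu_hr_dense, gpu_hr_v2 = 300.6e3, 172.8e3   # GPU-hours per trillion training tokens
print(f"training-cost saving : {1 - gpu_hr_v2 / gpu_hr_dense:.1%}")   # 42.5%
```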
These results confirm that sparse MoE architectures plus MLA attention yield significant FLOP, memory, and bandwidth savings without compromising SOTA language modeling accuracy (DeepSeek-AI et al., 2024).
6. Context Length and Inference Scaling
DeepSeek-V2 natively supports a 4k-token window; following YaRN adaptation, the model operates on contexts of up to 128k tokens. MLA keeps the per-token KV-cache footprint small, so total attention memory grows only modestly with context length, and optimized attention kernels (e.g., FlashAttention-2-style implementations) keep long-context inference efficient in practice. This profile is especially advantageous for document-level understanding, retrieval-augmented generation, and settings where extended context and parallel user load are critical, such as online services or multi-tenant deployments.
7. Applications, Trade-Offs, and Practical Considerations
DeepSeek-V2 offers robust few-shot and zero-shot performance in English and Chinese, enabling diverse downstream use cases. Its strengths include:
- Ultra-economical training and inference via highly sparse expert selection and MLA memory savings.
- Long-context scaling to 128k tokens with only a modest increase in KV-cache memory.
- A small activated-parameter footprint (21B of 236B total) per token, decoupling model capacity from per-token compute.
The model introduces added engineering complexity (e.g., MoE routing, balance losses, and multi-dimensional parallelism strategies). RL alignment also incurs a mild trade-off on certain benchmark tasks, reflecting the usual tension between alignment and raw task performance. Recommended usage scenarios include cost-sensitive inference, retrieval-augmented LLM tasks, and applications requiring persistent long-context support.
DeepSeek-V2 demonstrates that large-scale sparse MoE LLMs with joint low-rank attention compression achieve practical, scalable, and state-of-the-art performance while sharply reducing resource requirements, thus charting a direction for economical deployment of LLMs at ultra-large scale (DeepSeek-AI et al., 2024).