DeepSeek V2: Scalable MoE Transformer
- DeepSeek V2 is a family of transformer-based Mixture-of-Experts models that combines sparse expert routing with Multi-Head Latent Attention for efficient, high-performing language modeling and domain specialization.
- It significantly reduces compute and memory costs by activating only a fraction of its 236B parameters, supporting long-context processing up to 128K tokens.
- The architecture underpins domain-specialized models for code intelligence, vision-language reasoning, and formal mathematical reasoning, matching or exceeding closed-source models on many benchmarks.
DeepSeek V2 refers to a family of transformer-based Mixture-of-Experts (MoE) models developed for economical, efficient, and high-performing language understanding and generation. The core DeepSeek-V2 architecture—initially introduced as a large-scale general-purpose LLM—serves as the foundation for several domain-specialized models, including DeepSeek-Coder-V2 for code intelligence, DeepSeek-VL2 for vision-language reasoning, and DeepSeek-Prover-V2 for formal mathematical reasoning. Central technical pillars include a scalable sparse MoE topology and the Multi-Head Latent Attention (MLA) mechanism, enabling efficient training and inference at scale (DeepSeek-AI et al., 7 May 2024, DeepSeek-AI et al., 17 Jun 2024, Wu et al., 13 Dec 2024, Ren et al., 30 Apr 2025).
1. Architectural Innovations: Mixture-of-Experts and MLA
DeepSeek-V2 employs a 60-layer transformer in which nearly all feed-forward blocks are replaced by sparse Mixture-of-Experts (MoE) sublayers (“DeepSeekMoE”), paired with Multi-Head Latent Attention (MLA) for efficient sequence modeling (DeepSeek-AI et al., 7 May 2024). The model has 236 billion total parameters, but activates only 21 billion per token at inference—a ratio of ≈9%.
In DeepSeekMoE, each input token is routed to $K_r = 6$ of $N_r = 160$ routed experts plus $N_s = 2$ shared experts per layer. Expert selection is mediated by lightweight affinity scores $s_{i,t} = \operatorname{Softmax}_i(u_t^{\top} e_i)$: the gate value $g_{i,t}$ equals $s_{i,t}$ for the top-$K_r$ experts and is zero otherwise, where the $e_i$ are learned expert centroids and $u_t$ is the token's hidden state.
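A minimal PyTorch-style sketch of this gating rule follows (the function name and toy sizes are illustrative; shared experts, which process every token, and auxiliary load-balancing losses are omitted):

```python
import torch
import torch.nn.functional as F

def deepseekmoe_gate(u: torch.Tensor, centroids: torch.Tensor, k: int = 6) -> torch.Tensor:
    """Compute sparse gate values g[t, i] for routed experts.

    u:         (num_tokens, d_model) token hidden states
    centroids: (num_routed_experts, d_model) learned expert centroids e_i
    Returns:   (num_tokens, num_routed_experts) gates, nonzero only on each token's top-k experts.
    """
    scores = F.softmax(u @ centroids.T, dim=-1)        # affinity s[t, i]
    topk_vals, topk_idx = scores.topk(k, dim=-1)       # keep the k highest-affinity experts
    gates = torch.zeros_like(scores)
    gates.scatter_(-1, topk_idx, topk_vals)            # g[t, i] = s[t, i] on the top-k, else 0
    return gates

# Example: 4 tokens, hidden size 8, 16 routed experts, top-2 routing (toy sizes).
u = torch.randn(4, 8)
e = torch.randn(16, 8)
print(deepseekmoe_gate(u, e, k=2).count_nonzero(dim=-1))   # tensor([2, 2, 2, 2])
```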
MLA replaces classical multi-head attention with a two-component strategy: a low-rank compression of key/value pairs (reducing the inference-time KV cache by ≈93.3% relative to DeepSeek 67B) and a lightweight "decoupled" RoPE-carrying key for position encoding (DeepSeek-AI et al., 7 May 2024). For $n_h = 128$ heads of dimension $d_h = 128$, compressed KV dimension $d_c = 4 d_h = 512$, and decoupled RoPE dimension $d_h^R = d_h / 2 = 64$, MLA shrinks the per-token, per-layer cache from $2 n_h d_h = 32768$ elements to $d_c + d_h^R = 576$.
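To make the cache arithmetic concrete, here is a back-of-the-envelope comparison of per-token, per-layer KV-cache elements for standard multi-head attention versus MLA's compressed latent plus decoupled RoPE key, using the head dimensions cited above (function name and phrasing are illustrative):

```python
def kv_cache_elements_per_token_per_layer(n_heads: int, head_dim: int,
                                          d_compressed: int, d_rope: int) -> tuple[int, int]:
    """Return (standard MHA, MLA) KV-cache sizes in elements per token per layer."""
    mha = 2 * n_heads * head_dim          # full keys and values for every head
    mla = d_compressed + d_rope           # shared compressed KV latent + decoupled RoPE key
    return mha, mla

mha, mla = kv_cache_elements_per_token_per_layer(n_heads=128, head_dim=128,
                                                 d_compressed=512, d_rope=64)
print(mha, mla, f"{mla / mha:.1%}")       # 32768 576 1.8%
```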
This design reduces both compute and memory costs, particularly benefiting high-throughput inference and long-context workloads.
2. Training Efficiency, Memory Dynamics, and Parallelism
Training DeepSeek-V2 is highly economical due to MoE sparsity, MLA compression, and optimized parallelism. Only a small fraction of total parameters are activated per forward/backward pass, resulting in nearly halved training FLOPs and a measured 42.5% reduction in GPU-hours per trillion tokens compared to a 67B dense model on the same hardware (DeepSeek-AI et al., 7 May 2024).
A detailed memory analysis (Zhang et al., 11 Feb 2025) decomposes the per-GPU footprint into static parameters, gradients, optimizer states, activations, and overhead buffers. For the DeepSeek-V2 configuration studied, parameters, gradients, and optimizer states form a fixed baseline; activation memory varies widely between storing all intermediates and fully recomputing them; and the peak footprint without any ZeRO sharding is reduced substantially per GPU under ZeRO Stage 3.
Activation recomputation policies offer a flexible memory–compute trade-off, from storing all intermediates (high memory, no recomputation) to recomputing activations on the fly (low memory, extra forward-pass compute). Multi-dimensional parallelism across data, tensor, expert, and pipeline dimensions allows scalable training on arbitrary GPU clusters, with per-GPU memory shrinking as experts and tensor shards are partitioned over more devices (Zhang et al., 11 Feb 2025).
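As a rough illustration of how the static components scale under ZeRO sharding (a back-of-the-envelope sketch, not the accounting from Zhang et al., 11 Feb 2025; the per-GPU parameter count and byte costs assume fp16/fp32 mixed-precision Adam):

```python
def static_memory_per_gpu_gb(
    resident_params: float,   # parameters resident on this GPU after expert/tensor partitioning (assumed)
    dp_degree: int,           # data-parallel world size
    zero_stage: int = 0,      # 0 = no sharding, 1 = optimizer, 2 = +gradients, 3 = +parameters
) -> float:
    """Rough static-memory estimate for mixed-precision Adam (fp16 params/grads, fp32 master + moments)."""
    param_bytes = 2 * resident_params          # fp16 weights
    grad_bytes = 2 * resident_params           # fp16 gradients
    optim_bytes = 12 * resident_params         # fp32 master copy + Adam first/second moments

    if zero_stage >= 1:
        optim_bytes /= dp_degree
    if zero_stage >= 2:
        grad_bytes /= dp_degree
    if zero_stage >= 3:
        param_bytes /= dp_degree

    return (param_bytes + grad_bytes + optim_bytes) / 1024**3


# Example: ~3.7B parameters resident per GPU (an assumed figure), 16-way data parallelism.
for stage in (0, 2, 3):
    print(stage, round(static_memory_per_gpu_gb(3.7e9, dp_degree=16, zero_stage=stage), 1), "GB")
```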
3. Model Scaling, Pre-training Data, and Context Extension
The base DeepSeek-V2 model is pretrained on 8.1T tokens with a 100K byte-level BPE vocabulary, drawing on high-quality bilingual (English, Chinese) web, code, and math data (DeepSeek-AI et al., 7 May 2024). The context window extends to 128K tokens via YaRN positional interpolation (DeepSeek-AI et al., 17 Jun 2024), supporting project-level or document-level reasoning at scale.
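For context, here is a minimal sketch of YaRN-style "NTK-by-parts" frequency interpolation: low-frequency (long-wavelength) RoPE dimensions are interpolated by the scale factor while high-frequency dimensions are left nearly intact. The ramp thresholds, scale value, and function name are illustrative assumptions, not values taken from the DeepSeek papers:

```python
import numpy as np

def yarn_scaled_frequencies(
    head_dim: int = 64,
    base: float = 10000.0,
    scale: float = 32.0,          # e.g. a 4K -> 128K extension factor (illustrative)
    orig_ctx: int = 4096,
    alpha: float = 1.0,           # ramp thresholds (assumed, tunable)
    beta: float = 32.0,
) -> np.ndarray:
    """Return RoPE inverse frequencies after YaRN-style NTK-by-parts interpolation."""
    dims = np.arange(0, head_dim, 2)
    inv_freq = base ** (-dims / head_dim)          # standard RoPE frequencies
    wavelen = 2 * np.pi / inv_freq

    # Ramp in [0, 1]: 1 keeps the original frequency, 0 fully interpolates by `scale`.
    ratio = orig_ctx / wavelen
    ramp = np.clip((ratio - alpha) / (beta - alpha), 0.0, 1.0)

    interpolated = inv_freq / scale                # position-interpolated frequencies
    return ramp * inv_freq + (1.0 - ramp) * interpolated
```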
Subsequent domain models adopt the same backbone and extend pretraining:
- DeepSeek-Coder-V2: an additional 6T tokens, primarily source code (60%), math (10%), and natural language (30%); expands programming-language coverage from 86 to 338 languages (DeepSeek-AI et al., 17 Jun 2024).
- DeepSeek-VL2: 800B tokens of vision-language paired data spanning image OCR, QA, and document structure, with a dynamic tiling scheme that handles images of arbitrary aspect ratio (see the sketch below) (Wu et al., 13 Dec 2024).
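To illustrate the dynamic tiling idea, the sketch below is a simplified reconstruction (not DeepSeek-VL2's exact policy; the 384-pixel tile size and tile cap are assumed values): it picks a tile grid whose aspect ratio best matches the input image, then the image would be resized to that grid and sliced into fixed-size tiles alongside a global thumbnail.

```python
from itertools import product

def choose_tile_grid(width: int, height: int, max_tiles: int = 9) -> tuple[int, int]:
    """Pick a (cols, rows) grid, at most max_tiles tiles, whose aspect ratio best matches the image."""
    target = width / height
    candidates = [(c, r) for c, r in product(range(1, max_tiles + 1), repeat=2) if c * r <= max_tiles]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

def tile_boxes(width: int, height: int, tile: int = 384, max_tiles: int = 9):
    """Return crop boxes over the resized image, one per local tile (global thumbnail kept separately)."""
    cols, rows = choose_tile_grid(width, height, max_tiles)
    boxes = [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
             for r in range(rows) for c in range(cols)]
    # The image itself would be resized to (cols * tile, rows * tile) before cropping.
    return (cols, rows), boxes

# Example: a wide image maps to a grid with more columns than rows.
print(tile_boxes(1600, 700))
```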
4. Specializations: Code, Vision-Language, and Formal Mathematics
DeepSeek-Coder-V2 leverages DeepSeek-V2's MoE backbone for code intelligence, offering both 16B (2.4B activated) and 236B (21B activated) variants with long (128K) context. It strengthens multilingual code completion, fill-in-the-middle, code repair, and mathematical reasoning, matching or exceeding closed-source models such as GPT-4 Turbo on benchmarks including HumanEval (90.2%) and MBPP+ (76.2%) (DeepSeek-AI et al., 17 Jun 2024).
DeepSeek-VL2 adapts the architecture for vision-language reasoning by incorporating a dynamic tiling vision encoder and retaining the MoE+MLA stack for language. Three variants (1.0B, 2.8B, 4.5B activated parameters) attain state-of-the-art open-source performance in OCR, multimodal QA, and visual grounding (Wu et al., 13 Dec 2024).
DeepSeek-Prover-V2 scales formal mathematical reasoning in Lean 4 by initializing from DeepSeek-V3 and then applying recursive subgoal decomposition, supervised fine-tuning, and RL alignment with Group Relative Policy Optimization (GRPO). It achieves 88.9% pass@8192 on MiniF2F-test and narrows the prior gap between informal chain-of-thought reasoning and syntactic formal proof (Ren et al., 30 Apr 2025).
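As a toy illustration of the subgoal-decomposition stage (this example is invented and assumes Mathlib; it is not drawn from the paper), a theorem is split into named `have` subgoals left as `sorry`, which a prover model then attempts to close individually:

```lean
import Mathlib

-- Toy decomposition: each `sorry` marks a subgoal handed off to the prover model.
theorem toy_sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry
  have h2 : 0 ≤ b ^ 2 := by sorry
  exact add_nonneg h1 h2
```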
5. Fine-Tuning, RL Alignment, and Evaluation
After pretraining, DeepSeek-V2 models undergo two alignment phases:
- Supervised Fine-Tuning (SFT) on a mixed corpus of instruction, code, alignment, and safety data.
- Reinforcement learning with Group Relative Policy Optimization (GRPO), a group-wise policy-optimization scheme that standardizes rewards within each group of sampled outputs and operates directly on model outputs without a separate critic (DeepSeek-AI et al., 7 May 2024, Ren et al., 30 Apr 2025); a sketch of the group-relative advantage computation follows this list.
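A minimal sketch of the group-relative advantage computation at the heart of GRPO (the clipped policy-gradient objective and KL penalty are omitted; tensor shapes and the epsilon constant are assumptions):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each group of sampled completions for the same prompt.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per sampled output.
    Returns advantages of the same shape; no learned value function (critic) is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
r = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(r))
```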
Chat and instruct versions of DeepSeek-V2 match or exceed leading open-source results. For example, DeepSeek-V2 Chat (RL) posts an MT-Bench score of 8.97 and an AlpacaEval win rate of 38.9%, and scores 7.91 on AlignBench (GPT-4 rated), second only to GPT-4-1106 and on par with ERNIEBot 4.0 (DeepSeek-AI et al., 7 May 2024).
6. Practical Training Guidelines and Deployability
Efficient large-scale training is enabled by tuning micro-batch size, activation checkpointing, and ZeRO sharding:
- For 8×A100 80GB: ZeRO Stage 2 with full activation recomputation yields a per-GPU footprint of 21.4 GB, leaving ample headroom for larger micro-batches.
- The best throughput is obtained with activation checkpointing disabled (AC=None) under ZeRO Stage 2, at roughly 33 GB/GPU.
- ZeRO Stage 3 is preferable for maximal robustness, allowing even larger micro-batches (Zhang et al., 11 Feb 2025); a hypothetical configuration in this spirit is sketched below.
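The following is a hypothetical DeepSpeed-style ZeRO Stage 2 configuration in the spirit of these guidelines; the key names follow DeepSpeed's config schema, but the batch sizes and toggles are placeholders to be tuned per cluster rather than settings from the cited paper.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,          # raise until the 80 GB budget is reached
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                               # shard optimizer states and gradients
        "overlap_comm": True,
    },
    "activation_checkpointing": {
        "partition_activations": False,           # set True to trade compute for memory
    },
}
# Typically passed to deepspeed.initialize(model=..., model_parameters=..., config=ds_config).
```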
Inference optimization benefits directly from MLA’s KV-cache compression and MoE’s sparse routing, supporting extremely long sequence handling and throughput up to 5.76× that of DeepSeek-67B on the same hardware (DeepSeek-AI et al., 7 May 2024).
7. Empirical Performance and Limitations
DeepSeek-V2 and its derivatives lead open-source models on a variety of benchmarks:
| Model | Activated params | MMLU (5-shot) | HumanEval | MT-Bench | MBPP+ | AlignBench |
|---|---|---|---|---|---|---|
| DeepSeek-V2 (RL) | 21 B | 78.5 | 48.8 | 8.97 | — | 7.91 |
| DS-Coder-V2 | 21 B | 79.2 | 90.2 | 8.77 | 76.2 | 7.84 |
| LLaMA 3 70B | 70 B | 78.9 | 48.2 | 8.95 | — | — |
| Qwen1.5 72B | 72 B | 77.2 | 43.9 | 8.61 | — | — |
| Mixtral 8×22B | 39 B | 77.6 | 53.1 | 8.66 | — | — |
DeepSeek-Coder-V2 matches or exceeds closed-source GPT-4 Turbo on coding, math, and code-understanding tasks across most languages (DeepSeek-AI et al., 17 Jun 2024). DeepSeek-VL2 achieves strong OCR, QA, and grounding results at compact open-source model scales (Wu et al., 13 Dec 2024).
Limitations include a residual gap relative to GPT-4 on instruction following and repository-level software engineering (e.g., SWE-Bench), the lack of native tool interaction, high inference cost at maximal context length, and weaker representation of rare languages. A plausible implication is that while MoE and MLA architectures have pushed open-source scaling and efficiency, further advances in model alignment, integrated tool use, and robust rare-language modeling will be needed to maintain parity with evolving proprietary systems.