
DeepSeek-V3 Technical Report (2412.19437v2)

Published 27 Dec 2024 in cs.CL and cs.AI

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) LLM with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Summary

  • The paper presents DeepSeek-V3, a 671B-parameter Mixture-of-Experts language model that activates only 37B parameters per token for efficiency.
  • The paper employs innovative techniques including Multi-Head Latent Attention and DeepSeekMoE to reduce inference cost and balance load without auxiliary loss.
  • The paper demonstrates strong benchmark performance through pre-training on 14.8 trillion tokens and advanced strategies such as Multi-Token Prediction and Reinforcement Learning.

The paper presents DeepSeek-V3, a Mixture-of-Experts (MoE) LLM with 671B total parameters, where 37B parameters are activated for each token. The model is designed for efficient inference and training, achieved through Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were previously validated in DeepSeek-V2. Key innovations include an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model was pre-trained on 14.8 trillion tokens and further refined through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The training costs totaled 2.788M H800 GPU hours.

Architecture

DeepSeek-V3 adopts a Transformer framework incorporating MLA and DeepSeekMoE.

  • Multi-Head Latent Attention (MLA): This reduces Key-Value (KV) cache size during inference. The compression is achieved through the equations:

    $\mathbf{c}_{t}^{KV} = W^{DKV} \mathbf{h}_{t}$

    $[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_{t}^{C} = W^{UK} \mathbf{c}_{t}^{KV}$

    $\mathbf{k}_{t}^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_{t})$

    $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_{t}^{R}]$

    $[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_{t}^{C} = W^{UV} \mathbf{c}_{t}^{KV}$

    where:

    • $d$ is the embedding dimension.
    • $n_h$ is the number of attention heads.
    • $d_h$ is the dimension per head.
    • $\mathbf{h}_{t} \in \mathbb{R}^{d}$ is the attention input for the $t$-th token.
    • $\mathbf{c}_{t}^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values.
    • $d_c$ is the KV compression dimension.
    • $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix.
    • $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively.
    • $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix that produces the decoupled key carrying Rotary Positional Embedding (RoPE).
    • $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices.
    • $[\cdot;\cdot]$ denotes concatenation.
  • DeepSeekMoE: This architecture employs finer-grained experts and isolates some experts as shared ones. The FFN output $\mathbf{h}_{t}^{\prime}$ is computed as:

    $\mathbf{h}_{t}^{\prime} = \mathbf{u}_{t} + \sum_{i=1}^{N_{s}} \operatorname{FFN}^{(s)}_{i}\left( \mathbf{u}_{t} \right) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}^{(r)}_{i}\left( \mathbf{u}_{t} \right)$

    $g_{i,t} = \frac{g^{\prime}_{i,t}}{\sum_{j=1}^{N_r} g^{\prime}_{j,t}}$

    $g^{\prime}_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \mid 1 \le j \le N_r \}, K_{r}), \\ 0, & \text{otherwise}, \end{cases}$

    $s_{i,t} = \operatorname{Sigmoid}\left( \mathbf{u}_{t}^{T} \mathbf{e}_{i} \right)$

    Where:

    • $N_{s}$ and $N_r$ denote the numbers of shared experts and routed experts, respectively.
    • $\operatorname{FFN}^{(s)}_{i}(\cdot)$ and $\operatorname{FFN}^{(r)}_{i}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively.
    • $K_{r}$ denotes the number of activated routed experts.
    • $g_{i,t}$ is the gating value for the $i$-th expert.
    • $s_{i,t}$ is the token-to-expert affinity.
    • $\mathbf{e}_{i}$ is the centroid vector of the $i$-th routed expert.
    • $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.

    DeepSeek-V3 also incorporates an auxiliary-loss-free load-balancing strategy with a bias term $b_i$ for each expert (a minimal routing sketch follows this list):

    $g^{\prime}_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \operatorname{Topk}(\{ s_{j,t} + b_j \mid 1 \le j \le N_r \}, K_{r}), \\ 0, & \text{otherwise}. \end{cases}$

    A complementary sequence-wise balance loss is also used:

    $\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i$

    $f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathds{1}\left( s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \mid 1 \le j \le N_r \}, K_{r}) \right)$

    $s^{\prime}_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}$

    $P_i = \frac{1}{T} \sum_{t=1}^{T} s^{\prime}_{i,t}$

    Where:

    • $\alpha$ is a hyper-parameter.
    • $\mathds{1}(\cdot)$ denotes the indicator function.
    • $T$ denotes the number of tokens in a sequence.

    Node-limited routing ensures that each token is sent to at most $M$ nodes.

  • Multi-Token Prediction (MTP): This extends the prediction scope to multiple future tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$.

    $\mathbf{h}_i^{\prime k} = M_k [\operatorname{RMSNorm}(\mathbf{h}_i^{k-1}); \operatorname{RMSNorm}(\operatorname{Emb}(t_{i+k}))]$

    $\mathbf{h}_{1:T-k}^{k} = \operatorname{TRM}_k(\mathbf{h}_{1:T-k}^{\prime k})$

    $P_{i+k+1}^{k} = \operatorname{OutHead}(\mathbf{h}_{i}^{k})$

    A cross-entropy loss $\mathcal{L}_{\text{MTP}}^{k}$ is computed for each prediction depth:

    $\mathcal{L}_{\text{MTP}}^{k} = \operatorname{CrossEntropy}(P_{2+k:T+1}^{k}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_i^k[t_i]$

    The overall MTP loss is:

    $\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\text{MTP}}^{k}$
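
A minimal NumPy sketch of the routing described above may make the data flow easier to follow: sigmoid affinities, bias-shifted top-$K_r$ selection, renormalization of the gate values, and a simple bias adjustment in the spirit of the auxiliary-loss-free strategy. The toy dimensions, helper names, and the step size gamma are illustrative assumptions, not the released implementation.

```python
import numpy as np

def route_tokens(u, experts, biases, K_r):
    """Sigmoid-affinity top-K routing following the gating equations above.

    u:       (T, d)   token hidden states for one sequence
    experts: (N_r, d) routed-expert centroid vectors e_i
    biases:  (N_r,)   auxiliary-loss-free bias terms b_i (used for selection only)
    Returns gating weights g of shape (T, N_r); zero for unselected experts.
    """
    s = 1.0 / (1.0 + np.exp(-u @ experts.T))          # s_{i,t} = Sigmoid(u_t^T e_i)
    biased = s + biases                                # selection uses s_{i,t} + b_i ...
    topk = np.argsort(-biased, axis=1)[:, :K_r]        # ... to pick K_r experts per token
    mask = np.zeros_like(s)
    np.put_along_axis(mask, topk, 1.0, axis=1)
    g_prime = s * mask                                 # the gate value itself uses the unbiased s_{i,t}
    g = g_prime / g_prime.sum(axis=1, keepdims=True)   # normalize over the selected experts
    return g, mask

def update_biases(biases, mask, gamma=1e-3):
    """Illustrative bias adjustment (gamma is an assumed step size): push down the
    bias of overloaded experts and push up underloaded ones so future tokens spread out."""
    load = mask.mean(axis=0)                                 # observed load per expert
    target = mask.sum() / (mask.shape[0] * mask.shape[1])    # ideal uniform load, K_r / N_r
    return biases - gamma * np.sign(load - target)

# Toy usage with assumed sizes (not the model's real dimensions).
rng = np.random.default_rng(0)
T, d, N_r, K_r = 16, 32, 8, 2
u = rng.normal(size=(T, d))
experts = rng.normal(size=(N_r, d))
biases = np.zeros(N_r)
g, mask = route_tokens(u, experts, biases, K_r)
biases = update_biases(biases, mask)
```

Because the bias only shifts which experts are selected and never scales the gate values, routing can be balanced without the gradient interference a large auxiliary loss would introduce, which is the motivation for the auxiliary-loss-free strategy.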

Infrastructures

  • Compute Clusters: The model was trained on a cluster of 2048 NVIDIA H800 GPUs.
  • Training Framework: The HAI-LLM framework supports 16-way Pipeline Parallelism (PP), 64-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism (DP). The DualPipe algorithm is used for efficient pipeline parallelism and overlaps computation and communication. Efficient cross-node all-to-all communication kernels are developed to utilize InfiniBand (IB) and NVLink bandwidths.
  • FP8 Training: A mixed-precision framework utilizing the FP8 data format was used. Fine-grained quantization uses tile-wise grouping of $1 \times N_c$ elements or block-wise grouping of $N_c \times N_c$ elements (see the quantization sketch after this list).
  • Inference and Deployment: The deployment strategy separates prefilling and decoding stages. Redundant experts are used to achieve load balancing among different experts in the MoE part.
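
The fine-grained grouping mentioned in the FP8 Training item can be sketched as follows. This is a simulation only: NumPy has no FP8 dtype, so the cast is emulated by per-tile scaling and clipping, and the tile width n_c = 128 is an assumed value, since the summary does not state $N_c$.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in the E4M3 format

def quantize_tilewise(x, n_c=128):
    """Simulated 1 x n_c tile-wise quantization.

    Each contiguous group of n_c elements along the last axis gets its own
    scaling factor, so a single outlier only degrades its own tile rather
    than the whole tensor. The FP8 cast itself is only emulated here.
    """
    rows, cols = x.shape
    assert cols % n_c == 0, "columns must be a multiple of the tile width"
    tiles = x.reshape(rows, cols // n_c, n_c)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)                         # guard against all-zero tiles
    q = np.clip(tiles / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)   # "FP8" payload (emulated)
    return q, scales

def dequantize_tilewise(q, scales, shape):
    """Recover an approximation of the original tensor from tiles and scales."""
    return (q * scales).reshape(shape)

# Round-trip on random activations
x = np.random.default_rng(0).normal(size=(4, 512)).astype(np.float32)
q, s = quantize_tilewise(x)
x_hat = dequantize_tilewise(q, s, x.shape)
```

Keeping a separate scale per tile or block is what lets the scheme tolerate FP8's narrow dynamic range without falling back to a single tensor-wide scaling factor.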

Pre-Training

The pre-training corpus was optimized with mathematical and programming samples and expanded multilingual coverage. The training corpus consists of 14.8T tokens. The tokenizer employs byte-level BPE with a vocabulary of 128K tokens. The number of Transformer layers was set to 61, and the hidden dimension was set to 7168. MLA uses 128 attention heads with a per-head dimension of 128. The KV compression dimension $d_c$ is 512, and the query compression dimension $d_c^{\prime}$ is 1536.
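
A back-of-the-envelope comparison using these dimensions shows why MLA's KV compression matters at inference time. The decoupled RoPE key dimension $d_h^R$ is not stated in this summary, so the value 64 below is an assumption for illustration.

```python
# Per-token KV-cache size (in cached elements), standard MHA vs. MLA,
# using the dimensions quoted above; d_h_rope = 64 is an assumed value.
n_layers, n_heads, d_head, d_c, d_h_rope = 61, 128, 128, 512, 64

mha_cache = n_layers * 2 * n_heads * d_head   # full keys and values per layer
mla_cache = n_layers * (d_c + d_h_rope)       # compressed latent + shared RoPE key per layer

print(mha_cache, mla_cache, round(mha_cache / mla_cache, 1))
# 1998848 35136 56.9  -> roughly a 57x smaller cache per token under these assumptions
```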

Long Context Extension

YaRN was applied for context extension in two phases to expand the context window from 4K to 32K and then to 128K.

Post-Training

Instruction-tuning datasets were curated to include 1.5M instances across multiple domains. Reasoning data was generated using an internal DeepSeek-R1 model, and non-reasoning data was generated using DeepSeek-V2.5. For RL, a rule-based Reward Model (RM) and a model-based RM were employed, along with Group Relative Policy Optimization (GRPO).
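
As a rough sketch of how GRPO turns reward-model scores into a training signal without a separate value model, the group-relative advantage for a group of sampled responses can be computed as below; the clipping and KL-regularization terms of the full objective are omitted, and the toy reward values are illustrative.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response to the same prompt is
    scored against the mean and standard deviation of its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, a group of four sampled responses scored by the reward model
group_rewards = [0.1, 0.7, 0.4, 0.9]
print(grpo_advantages(group_rewards))  # positive for above-average responses, negative otherwise
```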

Evaluations

The base model was evaluated on benchmarks including MMLU, HellaSwag, PIQA, TriviaQA, GSM8K, HumanEval, and C-Eval. The chat model was evaluated on IFEval, FRAMES, LongBench v2, GPQA, SimpleQA, SWE-Bench Verified and LiveCodeBench. Ablation studies were conducted for the MTP strategy and the auxiliary-loss-free balancing strategy. The MTP strategy enhances model performance on most evaluation benchmarks. Batch-wise balancing imposes a flexible constraint, allowing experts to better specialize in different domains.

The team also offered AI hardware vendors suggestions on chip design for communication and compute hardware.

Conclusion and Future Directions

DeepSeek-V3 achieves state-of-the-art performance among open-source models and is competitive with closed-source models. The team plans to invest in research to improve training and inference efficiency, explore new architectures, enhance data quality, and expand reasoning capabilities.
