
DeepSeek-V3 Technical Report (2412.19437v2)

Published 27 Dec 2024 in cs.CL and cs.AI

Abstract: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) LLM with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Summary

  • The paper presents DeepSeek-V3, a 671B-parameter Mixture-of-Experts language model that activates only 37B parameters per token for efficiency.
  • The paper employs innovative techniques including Multi-Head Latent Attention and DeepSeekMoE to reduce inference cost and balance load without auxiliary loss.
  • The paper demonstrates strong benchmark performance through pre-training on 14.8 trillion tokens and advanced strategies such as Multi-Token Prediction and Reinforcement Learning.

The paper presents DeepSeek-V3, a Mixture-of-Experts (MoE) LLM with 671B total parameters, where 37B parameters are activated for each token. The model is designed for efficient inference and training, achieved through Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were previously validated in DeepSeek-V2. Key innovations include an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model was pre-trained on 14.8 trillion tokens and further refined through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). The training costs totaled 2.788M H800 GPU hours.

Architecture

DeepSeek-V3 adopts a Transformer framework incorporating MLA and DeepSeekMoE.

  • Multi-Head Latent Attention (MLA): This reduces Key-Value (KV) cache size during inference. The compression is achieved through the equations:

    $\mathbf{c}_{t}^{KV} = W^{DKV} \mathbf{h}_{t}$

    $[\mathbf{k}_{t,1}^{C}; \mathbf{k}_{t,2}^{C}; \ldots; \mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_{t}^{C} = W^{UK} \mathbf{c}_{t}^{KV}$

    $\mathbf{k}_{t}^{R} = \operatorname{RoPE}(W^{KR} \mathbf{h}_{t})$

    $\mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_{t}^{R}]$

    $[\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \ldots; \mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_{t}^{C} = W^{UV} \mathbf{c}_{t}^{KV}$

    where:

    • $d$ is the embedding dimension.
    • $n_h$ is the number of attention heads.
    • $d_h$ is the dimension per head.
    • $\mathbf{h}_{t} \in \mathbb{R}^{d}$ is the attention input for the $t$-th token.
    • $\mathbf{c}_{t}^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values.
    • $d_c$ is the KV compression dimension.
    • $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix.
    • $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively.
    • $W^{KR} \in \mathbb{R}^{d_h^R \times d}$ is the matrix that produces the decoupled key carrying Rotary Positional Embedding (RoPE).
    • $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices.
    • $[\cdot;\cdot]$ denotes concatenation.
  • DeepSeekMoE: This architecture employs finer-grained experts and isolates some experts as shared ones. The FFN output $\mathbf{h}_{t}^{\prime}$ is computed as:

    $\mathbf{h}_{t}^{\prime} = \mathbf{u}_{t} + \sum_{i=1}^{N_{s}} \operatorname{FFN}^{(s)}_{i}\left( \mathbf{u}_{t} \right) + \sum_{i=1}^{N_r} g_{i,t} \operatorname{FFN}^{(r)}_{i}\left( \mathbf{u}_{t} \right)$

    $g_{i,t} = \frac{g^{\prime}_{i,t}}{\sum_{j=1}^{N_r} g^{\prime}_{j,t}}$

    $g^{\prime}_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \mid 1 \le j \le N_r \}, K_{r}), \\ 0, & \text{otherwise}, \end{cases}$

    $s_{i,t} = \operatorname{Sigmoid}\left( \mathbf{u}_{t}^{T} \mathbf{e}_{i} \right)$

    Where:

    • $N_{s}$ and $N_r$ denote the numbers of shared experts and routed experts, respectively.
    • $\operatorname{FFN}^{(s)}_{i}(\cdot)$ and $\operatorname{FFN}^{(r)}_{i}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively.
    • $K_{r}$ denotes the number of activated routed experts.
    • $g_{i,t}$ is the gating value for the $i$-th expert.
    • $s_{i,t}$ is the token-to-expert affinity.
    • $\mathbf{e}_{i}$ is the centroid vector of the $i$-th routed expert.
    • $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts.

    DeepSeek-V3 also incorporates an auxiliary-loss-free load-balancing strategy with a bias term $b_i$ for each expert (a minimal routing sketch follows this list):

    $g^{\prime}_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} + b_i \in \operatorname{Topk}(\{ s_{j,t} + b_j \mid 1 \le j \le N_r \}, K_{r}), \\ 0, & \text{otherwise}. \end{cases}$

    A complementary sequence-wise balance loss is also used:

    $\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i$

    $f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathds{1}\left( s_{i,t} \in \operatorname{Topk}(\{ s_{j,t} \mid 1 \le j \le N_r \}, K_{r}) \right)$

    $s^{\prime}_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}$

    $P_i = \frac{1}{T} \sum_{t=1}^{T} s^{\prime}_{i,t}$

    Where:

    • $\alpha$ is a hyper-parameter.
    • $\mathds{1}(\cdot)$ denotes the indicator function.
    • $T$ denotes the number of tokens in a sequence.

    Node-limited routing ensures that each token is sent to at most $M$ nodes.

  • Multi-Token Prediction (MTP): This extends the prediction scope to multiple future tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\operatorname{TRM}_k(\cdot)$, and a projection matrix $M_k \in \mathbb{R}^{d \times 2d}$.

    $\mathbf{h}_i^{\prime k} = M_k [\operatorname{RMSNorm}(\mathbf{h}_i^{k-1}); \operatorname{RMSNorm}(\operatorname{Emb}(t_{i+k}))]$

    $\mathbf{h}_{1:T-k}^{k} = \operatorname{TRM}_k(\mathbf{h}_{1:T-k}^{\prime k})$

    $P_{i+k+1}^{k} = \operatorname{OutHead}(\mathbf{h}_{i}^{k})$

    A cross-entropy loss $\mathcal{L}_{\text{MTP}}^{k}$ is computed for each prediction depth:

    $\mathcal{L}_{\text{MTP}}^{k} = \operatorname{CrossEntropy}(P_{2+k:T+1}^{k}, t_{2+k:T+1}) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P_i^k[t_i]$

    The overall MTP loss is:

    $\mathcal{L}_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_{\text{MTP}}^{k}$
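
A minimal NumPy sketch of the routing described above may make the data flow easier to follow: sigmoid affinities, bias-shifted top-$K_r$ selection, renormalization of the gate values, and a simple bias adjustment in the spirit of the auxiliary-loss-free strategy. The toy dimensions, helper names, and the step size gamma are illustrative assumptions, not the released implementation.

```python
import numpy as np

def route_tokens(u, experts, biases, K_r):
    """Sigmoid-affinity top-K routing following the gating equations above.

    u:       (T, d)   token hidden states for one sequence
    experts: (N_r, d) routed-expert centroid vectors e_i
    biases:  (N_r,)   auxiliary-loss-free bias terms b_i (used for selection only)
    Returns gating weights g of shape (T, N_r); zero for unselected experts.
    """
    s = 1.0 / (1.0 + np.exp(-u @ experts.T))          # s_{i,t} = Sigmoid(u_t^T e_i)
    biased = s + biases                                # selection uses s_{i,t} + b_i ...
    topk = np.argsort(-biased, axis=1)[:, :K_r]        # ... to pick K_r experts per token
    mask = np.zeros_like(s)
    np.put_along_axis(mask, topk, 1.0, axis=1)
    g_prime = s * mask                                 # the gate value itself uses the unbiased s_{i,t}
    g = g_prime / g_prime.sum(axis=1, keepdims=True)   # normalize over the selected experts
    return g, mask

def update_biases(biases, mask, gamma=1e-3):
    """Illustrative bias adjustment (gamma is an assumed step size): push down the
    bias of overloaded experts and push up underloaded ones so future tokens spread out."""
    load = mask.mean(axis=0)                                 # observed load per expert
    target = mask.sum() / (mask.shape[0] * mask.shape[1])    # ideal uniform load, K_r / N_r
    return biases - gamma * np.sign(load - target)

# Toy usage with assumed sizes (not the model's real dimensions).
rng = np.random.default_rng(0)
T, d, N_r, K_r = 16, 32, 8, 2
u = rng.normal(size=(T, d))
experts = rng.normal(size=(N_r, d))
biases = np.zeros(N_r)
g, mask = route_tokens(u, experts, biases, K_r)
biases = update_biases(biases, mask)
```

Because the bias only shifts which experts are selected and never scales the gate values, routing can be balanced without the gradient interference a large auxiliary loss would introduce, which is the motivation for the auxiliary-loss-free strategy.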

Infrastructures

  • Compute Clusters: The model was trained on a cluster of 2048 NVIDIA H800 GPUs.
  • Training Framework: The HAI-LLM framework supports 16-way Pipeline Parallelism (PP), 64-way Expert Parallelism (EP), and ZeRO-1 Data Parallelism (DP). The DualPipe algorithm is used for efficient pipeline parallelism and overlaps computation and communication. Efficient cross-node all-to-all communication kernels are developed to utilize InfiniBand (IB) and NVLink bandwidths.
  • FP8 Training: A mixed-precision framework utilizing the FP8 data format was used. Fine-grained quantization uses tile-wise grouping of $1 \times N_c$ elements or block-wise grouping of $N_c \times N_c$ elements (see the quantization sketch after this list).
  • Inference and Deployment: The deployment strategy separates prefilling and decoding stages. Redundant experts are used to achieve load balancing among different experts in the MoE part.
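
The fine-grained grouping mentioned in the FP8 Training item can be sketched as follows. This is a simulation only: NumPy has no FP8 dtype, so the cast is emulated by per-tile scaling and clipping, and the tile width n_c = 128 is an assumed value, since the summary does not state $N_c$.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in the E4M3 format

def quantize_tilewise(x, n_c=128):
    """Simulated 1 x n_c tile-wise quantization.

    Each contiguous group of n_c elements along the last axis gets its own
    scaling factor, so a single outlier only degrades its own tile rather
    than the whole tensor. The FP8 cast itself is only emulated here.
    """
    rows, cols = x.shape
    assert cols % n_c == 0, "columns must be a multiple of the tile width"
    tiles = x.reshape(rows, cols // n_c, n_c)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)                         # guard against all-zero tiles
    q = np.clip(tiles / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)   # "FP8" payload (emulated)
    return q, scales

def dequantize_tilewise(q, scales, shape):
    """Recover an approximation of the original tensor from tiles and scales."""
    return (q * scales).reshape(shape)

# Round-trip on random activations
x = np.random.default_rng(0).normal(size=(4, 512)).astype(np.float32)
q, s = quantize_tilewise(x)
x_hat = dequantize_tilewise(q, s, x.shape)
```

Keeping a separate scale per tile or block is what lets the scheme tolerate FP8's narrow dynamic range without falling back to a single tensor-wide scaling factor.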

Pre-Training

The pre-training corpus was optimized with mathematical and programming samples and expanded multilingual coverage. The training corpus consists of 14.8T tokens. The tokenizer employs byte-level BPE with a vocabulary of 128K tokens. The number of Transformer layers was set to 61, and the hidden dimension was set to 7168. MLA uses 128 attention heads with a per-head dimension of 128. The KV compression dimension $d_c$ is 512, and the query compression dimension $d_c^{\prime}$ is 1536.
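
A back-of-the-envelope comparison using these dimensions shows why MLA's KV compression matters at inference time. The decoupled RoPE key dimension $d_h^R$ is not stated in this summary, so the value 64 below is an assumption for illustration.

```python
# Per-token KV-cache size (in cached elements), standard MHA vs. MLA,
# using the dimensions quoted above; d_h_rope = 64 is an assumed value.
n_layers, n_heads, d_head, d_c, d_h_rope = 61, 128, 128, 512, 64

mha_cache = n_layers * 2 * n_heads * d_head   # full keys and values per layer
mla_cache = n_layers * (d_c + d_h_rope)       # compressed latent + shared RoPE key per layer

print(mha_cache, mla_cache, round(mha_cache / mla_cache, 1))
# 1998848 35136 56.9  -> roughly a 57x smaller cache per token under these assumptions
```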

Long Context Extension

YaRN was applied for context extension in two phases to expand the context window from 4K to 32K and then to 128K.

Post-Training

Instruction-tuning datasets were curated to include 1.5M instances across multiple domains. Reasoning data was generated using an internal DeepSeek-R1 model, and non-reasoning data was generated using DeepSeek-V2.5. For RL, a rule-based Reward Model (RM) and a model-based RM were employed, along with Group Relative Policy Optimization (GRPO).
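
As a rough sketch of how GRPO turns reward-model scores into a training signal without a separate value model, the group-relative advantage for a group of sampled responses can be computed as below; the clipping and KL-regularization terms of the full objective are omitted, and the toy reward values are illustrative.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response to the same prompt is
    scored against the mean and standard deviation of its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, a group of four sampled responses scored by the reward model
group_rewards = [0.1, 0.7, 0.4, 0.9]
print(grpo_advantages(group_rewards))  # positive for above-average responses, negative otherwise
```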

Evaluations

The base model was evaluated on benchmarks including MMLU, HellaSwag, PIQA, TriviaQA, GSM8K, HumanEval, and C-Eval. The chat model was evaluated on IFEval, FRAMES, LongBench v2, GPQA, SimpleQA, SWE-Bench Verified and LiveCodeBench. Ablation studies were conducted for the MTP strategy and the auxiliary-loss-free balancing strategy. The MTP strategy enhances model performance on most evaluation benchmarks. Batch-wise balancing imposes a flexible constraint, allowing experts to better specialize in different domains.

The team also offered AI hardware vendors suggestions on chip design for communication and compute hardware.

Conclusion and Future Directions

DeepSeek-V3 achieves state-of-the-art performance among open-source models and is competitive with closed-source models. The team plans to invest in research to improve training and inference efficiency, explore new architectures, enhance data quality, and expand reasoning capabilities.
