- The paper provides a theoretical analysis of GPU memory consumption during DeepSeek model training, detailing requirements under various distributed configurations, 3D parallelism, ZeRO, and activation recomputation.
- It quantifies parameter memory for DeepSeek-v3's MLA and MoE components under specific parallel setups and demonstrates memory reduction achieved through different ZeRO strategies.
- The analysis includes activation memory estimation under various parallelizations and recomputation policies, also accounting for practical overheads like memory fragmentation and communication buffers.
The paper provides a theoretical analysis of GPU memory consumption during the training of DeepSeek models, specifically DeepSeek-v2 [liu2024deepseekv2] and DeepSeek-v3 [liu2024deepseekv3]. The analysis clarifies device-level memory requirements associated with various distributed training configurations, focusing on factors such as micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. Note that the training policies discussed are not representative of DeepSeek's official configurations but serve to provide a deeper understanding of memory dynamics in large-scale MoE model training.
DeepSeek-v3's architecture comprises 61 layers, each including two RMSNorm operations, an MLA block, and a linear layer. The first three transformer layers use conventional feed-forward networks (FFN), while the remaining 58 layers implement MoE linear layers. The analysis primarily focuses on memory consumption during training using FP16/BF16 formats.
The paper delineates the dimensional specifications of the parameter matrices within the MLA and MoE components, with particular attention to the MoE layers' configurations due to their significant impact on memory consumption.
A detailed quantitative analysis of each component's parameters per layer is provided, expressing the memory footprint in both MB and GB under FP16/BF16 precision. The aggregate parameter count and corresponding memory requirements for the MLA components are derived from the dimensional specifications. Unlike some transformer architectures, DeepSeek-v3 maintains separate parameters for the input token embedding matrix in the first layer and the output projection matrix in the final layer. The total parameter count is reported as 671 B, which at 2 bytes per parameter requires 1,280,000 MB (1,250 GB).
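As a quick sanity check on these totals, here is a minimal back-of-the-envelope sketch, assuming 2 bytes per parameter (BF16) and binary MB/GB units, which is what the reported 1,280,000 MB / 1,250 GB figures imply:

```python
# Back-of-the-envelope check of the reported parameter memory.
# Assumptions: 671B parameters, 2 bytes/param (BF16), MB/GB read as binary units.
TOTAL_PARAMS = 671e9
BYTES_PER_PARAM = 2

total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
print(f"parameter memory: {total_bytes / 2**20:,.0f} MB ({total_bytes / 2**30:,.0f} GB)")
# -> approximately 1,279,831 MB (1,250 GB), matching the reported figures
```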
The paper uses 16-way pipeline parallelism (PP16), the same as DeepSeek's official configuration, as a case study to quantify the peak memory utilization per GPU device, identifying the pipeline stage with the maximum parameter volume. Stages 1-14 share an identical architecture and collectively constitute the largest parameter footprint, with each of these stages requiring 86 GB of static parameter storage.
The device-level static parameter partitioning is examined, governed by Tensor Parallelism (TP), Expert Parallelism (EP), and Expert Tensor Parallelism (ETP). The parallel configuration used is DP 32, TP 2, PP 16, EP 8, ETP 1, and EDP 8. The implementation of TP follows the method established in Megatron-LM [narayanan2021efficient], with the transformer block incorporating MLA and MoE components.
In the PP16@TP2 parallel configuration, both RMSNorm components within each layer are replicated across all TP ranks; the total memory footprint per GPU for RMSNorm parameters amounts to 131,072 bytes. Following the Megatron-LM implementation of MLA, $W^{UQ}$, $W^{UK}$, $W^{UV}$, and $W^{O}$ are split across TP ranks, while $W^{DQ}$, $W^{DKV}$, $W^{QR}$, and $W^{KR}$ are replicated on each rank without TP partitioning; with the BF16 data type, the MLA memory requirement per GPU is 859,308,032 bytes. For the MoE component, under the PP16@EP8@ETP1 configuration, each stage contains 4 layers, and the 256 routed experts per layer are evenly distributed across 8 ranks, yielding 32 routed experts per rank; the shared expert is replicated across all ranks, and with ETP 1 the parameter matrices of individual experts are not partitioned by TP. The MoE memory requirement per GPU is 11,641,290,752 bytes.
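The split above can be reproduced with a short sketch. The model dimensions below are assumed from the public DeepSeek-v3 configuration (hidden size 7168, 128 heads of per-head dimension 128, RoPE dimension 64, query/key-value compression dimensions 1536/512, 256 routed plus 1 shared expert with intermediate size 2048), and matching the 131,072-byte RMSNorm figure additionally requires counting the two latent RMSNorms inside MLA, which is our assumption:

```python
# Sketch of the per-GPU static parameter accounting under PP16 @ TP2 @ EP8 @ ETP1.
# Model dimensions are assumed DeepSeek-v3 values; 2 bytes/param (BF16).
h, n_h, d_h, d_hr = 7168, 128, 128, 64
d_cq, d_c = 1536, 512
n_routed, n_shared, h_E = 256, 1, 2048
layers_per_stage, TP, EP = 4, 2, 8
BYTES = 2

# RMSNorm: the two block norms plus the q/kv latent norms, all replicated across TP ranks.
rmsnorm = 2 * h + d_cq + d_c

# MLA: W^{UQ}, W^{UK}, W^{UV}, W^{O} are TP-sharded; W^{DQ}, W^{DKV}, W^{QR}, W^{KR} replicated.
mla_split = d_cq * n_h * d_h + d_c * n_h * d_h * 2 + n_h * d_h * h
mla_replicated = h * d_cq + d_cq * n_h * d_hr + h * d_c + h * d_hr
mla = mla_replicated + mla_split // TP

# MoE: 256 routed experts spread over EP=8 ranks (32 per rank), shared expert and router
# replicated; ETP=1 means no TP split inside an expert (gate/up/down projections).
experts_per_rank = n_routed // EP + n_shared
moe = experts_per_rank * 3 * h * h_E + h * n_routed   # expert weights + router

for name, params in [("RMSNorm", rmsnorm), ("MLA", mla), ("MoE", moe)]:
    print(f"{name:8s}: {params * layers_per_stage * BYTES:>14,d} bytes/GPU")
# -> RMSNorm: 131,072 | MLA: 859,308,032 | MoE: 11,641,290,752
```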
DeepSpeed ZeRO [rasley2020deepspeed, rajbhandari2020zero] strategies (os, os+g, and os+g+params) are analyzed to show how they affect memory consumption per device, based on the parallel configuration described above. The baseline without ZeRO requires 11.64 GB for parameters, 23.3 GB for optimizer states, and 46.6 GB for gradients per device. Applying the ZeRO stages progressively shards these states across the data-parallel group, reducing per-device memory accordingly.
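To make the scaling concrete, here is a rough sketch of how the three ZeRO levels act on the reported baseline figures, assuming each sharded state is divided evenly across a data-parallel group of size `dp`; the choice of `dp = 8` below is illustrative, and which group (DP vs. EDP) applies to expert versus non-expert states is simplified here:

```python
# Rough ZeRO scaling of the reported per-device baseline (values in GB, as reported above).
# Assumption: each sharded state is split evenly across a data-parallel group of size `dp`.
params_gb, optim_gb, grads_gb = 11.64, 23.3, 46.6

def zero_per_device(stage: str, dp: int) -> float:
    p = params_gb / dp if stage == "os+g+params" else params_gb
    g = grads_gb  / dp if stage in ("os+g", "os+g+params") else grads_gb
    o = optim_gb  / dp                      # optimizer states are sharded at every ZeRO level
    return p + g + o

for stage in ("none", "os", "os+g", "os+g+params"):
    total = params_gb + grads_gb + optim_gb if stage == "none" else zero_per_device(stage, dp=8)
    print(f"{stage:12s}: {total:6.2f} GB/device")
```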
The activation memory analysis considers two baseline cases: no recomputation and full recomputation [korthikanti2023reducing]. The configuration parameters used in this case study include the micro batch size, sequence length, number of experts activated per token, number of routed experts in each MoE layer, sequence parallelism (SP), context parallelism (CP), and the activation recomputation (AC) policy.
The total activation size (in bytes) of the MLA component per layer, without any parallelization, is given by the following formula:
$$4bsh + 2bs(d_{cq} + d_c) + 4bs(d_h + d_h^r)\cdot n_h + 2bs(d_h \cdot n_h) + 5bn_hs^2 + 2bs(d_h \cdot n_h) + bsh$$

Where:
- $b$ is the micro batch size
- $s$ is the sequence length
- $h$ is the hidden dimension
- $d_{cq}$ is the query compression dimension
- $d_c$ is the key-value compression dimension
- $d_h$ is the dimension per head
- $d_h^r$ is the per-head dimension of q/k for RoPE
- $n_h$ is the number of attention heads
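Plugging representative values into this formula gives a per-layer estimate; the sketch below uses assumed DeepSeek-v3-like dimensions and an illustrative micro batch size and sequence length:

```python
# Per-layer MLA activation estimate from the formula above (bytes), without parallelism.
# The model dimensions are assumed DeepSeek-v3 values; b and s are illustrative choices.
def mla_activation_bytes(b, s, h=7168, n_h=128, d_h=128, d_hr=64, d_cq=1536, d_c=512):
    return (4 * b * s * h
            + 2 * b * s * (d_cq + d_c)
            + 4 * b * s * (d_h + d_hr) * n_h
            + 2 * b * s * (d_h * n_h)
            + 5 * b * n_h * s ** 2          # attention-score term, quadratic in s
            + 2 * b * s * (d_h * n_h)
            + b * s * h)

print(f"{mla_activation_bytes(b=1, s=4096) / 2**30:.2f} GiB per layer")
```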
The activation memory of the MLA component under the TP2@SP2@CP1 parallel strategy reveals significant differences between no recomputation and full recomputation. For MoE layers with a balanced load distribution, the average number of tokens processed by a single expert per MoE layer and micro batch is estimated as $E_{token} = \frac{N \cdot b \cdot s}{N_r}$, where $N$ is the number of experts activated per token and $N_r$ is the number of routed experts per layer. The per-layer activation memory of the MoE linear component under the SP2@EP8@ETP1 configuration without recomputation is $M_1^E = \frac{4bsh}{2} + 4bsN + 2bsN_r + 32\cdot(3E_{token}h + 8E_{token}h_E) + 1\cdot(3bsh + 8bsh_E)$, where $h_E$ denotes the expert intermediate dimension. With full recomputation, the per-layer footprint reduces to $M_2^E = bsh + 2bsN_r$, since the router outputs are retained to keep routing consistent.
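A corresponding sketch for the MoE terms, with the expert intermediate size $h_E$, the number of activated experts $N$, and the number of routed experts $N_r$ assumed to be 2048, 8, and 256 respectively (DeepSeek-v3-like values), and illustrative $b$ and $s$:

```python
# Per-layer MoE activation estimate under SP2 @ EP8 @ ETP1, following M1^E and M2^E above.
# Assumptions: N=8 activated experts per token, N_r=256 routed experts, expert hidden
# size h_E=2048; b and s are illustrative.
def moe_activation_bytes(b, s, h=7168, h_E=2048, N=8, N_r=256,
                         experts_per_rank=32, recompute=False):
    if recompute:
        return b * s * h + 2 * b * s * N_r               # M2^E: router outputs kept
    e_token = N * b * s / N_r                            # avg tokens per expert (balanced load)
    return (4 * b * s * h / 2 + 4 * b * s * N + 2 * b * s * N_r
            + experts_per_rank * (3 * e_token * h + 8 * e_token * h_E)
            + 1 * (3 * b * s * h + 8 * b * s * h_E))     # shared expert term

for rc in (False, True):
    print(f"recompute={rc}: {moe_activation_bytes(b=1, s=4096, recompute=rc) / 2**30:.2f} GiB")
```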
Finally, the paper notes that training memory estimation in practical implementations is affected by memory fragmentation (typically 5% to 30% overhead) and temporary communication buffers (0.8 GB to 2 GB per device).
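For completeness, a trivial adjustment of a theoretical estimate for these practical overheads (the base figure below is purely illustrative):

```python
# Simple adjustment of a theoretical per-device estimate for practical overheads.
# The 5-30% fragmentation range and 0.8-2 GB buffer range are the values quoted above;
# the base estimate is an arbitrary illustrative number.
base_gb = 60.0                                  # hypothetical theoretical estimate
low  = base_gb * 1.05 + 0.8                     # optimistic: 5% fragmentation, small buffers
high = base_gb * 1.30 + 2.0                     # pessimistic: 30% fragmentation, large buffers
print(f"practical estimate: {low:.1f} - {high:.1f} GB per device")
```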