- The paper introduces FlexSP, a novel system using data-centric and heterogeneity-adaptive sequence parallelism to significantly improve LLM training efficiency by managing datasets with varied sequence lengths.
- FlexSP models training optimization as a MILP problem, using sequence bucketing and micro-batch chunking to assign sequences to groups and minimize time and memory costs.
- Evaluations show FlexSP outperforms state-of-the-art systems by up to 1.98× on diverse LLMs and datasets, demonstrating reduced communication overhead from its adaptive strategy.
The paper "Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training" introduces FlexSP, a novel system designed to enhance the training efficiency of LLM training by addressing the challenges posed by the heterogeneity of sequence lengths in training corpora. Current sequence parallelism (SP) methods assume a uniform sequence length, which leads to inefficiency and under-utilization of resources when dealing with the long-tail distribution of sequence lengths commonly found in LLM training data. FlexSP introduces a heterogeneity-adaptive SP method that dynamically adjusts parallelism strategies based on sequence length variation in each training step.
The key contributions of FlexSP are:
- Heterogeneous SP Groups: FlexSP adaptively forms multiple heterogeneous SP groups, allowing sequences of varying lengths to be processed with different parallelism degrees, balancing memory consumption and training efficiency.
- Time-Balanced Sequence Assignment: FlexSP optimizes the assignment of sequences to SP groups to minimize processing time and balance workload across groups, preventing faster groups from waiting for slower ones.
To achieve these innovations, the paper models the problem as a joint optimization that maximizes training efficiency by determining how to form the heterogeneous SP groups and how to assign each sequence to the most suitable group. This is transformed into a Mixed-Integer Linear Programming (MILP) problem by modeling computation cost, communication cost, and memory consumption. To reduce the complexity of the MILP problem, a sequence bucketing algorithm based on dynamic programming groups similar-length sequences together. Additionally, a micro-batch chunking algorithm is devised to minimize the total training time across all micro-batches when a global batch contains too many sequences to be processed at once.
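As a rough illustration of the bucketing idea, the sketch below uses dynamic programming to split sorted sequence lengths into a fixed number of contiguous buckets while minimizing padded-token waste. This is a simplified stand-in, not the paper's exact algorithm: the bucket count, the padding-waste objective, and the function name are assumptions.

```python
def dp_bucketing(lengths, num_buckets):
    """Split sorted sequence lengths into contiguous buckets via dynamic
    programming, minimizing total padding (each bucket pads to its longest
    sequence). A sketch of the bucketing idea, not FlexSP's exact objective."""
    lengths = sorted(lengths)
    K = len(lengths)
    num_buckets = min(num_buckets, K)
    prefix = [0] * (K + 1)
    for i, s in enumerate(lengths):
        prefix[i + 1] = prefix[i] + s

    # waste(i, j): padding if lengths[i:j] share one bucket (padded to lengths[j-1])
    def waste(i, j):
        return lengths[j - 1] * (j - i) - (prefix[j] - prefix[i])

    INF = float("inf")
    # cost[b][j]: minimum waste covering the first j sequences with b buckets
    cost = [[INF] * (K + 1) for _ in range(num_buckets + 1)]
    split = [[-1] * (K + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0
    for b in range(1, num_buckets + 1):
        for j in range(1, K + 1):
            for i in range(j):
                if cost[b - 1][i] + waste(i, j) < cost[b][j]:
                    cost[b][j] = cost[b - 1][i] + waste(i, j)
                    split[b][j] = i

    # Recover bucket boundaries by walking the split table backwards.
    buckets, j = [], K
    for b in range(num_buckets, 0, -1):
        i = split[b][j]
        buckets.append(lengths[i:j])
        j = i
    return list(reversed(buckets))

# Example: long-tail lengths grouped into 3 buckets.
print(dp_bucketing([128, 256, 300, 512, 2048, 4096, 8192], 3))
```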
FlexSP's solver consists of two major components: the sequence blaster and the parallelism planner. The sequence blaster chunks the sequences into micro-batches, and the parallelism planner solves for the optimal plan of each micro-batch so as to minimize its execution time. The overall workflow first calculates the minimum feasible number of micro-batches, then invokes the sequence blaster to blast the sequences into micro-batches. Subsequently, for each micro-batch, sequences are grouped into buckets, and the parallelism planner optimizes the sequence parallelism strategy for that micro-batch by solving the MILP problem below.
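The sketch below shows a simplified stand-in for the sequence blaster: a greedy longest-processing-time heuristic that assigns each sequence to the micro-batch with the smallest accumulated workload. FlexSP's actual blaster is an optimization algorithm rather than this heuristic, and the quadratic workload proxy is an assumption.

```python
def sequence_blaster(seq_lengths, num_micro_batches):
    """Greedy longest-processing-time chunking: assign each sequence to the
    micro-batch with the smallest accumulated workload. A simplified stand-in
    for FlexSP's sequence blaster."""
    # Attention cost grows roughly quadratically with sequence length,
    # so use s^2 as a per-sequence workload proxy (an assumption).
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    batches = [[] for _ in range(num_micro_batches)]
    loads = [0.0] * num_micro_batches
    for i in order:
        j = min(range(num_micro_batches), key=lambda b: loads[b])
        batches[j].append(seq_lengths[i])
        loads[j] += seq_lengths[i] ** 2
    return batches

# Example: chunk a long-tail batch into 2 micro-batches.
print(sequence_blaster([8192, 4096, 2048, 512, 300, 256, 128], 2))
```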
The MILP problem is formulated as:
$$\underset{m \in \{0,1\}^{P},\; A \in \{0,1\}^{K \times P}}{\arg\min}\; C$$
where $C$ is the objective to be minimized (the execution time of the slowest group), $m$ is the group selection vector, and $A$ is the sequence assignment matrix. The variables are defined as:
- $P$: The number of virtual SP groups
- $K$: The number of sequences
This is subject to several constraints:
$$\text{Time}(\{s_k, A_{k,p}\};\, d_p) \le C, \quad \forall p \in [1, P]$$
$$\text{Memory}(\{s_k, A_{k,p}\};\, d_p) \le E, \quad \forall p \in [1, P]$$
$$\sum_p d_p \times m_p \le N$$
$$\sum_k A_{k,p} \le m_p \times K, \quad \forall p \in [1, P]$$
$$\sum_p A_{k,p} = 1, \quad \forall k \in [1, K]$$
The variables are defined as:
- $s_k$: The sequence length of sequence $S_k$
- $A_{k,p}$: A binary variable that indicates whether $S_k$ is assigned to group $G_p$
- $d_p$: The SP degree of $G_p$
- $E$: The device memory budget
- $N$: The number of available GPUs
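A minimal sketch of how this MILP could be expressed with an off-the-shelf solver is shown below (here PuLP, which the paper does not mandate). The candidate SP degrees, bandwidths, memory budget, and cost coefficients are made-up illustrative values, and the Time/Memory expressions follow the linear cost model defined after this block.

```python
import pulp

# Illustrative inputs (not from the paper): sequence lengths, candidate
# virtual SP groups with their degrees d_p, and profiled coefficients.
s = [8192, 4096, 2048, 512, 256]          # s_k
d = [8, 4, 2, 1]                          # d_p for each virtual SP group
v = [100e9, 100e9, 100e9, 100e9]          # intra-group bandwidth v_p
N, E = 8, 80e9                            # available GPUs, memory budget
a1, a2, b1 = 1e-9, 1e-6, 1e-3             # computation alpha-beta coefficients
a3, b2 = 4e3, 1e-3                        # communication coefficients
M_token, M_ms = 1e6, 20e9                 # per-token activation, model states
K, P = len(s), len(d)

prob = pulp.LpProblem("flexsp_planner", pulp.LpMinimize)
C = pulp.LpVariable("C", lowBound=0)
m = [pulp.LpVariable(f"m_{p}", cat="Binary") for p in range(P)]
A = [[pulp.LpVariable(f"A_{k}_{p}", cat="Binary") for p in range(P)] for k in range(K)]

prob += C  # minimize the slowest group's execution time
for p in range(P):
    t_comp = pulp.lpSum(A[k][p] * (a1 * s[k] ** 2 + a2 * s[k]) for k in range(K)) / d[p] + b1
    t_comm = pulp.lpSum(A[k][p] * a3 * s[k] for k in range(K)) / (d[p] * v[p]) + b2
    mem = pulp.lpSum(A[k][p] * s[k] for k in range(K)) * M_token / d[p] + M_ms
    prob += t_comp + t_comm <= C                               # Time(...) <= C
    prob += mem <= E                                           # Memory(...) <= E
    prob += pulp.lpSum(A[k][p] for k in range(K)) <= m[p] * K  # only selected groups get work
prob += pulp.lpSum(d[p] * m[p] for p in range(P)) <= N         # GPU budget
for k in range(K):
    prob += pulp.lpSum(A[k][p] for p in range(P)) == 1         # each sequence assigned once

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected SP degrees:", [d[p] for p in range(P) if m[p].value() == 1])
```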
The time and memory constraints are defined as:
$$\text{Memory}(\{s_k, A_{k,p}\};\, d_p) = \sum_k \frac{A_{k,p}\, s_k}{d_p} M_{token} + M_{ms}$$
$$\text{Time}(\{s_k, A_{k,p}\};\, d_p) = T_{comp} + T_{comm}$$
The variables are defined as:
- $M_{token}$: The activation memory cost of each token
- $M_{ms}$: The memory consumption of model states
- $T_{comp}$: The computation time
- $T_{comm}$: The communication time
$$T_{comp}(\{s_k, A_{k,p}\};\, d_p) = \frac{1}{d_p} \sum_k A_{k,p}\left(\alpha_1 s_k^2 + \alpha_2 s_k\right) + \beta_1$$
$$T_{comm}(\{s_k, A_{k,p}\};\, d_p) = \frac{1}{d_p\, v_p} \sum_k A_{k,p}\, \alpha_3 s_k + \beta_2$$
The variables are defined as:
- $\alpha_1, \alpha_2, \beta_1$: The coefficients of the $\alpha$-$\beta$ model for computation cost
- $\alpha_3, \beta_2$: Communication cost coefficients given by profiling
- $v_p$: The interconnect bandwidth of the devices within $G_p$
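The paper obtains these coefficients by profiling. As a hedged illustration of how such an $\alpha$-$\beta$ model could be fitted from measurements, the sketch below uses a least-squares fit over made-up profiled points; the sample numbers and the `t_comp` helper are assumptions, not FlexSP's profiler.

```python
import numpy as np

# Fit the computation model T = a1*s^2 + a2*s + b1 from profiled
# (sequence length, measured time) pairs. The sample numbers are made up;
# FlexSP derives its coefficients by profiling the actual hardware.
profiled_lengths = np.array([1024, 2048, 4096, 8192, 16384], dtype=float)
profiled_times = np.array([0.011, 0.025, 0.062, 0.180, 0.590])  # seconds

X = np.stack([profiled_lengths ** 2, profiled_lengths, np.ones_like(profiled_lengths)], axis=1)
(a1, a2, b1), *_ = np.linalg.lstsq(X, profiled_times, rcond=None)

def t_comp(seq_lens, sp_degree):
    """Estimated computation time for sequences sharded across sp_degree GPUs."""
    seq_lens = np.asarray(seq_lens, dtype=float)
    return float(np.sum(a1 * seq_lens ** 2 + a2 * seq_lens) / sp_degree + b1)

print(t_comp([8192, 4096], sp_degree=4))
```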
FlexSP is implemented on top of PyTorch, using NCCL as the communication backend and flash-attn for the attention kernel. The system supports hot switching of sequence parallelism to handle the varied parallelism strategies across steps, and maintains an NCCL group pool for efficient communication group management (see the sketch below). The solving and training phases are disaggregated so they can overlap, with the solver running on CPUs and the executor on GPUs.
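A minimal sketch of the group-pool idea using `torch.distributed` follows; the caching class and its method names are illustrative assumptions, not FlexSP's actual interface.

```python
import torch.distributed as dist

class ProcessGroupPool:
    """Cache NCCL process groups keyed by their rank tuple, so switching
    sequence-parallel layouts between steps reuses existing groups instead of
    paying group-creation cost every time. Illustrative sketch, not FlexSP's API."""
    def __init__(self):
        self._pool = {}

    def get(self, ranks):
        key = tuple(sorted(ranks))
        if key not in self._pool:
            # new_group must be called collectively with identical arguments on all ranks.
            self._pool[key] = dist.new_group(ranks=list(key), backend="nccl")
        return self._pool[key]

# Usage inside a training step (after dist.init_process_group has been called):
# pool = ProcessGroupPool()
# sp_group = pool.get([0, 1, 2, 3])   # SP group of degree 4 for long sequences
# dist.all_gather(gathered_chunks, local_chunk, group=sp_group)
```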
Experimental results on GPT-series LLMs with varying sizes (7B, 13B, 30B) and datasets (GitHub, CommonCrawl, Wikipedia) show that FlexSP outperforms state-of-the-art training systems like Megatron-LM and DeepSpeed by up to 1.98×. The performance gains are attributed to the reduction in communication overhead achieved by the flexible sequence parallelism strategy, which adapts to the long-tail distribution of sequence lengths. Ablation studies validate the efficacy of the dynamic programming sequence bucketing and the sequence sorting mechanism.