FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism (2412.01523v3)

Published 2 Dec 2024 in cs.DC and cs.LG

Abstract: Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity. In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.

Summary

  • The paper introduces FlexSP, a novel system using data-centric and heterogeneity-adaptive sequence parallelism to significantly improve LLM training efficiency by managing datasets with varied sequence lengths.
  • FlexSP models training optimization as a MILP problem, using sequence bucketing and micro-batch chunking to assign sequences to groups and minimize time and memory costs.
  • Evaluations show FlexSP outperforms state-of-the-art systems by up to 1.98x on diverse LLMs and datasets, demonstrating reduced communication overhead from its adaptive strategy.

The paper "Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training" introduces FlexSP, a novel system designed to enhance the training efficiency of LLM training by addressing the challenges posed by the heterogeneity of sequence lengths in training corpora. Current sequence parallelism (SP) methods assume a uniform sequence length, which leads to inefficiency and under-utilization of resources when dealing with the long-tail distribution of sequence lengths commonly found in LLM training data. FlexSP introduces a heterogeneity-adaptive SP method that dynamically adjusts parallelism strategies based on sequence length variation in each training step.

The key contributions of FlexSP are:

  • Heterogeneous SP Groups: FlexSP adaptively forms multiple heterogeneous SP groups, allowing sequences of varying lengths to be processed with different parallelism degrees, balancing memory consumption and training efficiency.
  • Time-Balanced Sequence Assignment: FlexSP optimizes the assignment of sequences to SP groups to minimize processing time and balance workload across groups, preventing faster groups from waiting for slower ones.

To achieve these innovations, the paper models the problem as a joint optimization that maximizes training efficiency by determining how to form the heterogeneous SP groups and how to assign each sequence to the most suitable group. This is transformed into a Mixed-Integer Linear Programming (MILP) problem by modeling computation cost, communication cost, and memory consumption. To decrease the complexity of the MILP problem, a sequence bucketing algorithm based on dynamic programming is utilized. Additionally, a micro-batch chunking algorithm is devised to minimize the total training time of all micro-batches when a batch contains more sequences than can be processed at once.
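The summary does not give the bucketing algorithm's details, so the following is a minimal sketch of a dynamic-programming bucketer under one plausible objective: partition the sorted sequence lengths into at most a fixed number of contiguous buckets so that the total padding waste is minimized. The function name `bucket_sequences` and the padding-waste objective are illustrative assumptions rather than the paper's exact formulation; the point is only to show how dynamic programming can reduce per-sequence decisions to per-bucket decisions.

```python
from functools import lru_cache
from typing import List, Tuple


def bucket_sequences(lengths: List[int], max_buckets: int) -> List[Tuple[int, int]]:
    """Partition sorted sequence lengths into at most max_buckets contiguous buckets.

    Assumed objective (illustrative, not necessarily the paper's): minimize the
    total padding waste, i.e. the sum over sequences of (bucket_max - seq_len).
    Returns (start, end) index ranges into the sorted length list.
    """
    assert max_buckets >= 1, "need at least one bucket"
    xs = sorted(lengths)
    n = len(xs)
    prefix = [0] * (n + 1)  # prefix sums of the sorted lengths
    for i, v in enumerate(xs):
        prefix[i + 1] = prefix[i] + v

    def waste(i: int, j: int) -> int:
        # Padding waste if xs[i:j] form one bucket padded to its maximum, xs[j - 1].
        return xs[j - 1] * (j - i) - (prefix[j] - prefix[i])

    @lru_cache(maxsize=None)
    def dp(i: int, b: int) -> float:
        # Minimum waste to cover xs[i:] using at most b buckets.
        if i == n:
            return 0
        if b == 0:
            return float("inf")
        return min(waste(i, j) + dp(j, b - 1) for j in range(i + 1, n + 1))

    # Recover bucket boundaries by replaying the optimal decisions.
    buckets, i, b = [], 0, max_buckets
    while i < n:
        for j in range(i + 1, n + 1):
            if waste(i, j) + dp(j, b - 1) == dp(i, b):
                buckets.append((i, j))
                i, b = j, b - 1
                break
    return buckets
```

Once sequences are grouped this way, the MILP only needs to make assignment decisions at bucket granularity rather than per sequence, which is the source of the complexity reduction.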

The FlexSP solver consists of two major components: the sequence blaster and the parallelism planner. The sequence blaster chunks the sequences into micro-batches, and the parallelism planner solves for the optimal plan of each micro-batch to minimize its execution time. The overall workflow first calculates the minimum feasible number of micro-batches, then invokes the sequence blaster to blast the sequences into micro-batches. Subsequently, for each micro-batch, sequences are grouped into buckets and the parallelism planner optimizes the sequence parallelism strategy for that micro-batch by solving the MILP problem.
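Pieced together from this description, the per-step solver loop might look roughly like the sketch below. The micro-batch estimate and the greedy blaster are simplified assumptions, not the paper's exact algorithms, and `plan_micro_batch` refers to the illustrative MILP planner sketched after the cost-model section further down.

```python
import math


def flexsp_solve(seq_lens, num_gpus, mem_budget, m_token=1e-4):
    """Illustrative sketch of the per-step FlexSP solver workflow.

    The micro-batch estimate and the greedy blaster below are simplified
    assumptions, not the paper's exact algorithms.
    """
    # 1. Minimum feasible number of micro-batches: total activation memory
    #    must fit in the aggregate budget of all GPUs (simplified estimate).
    total_tokens = sum(seq_lens)
    num_micro_batches = max(1, math.ceil(total_tokens * m_token / (num_gpus * mem_budget)))

    # 2. Sequence blaster (simplified): longest-first greedy split that
    #    balances tokens across the micro-batches.
    micro_batches = [[] for _ in range(num_micro_batches)]
    loads = [0] * num_micro_batches
    for s in sorted(seq_lens, reverse=True):
        idx = loads.index(min(loads))
        micro_batches[idx].append(s)
        loads[idx] += s

    # 3. Parallelism planner: solve the MILP for every micro-batch (FlexSP
    #    additionally buckets sequences first to shrink the problem).
    return [
        plan_micro_batch(mb, sp_degrees=[1, 2, 4, 8],
                         bandwidths=[4.0, 4.0, 4.0, 2.0],
                         num_gpus=num_gpus, mem_budget=mem_budget)
        for mb in micro_batches
    ]
```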

The MILP problem is formulated as:

$$\arg\min_{\boldsymbol{m} \in \{0,1\}^P;\; \boldsymbol{A} \in \{0,1\}^{K\times P}} \; C$$

where $C$ is the minimized time, $\boldsymbol{m}$ is the group selection vector, and $\boldsymbol{A}$ is the sequence assignment matrix. The variables are defined as:

  • $P$: The number of virtual SP groups
  • $K$: The number of sequences

This is subject to several constraints:

$$\text{Time}(\{s_k, A_{k,p}\}; d_p) \leq C, \quad \forall p \in [1, P]$$

$$\text{Memory}(\{s_k, A_{k,p}\}; d_p) \leq E, \quad \forall p \in [1, P]$$

$$\sum\nolimits_p d_p \times m_p \leq N$$

$$\sum\nolimits_k A_{k,p} \leq m_p \times K, \quad \forall p \in [1, P]$$

$$\sum\nolimits_p A_{k,p} = 1, \quad \forall k \in [1, K]$$

The variables are defined as:

  • $s_k$: The sequence length of $\mathcal{S}_k$
  • $A_{k,p}$: A binary variable that indicates whether $\mathcal{S}_k$ is assigned to $\mathcal{G}_p$
  • $d_p$: The SP degree of $\mathcal{G}_p$
  • $E$: The device memory budget
  • $N$: The number of available GPUs

The time and memory constraints are defined as:

$$\text{Memory}(\{s_k, A_{k,p}\}; d_p) = \sum_{k} \frac{A_{k,p}\, s_k}{d_p}\, \text{M}_{token} + \text{M}_{ms}$$

$$\text{Time}(\{s_k, A_{k,p}\}; d_p) = \text{T}_{comp} + \text{T}_{comm}$$

The variables are defined as:

  • $\text{M}_{token}$: The activation memory cost of each token
  • $\text{M}_{ms}$: The memory consumption of model states
  • $\text{T}_{comp}$: The computation time
  • $\text{T}_{comm}$: The communication time

$$\text{T}_{comp}(\{s_k, A_{k,p}\}; d_p) = \frac{1}{d_p} \sum_{k} A_{k,p} \left(\alpha_1 s_k^2 + \alpha_2 s_k\right) + \beta_1$$

$$\text{T}_{comm}(\{s_k, A_{k,p}\}; d_p) = \frac{1}{d_p v_p} \sum_{k} A_{k,p}\, \alpha_3 s_k + \beta_2$$

The variables are defined as:

  • $\alpha_1, \alpha_2, \beta_1$: The coefficients of the $\alpha$-$\beta$ model for computation cost
  • $\alpha_3, \beta_2$: The coefficients of the communication cost model, given by profiling
  • $v_p$: The interconnect bandwidth of the devices within $\mathcal{G}_p$
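To make the formulation concrete, the sketch below transcribes it almost verbatim into PuLP, an off-the-shelf MILP modeling library. The choice of PuLP and every coefficient value are assumptions for illustration only; the paper designs its own efficient solver, and in FlexSP the $\alpha$, $\beta$, and memory coefficients come from profiling.

```python
import pulp


def plan_micro_batch(seq_lens, sp_degrees, bandwidths, num_gpus, mem_budget,
                     a1=1e-9, a2=1e-6, b1=1e-3, a3=1e-6, b2=1e-3,
                     m_token=1e-4, m_ms=10.0):
    """Solve the SP-group selection / sequence-assignment MILP for one micro-batch.

    seq_lens:   s_k for the K sequences in the micro-batch
    sp_degrees: d_p for the P candidate (virtual) SP groups
    bandwidths: v_p, intra-group interconnect bandwidth of each candidate group
    The alpha/beta and memory coefficients are placeholder values.
    """
    K, P = len(seq_lens), len(sp_degrees)
    prob = pulp.LpProblem("flexsp_planner", pulp.LpMinimize)

    C = pulp.LpVariable("C", lowBound=0)  # makespan to minimize
    m = [pulp.LpVariable(f"m_{p}", cat="Binary") for p in range(P)]
    A = [[pulp.LpVariable(f"A_{k}_{p}", cat="Binary") for p in range(P)]
         for k in range(K)]

    prob += C  # objective: minimize the slowest group's time

    for p, (d, v) in enumerate(zip(sp_degrees, bandwidths)):
        t_comp = pulp.lpSum(A[k][p] * (a1 * seq_lens[k] ** 2 + a2 * seq_lens[k])
                            for k in range(K)) / d + b1
        t_comm = pulp.lpSum(A[k][p] * a3 * seq_lens[k] for k in range(K)) / (d * v) + b2
        prob += t_comp + t_comm <= C                                # time constraint
        prob += (pulp.lpSum(A[k][p] * seq_lens[k] for k in range(K)) / d * m_token
                 + m_ms <= mem_budget)                              # memory constraint
        prob += pulp.lpSum(A[k][p] for k in range(K)) <= m[p] * K   # only selected groups

    prob += pulp.lpSum(d * m[p] for p, d in enumerate(sp_degrees)) <= num_gpus
    for k in range(K):
        prob += pulp.lpSum(A[k][p] for p in range(P)) == 1          # each sequence placed once

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # Assumes an optimal solution was found; map each sequence to its group.
    assignment = {k: next(p for p in range(P) if A[k][p].value() > 0.5)
                  for k in range(K)}
    return assignment, C.value()
```

The candidate list of SP degrees enumerates the "virtual" groups; the solver jointly picks which of them to instantiate (via $m_p$, subject to the GPU budget $N$) and routes every sequence to exactly one selected group (via $A_{k,p}$).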

FlexSP is implemented on top of PyTorch, utilizing NCCL as the communication backend and flash-attn for the attention kernel. The system supports hot switching of sequence parallelism to manage varied parallelism strategies and maintains an NCCL group pool for efficient communication group management. The solving and training phases are disaggregated to facilitate overlapping, with the solver running on CPUs and the executor on GPUs.
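The summary does not detail how the NCCL group pool is organized, but the basic idea of caching and reusing process groups across steps, so that switching SP strategies does not repeatedly pay the cost of creating NCCL communicators, could look roughly like this sketch; the class name and caching policy are assumptions, not FlexSP's actual code.

```python
import torch.distributed as dist


class ProcessGroupPool:
    """Cache torch.distributed process groups keyed by their rank set, so that
    hot-switching SP strategies between steps reuses existing NCCL communicators
    instead of rebuilding them (an illustrative sketch)."""

    def __init__(self):
        self._pool = {}

    def get_group(self, ranks):
        key = tuple(sorted(ranks))
        if key not in self._pool:
            # new_group must be called collectively with identical arguments,
            # so every rank should request groups in the same order.
            self._pool[key] = dist.new_group(ranks=list(key), backend="nccl")
        return self._pool[key]


# Usage (assumes dist.init_process_group("nccl") was called earlier):
# pool = ProcessGroupPool()
# sp_group = pool.get_group([0, 1, 2, 3])
# dist.all_to_all_single(output, input, group=sp_group)
```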

Experimental results on GPT-series LLMs with varying sizes (7B, 13B, 30B) and datasets (GitHub, CommonCrawl, Wikipedia) show that FlexSP outperforms state-of-the-art training systems like Megatron-LM and DeepSpeed by up to 1.98×. The performance gains are attributed to the reduction in communication overhead achieved by the flexible sequence parallelism strategy, which adapts to the long-tail distribution of sequence lengths. Ablation studies validate the efficacy of the dynamic programming sequence bucketing and the sequence sorting mechanism.