- The paper introduces SPPO, a novel framework improving long-sequence LLM training efficiency through adaptive offloading and pipeline scheduling to optimize memory and resource use.
- Experimental validation shows SPPO achieves up to a 3.38x throughput improvement over Megatron-LM and DeepSpeed, and enables training a 7B model on sequences of up to 4M tokens with as few as 128 GPUs.
- Compared to Megatron-LM and DeepSpeed, SPPO demonstrates superior scalability and memory efficiency, avoiding out-of-memory errors and handling significantly longer sequences with fixed resources.
SPPO (Adaptive Sequence Pipeline Parallel Offloading) is presented as a novel framework for improving the training efficiency of LLMs on long sequences through optimized memory and computational resource utilization. It addresses the limitations of existing approaches by partitioning each sequence into subsequences and combining adaptive offloading with adaptive pipeline scheduling.
Core Methodologies of SPPO
The SPPO framework combines two main techniques, each backed by two supporting mechanisms:
- Adaptive Offloading: Overlaps the offloading of activations to CPU memory with GPU computation by dynamically adjusting the offloading ratio for each subsequence.
  - Sequence-Aware Offloading: Mitigates imbalanced memory allocation across subsequences by adaptively computing the offload ratio, maximizing the overlap between the CPU offloading of activations from the (i−1)-th subsequence and the GPU computation of the i-th subsequence (a ratio sketch follows this list).
  - Two-Level Activation Management: Retains "skeletal" activations (those with high access frequency) in GPU memory while offloading activations with lower access frequency to the CPU, reserving CPU-GPU bandwidth for the less frequently accessed activations and minimizing offloading overhead (see the selection sketch below).
- Adaptive Pipeline Scheduling: Optimizes GPU utilization and pipeline efficiency.
  - Heuristic Solver: Determines the number of subsequences that balances GPU utilization against pipeline efficiency. Combined with the offloading ratio from adaptive offloading, it improves both memory efficiency and training efficiency while avoiding cross-node sequence and pipeline parallelism; the optimal workload is roughly 2K to 16K tokens per layer (a solver sketch follows this list).
  - Multiplexed Sequence Partitioning: Applies a finer-grained partition to adjacent subsequences to reduce pipeline bubbles that the heuristic solver cannot fully eliminate, without sacrificing memory efficiency. The scheme consists of Left-SP, Steady, and Right-SP phases: Left-SP processes edge subsequences near pipeline boundaries, Steady handles central subsequences with standard pipeline parallelism, and Right-SP manages terminal computations.
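The offload ratio in sequence-aware offloading can be read as the largest fraction of a subsequence's activations whose transfer to host memory can be hidden behind the next subsequence's computation. The sketch below illustrates that idea under simple assumptions (a measured per-subsequence compute time and a fixed CPU-GPU bandwidth); the function and variable names are illustrative, not taken from the SPPO implementation.

```python
def offload_ratio(compute_time_s: float,
                  activation_bytes: float,
                  host_bandwidth_bytes_per_s: float) -> float:
    """Fraction of a subsequence's activations that can be offloaded to the CPU
    while the *next* subsequence is still computing, so the transfer stays
    hidden behind GPU work. Illustrative heuristic, not the paper's exact rule."""
    # Time needed to move the full activation set to host memory.
    full_transfer_time_s = activation_bytes / host_bandwidth_bytes_per_s
    # If the transfer finishes before the overlapping compute does, everything
    # can be offloaded; otherwise offload only the fraction that stays hidden.
    return min(1.0, compute_time_s / full_transfer_time_s)

# Example: 40 ms of overlapping compute, 6 GB of activations, ~20 GB/s
# effective device-to-host bandwidth -> roughly 13% can be hidden.
ratio = offload_ratio(compute_time_s=0.040,
                      activation_bytes=6e9,
                      host_bandwidth_bytes_per_s=20e9)
print(f"offload ratio = {ratio:.2f}")
```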
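Two-level activation management comes down to a selection policy: tensors the backward pass reads often stay resident on the GPU, everything else becomes an offload candidate. A minimal sketch of such a policy, assuming each saved activation is tagged with an access count (the `Activation` class, threshold, and names are hypothetical, not SPPO's exact rule):

```python
from dataclasses import dataclass

@dataclass
class Activation:
    name: str
    num_bytes: int
    access_count: int   # how many times the backward pass will read this tensor

def partition_activations(activations, keep_threshold: int = 2):
    """Keep frequently reused ("skeletal") activations on the GPU and mark the
    rest for asynchronous CPU offload, so transfer bandwidth is spent only on
    tensors that are read rarely."""
    keep_on_gpu, offload_to_cpu = [], []
    for act in activations:
        (keep_on_gpu if act.access_count >= keep_threshold else offload_to_cpu).append(act)
    return keep_on_gpu, offload_to_cpu

acts = [
    Activation("layernorm_out", 64 << 20, access_count=3),     # reused often -> stays resident
    Activation("attention_scores", 512 << 20, access_count=1),  # read once -> offload candidate
]
gpu_resident, cpu_offloaded = partition_activations(acts)
```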
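The heuristic solver's trade-off is that more subsequences shrink pipeline bubbles but make each per-layer workload smaller, which hurts GPU utilization. The following sketch picks a subsequence count that keeps every subsequence inside the 2K to 16K tokens-per-layer band cited above; the function name and the tie-breaking rule are assumptions, not the paper's exact solver.

```python
def choose_num_subsequences(seq_len_tokens: int,
                            min_tokens: int = 2_048,
                            max_tokens: int = 16_384) -> int:
    """Pick a subsequence count so each subsequence carries between min_tokens
    and max_tokens per layer: large enough to keep the GPU busy, small enough
    to limit pipeline bubbles and activation memory."""
    candidates = [n for n in range(1, seq_len_tokens // min_tokens + 1)
                  if min_tokens <= seq_len_tokens // n <= max_tokens]
    if not candidates:
        return max(1, seq_len_tokens // max_tokens)
    # Assumed tie-break: prefer more subsequences (finer pipeline granularity)
    # as long as each one stays inside the efficient workload band.
    return max(candidates)

# Example: a 1M-token sequence split so each subsequence holds 2K tokens.
print(choose_num_subsequences(1_048_576))   # -> 512
```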
Experimental Validation
The effectiveness of SPPO was demonstrated on GPT-style Transformer models ranging from 7B to 65B parameters, with sequence lengths up to 4M tokens, on a cluster of 128 NVIDIA Ampere GPUs. The results show that SPPO achieves up to a 3.38x throughput improvement over Megatron-LM and DeepSpeed, and that it can efficiently train a 7B LLM with sequence lengths of 1M, 2M, and 4M tokens on only 32, 64, and 128 GPUs, respectively.
Comparative Analysis with Megatron-LM and DeepSpeed
SPPO's performance was rigorously compared against existing frameworks like Megatron-LM and DeepSpeed, highlighting its advantages in throughput, scalability, and memory efficiency.
- Throughput and Scalability: SPPO consistently outperforms DeepSpeed-Ulysses and Megatron-LM across sequence lengths and model sizes, while both baselines run into out-of-memory (OOM) errors on ultra-long sequences.
- Memory Efficiency: SPPO avoids the costly activation recomputation that Megatron-LM requires under the same parallelism strategy. For example, SPPO supports a GPT-65B model with sequence lengths up to 1024K tokens, whereas DeepSpeed-Ulysses is limited to 512K and Megatron-LM to 768K.
- Model-Specific Performance: For GPT-7B, SPPO achieves a 1.13x to 1.29x speedup over Megatron-LM. For GPT-65B, SPPO achieves speedups of 3.38x and 3.12x over Megatron-LM at sequence lengths of 600K and 640K, respectively, while DeepSpeed-Ulysses fails to scale GPT-65B beyond 512K tokens.
- Sequence Length Scalability: With fixed GPU resources, SPPO scales to substantially longer sequences than DeepSpeed and Megatron-LM. It achieves this by decoupling sequence partitioning from the number of attention heads (the limit imposed by head-based partitioning) and by flexibly adjusting its parallelism strategy.
In summary, SPPO introduces adaptive offloading and pipeline scheduling techniques to optimize memory and computational resource utilization in LLM training. Experimental results confirm that SPPO achieves significant throughput improvements, enhanced memory efficiency, and superior sequence length scalability compared to existing frameworks like Megatron-LM and DeepSpeed.