SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training (2504.14519v1)

Published 20 Apr 2025 in cs.LG and cs.AI

Abstract: Pipeline Parallelism (PP) serves as a crucial technique for training LLMs, owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, primarily due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency. To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward (1F1B) schedule. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves (1) near-zero memory overhead and (2) minimal pipeline bubbles simultaneously. The effectiveness of SlimPipe has been proven by thorough testing with diverse model architectures, context window sizes, and SlimPipe-specific configurations. For example, on the Llama 70B model, compared to state-of-the-art methods, SlimPipe significantly boosts the Model FLOPs Utilization (MFU) to up to $1.57\times$ for a context length of 512K. More notably, for a context length of 2048K, it maintains over 45% utilization on 256 NVIDIA Hopper 80GB GPUs, while other approaches either suffer significant performance drops or fail entirely due to memory constraints.

Summary

  • The paper introduces a pipeline parallelism scheme that slices input sequences to drastically reduce activation memory compared to conventional methods.
  • It implements dynamic attention workload redistribution to balance computations, significantly mitigating pipeline bubbles in long-context training.
  • Results demonstrate up to 1.57x speedup and high model FLOPs utilization, enabling training of ultra-long context LLMs without memory scaling issues.

This paper introduces SlimPipe, a novel pipeline parallelism (PP) technique designed to efficiently train LLMs with very long contexts, addressing the key limitations of existing methods: high activation memory consumption and significant pipeline bubbles (idle time).

Challenges with Existing PP Methods for Long Contexts:

  1. Activation Memory Bottleneck: While standard PP reduces memory for model states (parameters, gradients, optimizer states) by splitting layers across $p$ devices, the per-device activation memory remains constant. This is because schedules like 1F1B accumulate activations for $p$ in-flight microbatches during the pipeline warm-up/steady phase, which becomes prohibitive for long sequences where activation memory dominates.
  2. Pipeline Bubbles:
    • Warm-up/Cool-down Bubbles: Standard schedules like 1F1B require filling and draining the pipeline, causing idle time. This is especially detrimental with long contexts, where the number of microbatches ($m$) is often small for a fixed global token count.
    • Imbalance Bubbles: Approaches like ZB-V, designed to reduce bubbles by splitting the backward pass into activation-gradient and weight-gradient parts, suffer from imbalance because the computation times for the forward pass ($T_f$), activation gradient ($T_b$), and weight gradient ($T_w$) are inherently unequal in Transformers (especially in attention layers, where $T_w \approx 0$ and $T_b > T_f$), creating new bubbles. Causal attention introduces further imbalance.
  3. Vocabulary Layer Imbalance: The large vocabulary projection layer, typically assigned to the last PP device, creates significant compute and memory load imbalance.

SlimPipe Approach:

SlimPipe introduces a fine-grained PP approach based on uniform sequence slicing combined with a modified 1F1B schedule.

  1. Uniform Sequence Slicing & Slice-wise 1F1B:
    • Each input sequence within a microbatch is divided into $n$ equal-length slices ($n$ is typically a multiple of the pipeline depth $p$).
    • The pipeline operates on these slices using a 1F1B schedule adapted for slices. Backward passes for slices within a sequence occur in reverse order of their forward passes (LIFO).
    • This drastically reduces activation memory. Instead of storing $p$ full microbatches, SlimPipe only needs to store activations equivalent to roughly one microbatch (split into $p$ parts across devices), approaching $M_a/p$, where $M_a$ is the activation size of one microbatch. The peak memory becomes $(1 + 2(p-1)/n)\frac{M_a}{p}$ (see the first sketch after this list).
    • Warm-up/cool-down bubbles are significantly reduced (by a factor related to $n$) because the pipeline fills faster with smaller slices.
  2. Attention Context Exchange (Workload Redistribution):
    • Uniform slicing causes load imbalance because causal attention makes later slices (attending to more previous tokens via KV cache) computationally heavier.
    • SlimPipe dynamically redistributes the attention workload. Devices processing computationally heavier slices send their query (Q) and parts of their key/value (K/V) cache to devices processing lighter slices.
    • These lighter-loaded devices perform the partial attention computation and send the results back, which are then combined using online softmax (see the merge sketch after this list).
    • This balances the computational load across devices processing different slices at the same time, effectively eliminating imbalance bubbles caused by causal attention.
  3. Vocabulary Parallelism:
    • To address the output-layer imbalance, SlimPipe parallelizes the final vocabulary projection (GEMM) across all $p$ pipeline devices (column-wise parallelism).
    • Input hidden states are broadcast to all PP devices. Each computes a shard of the vocabulary logits.
    • The cross-entropy loss is calculated directly on the sharded logits, synchronizing only small per-token scalar statistics and avoiding communication of the full logits. This balances both compute and memory for the output layer (see the sharded-loss sketch after this list).
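
The memory claim in point 1 can be made concrete with a few lines of arithmetic. The sketch below evaluates the peak-activation expression $(1 + 2(p-1)/n)\frac{M_a}{p}$ for some hypothetical configurations and compares it with the roughly $M_a$ per device that classic 1F1B retains ($p$ in-flight microbatches, each contributing $M_a/p$); the specific values of $p$, $n$, and $M_a$ are illustrative assumptions, not numbers from the paper.

```python
# Illustrative only: compares SlimPipe's peak activation memory per device,
# (1 + 2(p - 1)/n) * M_a / p, with classic 1F1B, which keeps p in-flight
# microbatches of M_a / p each (~= M_a, independent of p).

def slimpipe_peak_activation(m_a: float, p: int, n: int) -> float:
    """Peak per-device activation memory under SlimPipe's slice-wise 1F1B."""
    return (1 + 2 * (p - 1) / n) * m_a / p

def classic_1f1b_peak_activation(m_a: float, p: int) -> float:
    """Classic 1F1B: p accumulated microbatches x (M_a / p) of layers each."""
    return p * (m_a / p)

if __name__ == "__main__":
    m_a = 100.0  # nominal activation size of one microbatch (arbitrary units)
    for p, n in [(4, 8), (8, 16), (16, 32)]:
        slim = slimpipe_peak_activation(m_a, p, n)
        base = classic_1f1b_peak_activation(m_a, p)
        print(f"p={p:2d} n={n:2d}  SlimPipe={slim:6.2f}  1F1B={base:6.2f}")
```

With $p = 8$ and $n = 16$, for example, the factor is $1 + 14/16 = 1.875$, so the peak is about $0.23\,M_a$ per device versus roughly $M_a$ under classic 1F1B.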
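
The combine step in point 2 is the standard online-softmax (log-sum-exp) merge used by memory-efficient attention: a device that computed attention over part of the KV cache returns its partial output together with per-row softmax statistics, and the owner rescales and sums the pieces. The snippet below is a minimal single-process sketch of that merge in plain PyTorch (causal masking and the actual communication are omitted); the function names and shapes are assumptions for illustration, not SlimPipe's implementation.

```python
# Merge two partial attention results (computed over disjoint KV chunks)
# into the exact full-attention output via log-sum-exp rescaling.
import torch

def partial_attention(q, k, v):
    """Attention over one KV chunk; also returns per-row log-sum-exp."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5  # [tq, tk]
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)    # [tq, 1]
    out = torch.softmax(scores, dim=-1) @ v                # [tq, d]
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into the full-attention output."""
    lse = torch.logaddexp(lse_a, lse_b)                    # merged normalizer
    return out_a * torch.exp(lse_a - lse) + out_b * torch.exp(lse_b - lse)

if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(4, 8)                                  # queries of a "heavy" slice
    k, v = torch.randn(16, 8), torch.randn(16, 8)          # accumulated KV cache
    out0, lse0 = partial_attention(q, k[:8], v[:8])        # computed locally
    out1, lse1 = partial_attention(q, k[8:], v[8:])        # (conceptually) computed remotely
    merged = merge(out0, lse0, out1, lse1)
    reference, _ = partial_attention(q, k, v)              # full attention for comparison
    print(torch.allclose(merged, reference, atol=1e-5))    # True
```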
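
Point 3's loss can likewise be computed without ever gathering the full logits: each device holds a vocabulary shard, and only per-token scalars (the running max, the sum of exponentials, and the target's logit) need to be reduced. The sketch below simulates this in a single process with a loop over shards; in a real run the sums would be all-reduces across the $p$ pipeline devices, and every name here is an illustrative assumption rather than Megatron-LM's API.

```python
# Single-process simulation of cross-entropy over vocabulary-sharded logits.
# Only per-token scalars cross shard boundaries, never the logits themselves.
import torch

def sharded_cross_entropy(logit_shards, targets, shard_starts):
    # 1) global per-token max, for numerical stability (would be an all-reduce)
    global_max = torch.stack([s.max(dim=-1).values for s in logit_shards]).max(dim=0).values
    # 2) global per-token sum of exp(logit - max) (would be an all-reduce)
    sum_exp = sum(torch.exp(s - global_max[:, None]).sum(dim=-1) for s in logit_shards)
    # 3) logit of the target token, found on exactly one shard per token
    target_logit = torch.zeros_like(global_max)
    for start, shard in zip(shard_starts, logit_shards):
        local = targets - start
        hit = (local >= 0) & (local < shard.shape[-1])
        target_logit[hit] = shard[hit, local[hit]]
    # cross-entropy = log-sum-exp - target logit
    return (global_max + sum_exp.log() - target_logit).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    tokens, vocab, p = 6, 32, 4
    logits = torch.randn(tokens, vocab)
    targets = torch.randint(0, vocab, (tokens,))
    shards = list(logits.chunk(p, dim=-1))
    starts = [i * (vocab // p) for i in range(p)]
    print(sharded_cross_entropy(shards, targets, starts))
    print(torch.nn.functional.cross_entropy(logits, targets))  # should match
```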

Implementation Details:

  • Implemented on Megatron-LM using PyTorch.
  • Uses memory-efficient attention (cuDNN SDPA, similar to FlashAttention), SwiGLU, and RMSNorm implementations.
  • Chunked KV Cache: Stores the KV cache as a list of slice-sized tensors to avoid memory fragmentation (see the sketch after this list).
  • Early Key-Value Exchange: Optimizes context exchange by overlapping communication with computation of previous slices.
  • Commutated Context Parallelism: Adapts standard Context Parallelism (CP) to work with the KV cache by communicating queries/outputs instead of repeatedly communicating K/V.
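
The chunked KV cache above can be as simple as a Python list of slice-sized tensors that is appended to as slices arrive and concatenated only when attention needs the accumulated context, so no large contiguous buffer is ever resized. The class below is a minimal sketch of that idea under these assumptions; it is not the paper's or Megatron-LM's implementation.

```python
# Minimal sketch of a chunked KV cache: one fixed-size tensor per processed
# slice, so the cache grows by appending chunks instead of reallocating
# one ever-larger contiguous buffer.
import torch

class ChunkedKVCache:
    def __init__(self):
        self.k_chunks: list[torch.Tensor] = []
        self.v_chunks: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Store the K/V of one newly processed slice."""
        self.k_chunks.append(k)
        self.v_chunks.append(v)

    def full(self) -> tuple[torch.Tensor, torch.Tensor]:
        """Concatenate along the sequence dimension when attention needs it."""
        return torch.cat(self.k_chunks, dim=0), torch.cat(self.v_chunks, dim=0)

    def pop(self) -> None:
        """Drop the most recent slice, e.g. once its backward pass (LIFO) is done."""
        self.k_chunks.pop()
        self.v_chunks.pop()

if __name__ == "__main__":
    cache = ChunkedKVCache()
    for _ in range(3):                              # three slices of 8 tokens each
        cache.append(torch.randn(8, 16), torch.randn(8, 16))
    k, v = cache.full()
    print(k.shape, v.shape)                         # torch.Size([24, 16]) twice
```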

Evaluation Results:

  • Memory Reduction: Demonstrated near-inverse scaling of total memory (activations + model states) with the PP size $p$, unlike classic PP, where activation memory stays constant.
  • Bubble Reduction & Efficiency: Achieved significantly higher Model FLOPs Utilization (MFU) compared to Megatron-LM (interleaved 1F1B) and DeepSpeed (Ulysses), with speedups up to 1.57x, especially pronounced for longer contexts and larger models/GPU counts.
  • Scalability: Showed better scalability than baselines, avoiding OOM errors or configuration limitations encountered by others in challenging long-context scenarios.
  • Ultra-Long Context Training: Successfully trained models like Llama 70B with 2048K context and Mixtral 8x7B with 4096K context on 256 GPUs, maintaining high MFU (>40-45%) when combined with activation offloading.
  • Direct PP Scheme Comparison: Outperformed GPipe, 1F1B, Interleaved 1F1B, ZB-V, and V-Half in both MFU and memory efficiency for long-context training of Llama 13B.

Conclusion:

SlimPipe offers a memory-thrifty and efficient pipeline parallelism solution for training LLMs with extremely long contexts. By introducing uniform sequence slicing, attention context exchange for load balancing, and vocabulary parallelism, it simultaneously minimizes activation memory overhead and pipeline bubbles, enabling significant throughput improvements and pushing the boundaries of trainable context lengths.