PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization (2503.01328v2)

Published 3 Mar 2025 in cs.LG, cs.AI, and cs.DC

Abstract: Pipeline parallelism (PP) is widely used for training LLMs, yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through an empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments prove that the per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative than TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.

Summary

  • The paper introduces PipeOffload, a pipeline parallelism approach that offloads activation memory to the host to significantly improve the scalability of training large language models.
  • It analyzes the feasibility of full offload based on the compute-to-offload ratio and proposes a selective offload strategy for scenarios where full offload is not optimal.
  • Experiments show PipeOffload reduces activation memory to less than a quarter compared to standard methods while maintaining similar throughput and enabling better scaling.

This paper introduces PipeOffload, a novel pipeline parallelism (PP) approach designed to improve the scalability of training LLMs by addressing the activation memory bottleneck. The central idea revolves around strategically offloading activation memory from the GPU to the host, exploiting the inherent temporal separation between forward and backward passes in PP. The authors observe that a significant portion of activation memory can be offloaded with minimal overhead, and they introduce a selective offload strategy for cases where full offload is not feasible.

Key aspects of the approach and findings include:

  • Memory Offload Analysis: The paper analyzes the feasibility of fully offloading activation memory based on the ratio $k = T_o / T_c$, where
    • $T_o$ is the round-trip time for transferring activation memory between device and host, $T_o = \frac{3(6h + s)}{B_o}$, with $h$ the hidden size, $s$ the sequence length, and $B_o$ the PCI-E duplex bandwidth;
    • $T_c$ is the compute time for the forward and backward passes, $T_c = \frac{3(6h + s)}{B_c}$, with $B_c$ the GPU compute bandwidth.

    The analysis indicates that full offload is possible when $k \le 1$, which is often achievable for large models and long sequence lengths. A small numerical sketch of this feasibility check is given after this list.

  • Selective Offload Strategy: For scenarios where $k > 1$, the paper proposes a selective offload strategy that prioritizes offloading activations with longer lifespans (i.e., longer gaps between their forward and backward passes) to maximize the reduction in peak memory; a greedy sketch of this selection appears after this list. The effectiveness of selective offload also depends on the pipeline schedule, with uniform repeating strategies offering better memory reduction than interleaving strategies.
  • Pipeline Schedule Optimization: The paper explores various pipeline schedules to balance memory usage and throughput. It extends the interleaving strategy into a generalized form, introducing the GIS (Generalized Interleaved Schedule) and GIS-H schedules, which offer smooth memory reduction with minimal throughput loss, and it applies the zero-bubble strategy to reduce pipeline bubbles.
  • Implementation Details: The implementation focuses on minimizing offload overhead by employing direct recomputation on activation-heavy but computationally lightweight layers (e.g., GeLU), ensuring stable PCI-E bandwidth via a hardware-topology-aware strategy, and reducing host-side memory capacity overhead by using continuous buffers. The paper contrasts its single-stream approach to offloading and reloading with the separate-streams approach of prior work, citing reduced latency fluctuations; a minimal sketch of such an offload/reload path is included after this list.
  • Experimental Evaluation: The method is evaluated on GPT-3-like models using up to 32 NVIDIA A100 GPUs. The results demonstrate that PipeOffload achieves significant activation memory reduction compared to interleaved 1F1B schedules while maintaining similar throughput. In particular, PO-H reduces activation memory to less than a quarter of that required by interleaved 1F1B, and PO-F further minimizes activation memory. The paper also presents a strong scaling analysis of per-device activation memory, showing superior scaling for PO-H and PO-F.
  • Comparison with Tensor Parallelism: The paper compares pure PP using PipeOffload with hybrid parallelism (interleaved 1F1B combined with tensor parallelism). The results show a 12%-19% acceleration in training due to the elimination of tensor parallelism.
  • Related Work: The paper contrasts its conclusions with a recent work on activation memory offloading in PP, which found that activation offload causes significant overhead and should be avoided when possible. This paper instead argues that offload can be a free lunch in PP and that full activation offload is often feasible, making PP scalable.
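
To make the feasibility condition concrete, here is a minimal Python sketch of the ratio $k = T_o / T_c$ described under Memory Offload Analysis. The function names and argument conventions are illustrative assumptions, not part of the paper's released code.

```python
def offload_ratio(h, s, b_offload, b_compute):
    """k = T_o / T_c for a layer's activations.

    h          -- hidden size
    s          -- sequence length
    b_offload  -- B_o, effective PCI-E duplex bandwidth
    b_compute  -- B_c, effective GPU compute bandwidth
    B_o and B_c must be expressed in consistent units.
    """
    t_offload = 3 * (6 * h + s) / b_offload   # T_o: round-trip device<->host transfer time
    t_compute = 3 * (6 * h + s) / b_compute   # T_c: forward + backward compute time
    return t_offload / t_compute


def full_offload_feasible(h, s, b_offload, b_compute):
    """Full offload can hide behind compute when k <= 1."""
    return offload_ratio(h, s, b_offload, b_compute) <= 1.0
```

In practice, $B_o$ and $B_c$ would be measured on the target hardware rather than taken from datasheet figures.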
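The selective offload rule (offload the longest-lived activations first) can be sketched as a small greedy helper. The data structure, the memory-target threshold, and the approximation of peak memory as the sum of resident activations are simplifications introduced here for illustration; they are not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class MicrobatchActivation:
    microbatch_id: int
    forward_step: int    # schedule step at which the forward pass runs
    backward_step: int   # schedule step at which the backward pass runs
    memory_bytes: int    # activation memory held between the two passes

    @property
    def lifespan(self) -> int:
        return self.backward_step - self.forward_step


def select_for_offload(activations, memory_target_bytes):
    """Pick activations to offload, longest lifespan first.

    Greedily offloads the activations that stay resident the longest until
    the (roughly estimated) resident memory drops to memory_target_bytes.
    This mirrors the intuition that long-lived activations give the largest
    peak-memory reduction per byte transferred; the greedy policy itself is
    a simplification.
    """
    resident = sum(a.memory_bytes for a in activations)
    offloaded = []
    for act in sorted(activations, key=lambda a: a.lifespan, reverse=True):
        if resident <= memory_target_bytes:
            break
        offloaded.append(act)
        resident -= act.memory_bytes
    return offloaded
```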
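Finally, the implementation notes (a single transfer stream shared by offload and reload, pinned host buffers) could look roughly like the following PyTorch sketch. The function names, the module-level `transfer_stream`, and the per-tensor buffer allocation are assumptions for illustration; the paper's actual implementation additionally uses topology-aware placement and continuous host-side buffers.

```python
import torch

# One CUDA stream shared by both offload (device-to-host) and reload
# (host-to-device) copies, mirroring the single-stream design the paper
# favours over separate offload/reload streams.
transfer_stream = torch.cuda.Stream()


def offload_activation(gpu_tensor: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy an activation into a pinned host buffer."""
    host_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                           device="cpu", pin_memory=True)
    transfer_stream.wait_stream(torch.cuda.current_stream())  # copy after the producer finishes
    with torch.cuda.stream(transfer_stream):
        host_buf.copy_(gpu_tensor, non_blocking=True)
    # The caller must keep gpu_tensor alive until the copy on transfer_stream
    # has completed before freeing the GPU memory.
    return host_buf


def reload_activation(host_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Bring an offloaded activation back to the GPU before its backward pass."""
    gpu_tensor = torch.empty(host_buf.shape, dtype=host_buf.dtype, device=device)
    transfer_stream.wait_stream(torch.cuda.current_stream())  # order after the allocation
    with torch.cuda.stream(transfer_stream):
        gpu_tensor.copy_(host_buf, non_blocking=True)
    torch.cuda.current_stream().wait_stream(transfer_stream)  # copy done before compute reads it
    return gpu_tensor
```

Sharing one stream serializes offloads and reloads, which is consistent with the reduced latency fluctuation the paper cites for the single-stream design.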