
DawnPiper: Scalable Pipeline Training

Updated 4 February 2026
  • DawnPiper is a memory-scalable pipeline parallel training framework that leverages node-level partitioning and theoretical guidelines to balance compute and memory across GPUs.
  • It employs deep learning compilation-based profiling to optimize fine-grained operations, enabling efficient micro-batch scheduling and reducing memory imbalances.
  • Cost-model-driven memory optimization in DawnPiper enables up to 11× larger batch sizes and significant throughput improvements compared to prior pipeline systems.

DawnPiper is a memory-scalable pipeline parallel training framework designed to address the limitations of existing deep learning pipeline systems, particularly the GPU memory imbalance that restricts model size and overall training efficiency. By combining fine-grained model partitioning enabled by deep learning compilation-based profiling, a rigorously derived pipeline partitioning theorem, and a cost-model-driven approach for memory optimization, DawnPiper significantly increases maximum trainable batch size and improves training throughput relative to established baselines such as vPipe and PipeDream (Peng et al., 9 May 2025).

1. Motivation and Core Problem

Pipeline parallelism, in the tradition of frameworks such as GPipe and PipeDream, partitions a neural network into \ell stages through which micro-batches are streamed, with each stage assigned to a distinct GPU. The central objective is to minimize the maximal per-stage computation time,

T_x = \sum_{n \in \mathcal{N}_x} (t^\text{f}_n + t^\text{b}_n),

while overlapping computation with inter-stage communication. However, layers within deep models often exhibit heterogeneous ratios of compute time to activation and parameter memory. Pipeline schemes, especially asynchronous variants like PipeDream’s 1F1B, impose further memory overheads via uneven parameter replication—stage x must store \ell - x extra copies. Consequently, partitions balanced by compute can cause extreme memory imbalances, with over 40% of GPU memory left idle on certain stages, severely capping trainable batch sizes and diminishing scalability (Peng et al., 9 May 2025).
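To make the imbalance concrete, here is a toy calculation (illustrative numbers, not from the paper) of per-stage memory under PipeDream-style weight stashing, where stage x keeps \ell - x extra parameter copies:

```python
def stage_memory(act_mem, param_mem, num_stages):
    """Per-stage memory (GB) when stage x stores (num_stages - x) extra
    parameter copies, as under 1F1B weight stashing; x is 1-indexed."""
    return [act_mem[x - 1] + param_mem[x - 1] * (1 + num_stages - x)
            for x in range(1, num_stages + 1)]

# Hypothetical compute-balanced split: equal activations/params per stage.
mems = stage_memory([6.0] * 4, [4.0] * 4, 4)
print(mems)  # -> [22.0, 18.0, 14.0, 10.0]: stage 1 holds 2.2x stage 4's memory
```

Even a perfectly compute-balanced split leaves the last stage with less than half the memory footprint of the first, which is the idle-memory gap the partitioner must close.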

2. Deep Learning Compilation-Based Profiling

DawnPiper imports PyTorch models via torch.fx to generate a detailed tensor-level directed acyclic graph (DAG). Each node corresponds to a fine-grained operation, which exposes a substantially larger space for partitioning and memory optimization at node rather than layer granularity and facilitates automatic code generation for each partition.

The system profiles each node n for:

  • t^\text{f}_n, t^\text{b}_n: forward and backward execution times,
  • m^\text{a}_n, m^\text{p}_n: activation and parameter memory requirements,
  • saved-tensor timing, determined via torch.autograd.graph.saved_tensors_hooks to precisely track activation lifespans.

This granular profiling informs both partition search and stage-wise memory optimization, enabling the placement of partition points at a much finer scale than previous frameworks.
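The per-node profile records can be sketched as plain data (a minimal illustration; the actual framework derives these measurements from torch.fx tracing and saved-tensor hooks, and all names and numbers here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class NodeProfile:
    # Per-node measurements gathered during profiling (names illustrative).
    t_fwd: float   # forward execution time t_n^f (ms)
    t_bwd: float   # backward execution time t_n^b (ms)
    m_act: float   # activation memory m_n^a (MB)
    m_par: float   # parameter memory m_n^p (MB)

    @property
    def compute(self):
        return self.t_fwd + self.t_bwd

# Hypothetical profile of a 4-node graph.
graph = [NodeProfile(1.0, 2.0, 110, 0), NodeProfile(0.5, 1.0, 50, 5),
         NodeProfile(2.0, 4.0, 200, 20), NodeProfile(1.5, 3.0, 80, 40)]

total_compute = sum(n.compute for n in graph)   # feeds T_x in the objective
total_memory = sum(n.m_act + n.m_par for n in graph)
print(total_compute, total_memory)
```

Prefix sums over such records are all the partition search below needs, which is why node-level granularity stays cheap despite the much larger cut space.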

3. Performance-Optimal Partitioning Theorem

DawnPiper formalizes the pipeline partitioning task as

\min \max_{1 \leq x \leq \ell} T_x \quad \text{subject to} \quad \text{peak memory per stage} \leq M_\text{GPU}.

A full search over partitionings is exponential in the number of nodes. DawnPiper leverages the empirical observation that over 90% of nodes consume \leq 150 MB each, so modest adjustments of partition points entail limited memory overhead.

Theorem 1 (Partition-Range Theorem):

Given a two-stage split, let \rho_\text{cb} denote the compute-balanced cut (equalizing total t^\text{f} + t^\text{b}) and \rho_\text{mb} the memory-balanced cut (equalizing stage peak memory under the execution schedule). If

  1. Cumulative compute and memory grow monotonically with node order,
  2. Inter-stage communication time is negligible,
  3. Memory optimization opportunities are roughly uniform, then the globally optimal cut must fall within [\rho_\text{mb}, \rho_\text{cb}]. Shifting the cut right of \rho_\text{cb} worsens both compute and memory balance, while moving it left of \rho_\text{mb} creates a new bottleneck stage.

4. Binary Pipeline Partitioning Algorithm

DawnPiper extends Theorem 1 recursively to pipelines with \ell > 2 stages using a binary (divide-and-conquer) search over the permissible cut range at each level of the hierarchy. The process is implemented as follows:

function AdjacentPartition(nodes, sid):
  (ρ_cb, ρ_mb) ← CompMemBalancedCut(nodes, sid)
  if both sides fit in GPU memory without optimization:
    return compute-balanced max-T
  candidates ← list of cuts between ρ_mb and ρ_cb
  best_T ← ∞
  for ρ in candidates:
    (mopt, Tmax) ← MemOptimizeForCut(nodes, sid, ρ)
    best_T ← min(best_T, Tmax)
    if mopt == false: break
  return best_T

function BiPar(nodes, L, R):
  if R − L == 1:
    return AdjacentPartition(nodes[L..R], L)
  mid ← floor((L + R) / 2)
  (ρ_cb, ρ_mb) ← CompMemBalancedCut(nodes, mid)
  candidates ← cuts in [ρ_mb, ρ_cb]
  best_T ← ∞
  for ρ in candidates:
    T_left ← BiPar(nodes[L..mid], L, mid)
    T_right ← BiPar(nodes[mid+1..R], mid+1, R)
    best_T ← min(best_T, max(T_left, T_right))
    if memory-fail(mid-side): break
  return best_T

BiPar(full_computation_graph, 1, ℓ)

Here, CompMemBalancedCut rapidly computes \rho_\text{cb} and \rho_\text{mb} using prefix sums. The search interval [\rho_\text{mb}, \rho_\text{cb}] reduces search complexity from exponential to approximately \varphi^{\log_2 \ell}, with empirical \varphi \approx 100 for models such as BERT (which has over 1,000 nodes).
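The prefix-sum search can be sketched as follows (a minimal illustration in the spirit of CompMemBalancedCut; the function name, data, and exact tie-breaking here are assumptions, not the paper's implementation):

```python
from itertools import accumulate
from bisect import bisect_left

def balanced_cut(costs):
    """Return the cut index (split after it) that best balances the two sides.

    Prefix sums plus bisection make the search O(log n) per query instead
    of re-summing both sides for every candidate cut."""
    prefix = list(accumulate(costs))
    total = prefix[-1]
    i = bisect_left(prefix, total / 2)            # first prefix sum >= half
    def max_side(j):
        return max(prefix[j], total - prefix[j])
    # The balanced cut is i or its left neighbor, whichever minimizes the
    # heavier side.
    return min((j for j in (i - 1, i) if 0 <= j < len(costs) - 1),
               key=max_side)

# Hypothetical per-node profiles: compute (ms) and peak memory (MB).
compute = [3, 3, 6, 6]
memory  = [300, 100, 60, 40]
rho_cb = balanced_cut(compute)    # compute-balanced cut
rho_mb = balanced_cut(memory)     # memory-balanced cut
print(rho_cb, rho_mb)             # candidates are then searched in [ρ_mb, ρ_cb]
```

BiPar would then evaluate only the cuts between the two returned indices at each recursion level, which is what collapses the exponential search space.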

5. Cost-Model-Based Memory Optimization

When a stage’s memory exceeds GPU limits after partitioning, DawnPiper integrates a tensor-level cost model based on Capuchin [Peng et al., ASPLOS '20] to select between swapping and recomputation for each tensor:

  • For tensor i:
    • \text{size}(i): freed memory,
    • t_\text{swap}(i) = \text{size}(i)/\text{BW}_\text{peer}: transfer time over the peer link,
    • t_\text{recomp}(i): incremental recomputation time,
    • \text{FreeTime}(i): swap duration that can be overlapped with compute,
    • \text{Overhead}_\text{swap}(i) = \max(0, t_\text{swap}(i) - \text{FreeTime}(i)),
    • \text{Overhead}_\text{recomp}(i) = t_\text{recomp}(i),
    • \text{MSPS}_\ast = \text{Overhead}_\ast/\text{size}(i) for both swap and recomputation.

The greedy optimization algorithm sorts tensors by \min(\text{MSPS}_\text{swap}, \text{MSPS}_\text{recomp}) and applies the cheaper action to the top candidates until the stage’s peak memory fits within M_\text{GPU}. This process yields near-optimal memory plans that minimize total execution time.
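The greedy selection can be sketched as follows (names and numbers are illustrative, not DawnPiper's code): per tensor, take the cheaper of swap vs. recompute, rank by MSPS = overhead / size, and evict until the stage fits the budget.

```python
def plan_memory(tensors, peak, budget):
    """tensors: (name, size_mb, swap_overhead_ms, recomp_overhead_ms)."""
    scored = []
    for name, size, ov_swap, ov_recomp in tensors:
        # Cheaper action per tensor: residual swap cost vs. recompute cost.
        action, overhead = min(("swap", ov_swap), ("recomp", ov_recomp),
                               key=lambda p: p[1])
        scored.append((overhead / size, name, action, size))
    plan = []
    for _msps, name, action, size in sorted(scored):  # cheapest savings first
        if peak <= budget:
            break
        plan.append((name, action))
        peak -= size                                  # memory freed
    return plan, peak

tensors = [("a", 200, 0.0, 5.0),   # swap fully overlaps with compute -> free
           ("b", 100, 4.0, 1.0),   # recomputation is cheaper than swapping
           ("c",  50, 6.0, 8.0)]   # costly either way; touched only if needed
plan, peak = plan_memory(tensors, peak=760, budget=500)
print(plan, peak)  # -> [('a', 'swap'), ('b', 'recomp')] 460
```

Tensor c is left resident because the budget is met first, which is the sense in which the greedy plan minimizes added execution time.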

6. Empirical Evaluation and Benchmark Analysis

Experiments were conducted on servers with 8× NVIDIA A100 40 GB GPUs and PCIe 4.0 ×16 interconnects, across models including BERT (340M), GPT-2 (770M), T5 (780M), and AmoebaNet (28M), with \ell = 4 and \ell = 8 pipeline stages. Baselines were ZeRO-2/3, torchgpipe (GPipe), PipeDream, and vPipe.

Maximum Trainable Batch Size

  • Synchronous regime (\ell = 4, 8): DawnPiper-S enables up to 4× larger batch size compared to vPipe-S, and up to 11× larger than PipeDream (asynchronous).
  • Asynchronous regime (1F1B): DawnPiper-AS supports 2.1× (\ell = 4) and 2.7× (\ell = 8) larger batches than vPipe-AS, and consistently 4.8–11× over PipeDream.

Training Throughput

  • Under identical batch sizes, DawnPiper delivers up to 1.5× speedup over vPipe synchronously, and 1.41× asynchronously once memory optimization becomes active.
  • On 4 GPUs: mean speedup over vPipe-AS is 1.20× (BERT), 1.12× (GPT-2), 1.32× (T5), and 1.24× (AmoebaNet).
  • On 8 GPUs (\ell = 8): asynchronous DawnPiper is 1.15× faster on GPT-2 and 1.34× faster on T5.

Per-Stage Balance Illustration

For T5 with \ell = 8:

  • max–min memory gap: 35.8% (vPipe) vs. 7.7% (DawnPiper),
  • GPU memory utilization: 17.8% higher under DawnPiper,
  • compute-time skew (longest vs. shortest stage): reduced from 36% (vPipe) to 8% (DawnPiper).

A plausible implication is that such fine-grained partitioning and optimization may enable training of substantially larger models on a fixed hardware budget.

7. Relationship to Prior Work and Significance

The DawnPiper framework extends beyond previous partitioning approaches such as GPipe [Huang et al., NeurIPS ’19], PipeDream [Narayanan et al., SOSP ’19], and vPipe [Zhao et al., TPDS ’22], which generally partition at the coarse, layer-wise level and do not jointly optimize for memory heterogeneity and peak GPU occupancy. Tensor-level memory cost modeling is adapted from Capuchin [Peng et al., ASPLOS ’20], but integrated here directly into the partitioning process.

DawnPiper’s advances lie in its combination of DL-compiled per-node profiling, theorem-constrained partitioning, and aggressive yet cost-aware tensor memory management, collectively yielding up to 4–11× larger supported batch sizes and significant speedup. This positions DawnPiper as a scalable, theoretically grounded alternative in large-model pipeline training (Peng et al., 9 May 2025).

| Framework  | Partitioning Granularity | Memory Balancing | Max Batch Size (relative to PipeDream) |
| GPipe      | Layer                    | No               | –                                      |
| PipeDream  | Layer                    | No               | 1× (baseline)                          |
| vPipe      | Layer                    | Partial          | 1.1–2×                                 |
| DawnPiper  | Node (fine-grained)      | Yes              | 4×–11×                                 |

These results highlight the substantial impact of node-level partitioning and cost-based memory optimization for pipeline-parallel scaling of modern deep models.
