
DawnPiper: Scalable Pipeline Training

Updated 4 February 2026
  • DawnPiper is a memory-scalable pipeline parallel training framework that leverages node-level partitioning and theoretical guidelines to balance compute and memory across GPUs.
  • It employs deep learning compilation-based profiling to optimize fine-grained operations, enabling efficient micro-batch scheduling and reducing memory imbalances.
  • Cost-model-driven memory optimization in DawnPiper enables up to 11× larger batch sizes and significant throughput improvements compared to prior pipeline systems.

DawnPiper is a memory-scalable pipeline parallel training framework designed to address the limitations of existing deep learning pipeline systems, particularly the GPU memory imbalance that restricts model size and overall training efficiency. By combining fine-grained model partitioning enabled by deep learning compilation-based profiling, a rigorously derived pipeline partitioning theorem, and a cost-model-driven approach for memory optimization, DawnPiper significantly increases maximum trainable batch size and improves training throughput relative to established baselines such as vPipe and PipeDream (Peng et al., 9 May 2025).

1. Motivation and Core Problem

Pipeline parallelism, in the tradition of frameworks such as GPipe and PipeDream, partitions a neural network into \ell stages through which micro-batches are streamed, with each stage assigned to a distinct GPU. The central objective is to minimize the maximal per-stage computation time,

T_x = \sum_{n \in \mathcal{N}_x} (t^\text{f}_n + t^\text{b}_n),

while overlapping computation with inter-stage communication. However, layers within deep models often exhibit heterogeneous ratios of compute time to activation and parameter memory. Pipeline schemes, especially asynchronous variants like PipeDream’s 1F1B, impose further memory overheads via uneven parameter replication—stage x must store \ell - x extra copies. Consequently, partitions balanced by compute can cause extreme memory imbalances, with over 40% of GPU memory left idle on certain stages, severely capping trainable batch sizes and diminishing scalability (Peng et al., 9 May 2025).
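To make the imbalance concrete, here is a toy calculation (illustrative numbers, not from the paper) of per-stage memory under PipeDream-style weight stashing, where stage x keeps \ell - x extra parameter copies:

```python
def stage_memory(act_mem, param_mem, num_stages):
    """Per-stage memory (GB) when stage x stores (num_stages - x) extra
    parameter copies, as under 1F1B weight stashing; x is 1-indexed."""
    return [act_mem[x - 1] + param_mem[x - 1] * (1 + num_stages - x)
            for x in range(1, num_stages + 1)]

# Hypothetical compute-balanced split: equal activations/params per stage.
mems = stage_memory([6.0] * 4, [4.0] * 4, 4)
print(mems)  # -> [22.0, 18.0, 14.0, 10.0]: stage 1 holds 2.2x stage 4's memory
```

Even a perfectly compute-balanced split leaves the last stage with less than half the memory footprint of the first, which is the idle-memory gap the partitioner must close.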

2. Deep Learning Compilation-Based Profiling

DawnPiper imports PyTorch models via torch.fx to generate a detailed tensor-level directed acyclic graph (DAG). Each node corresponds to a fine-grained operation, which exposes a substantially larger space for partitioning and memory optimization at node rather than layer granularity and facilitates automatic code generation for each partition.

The system profiles each node n for:

  • t^\text{f}_n, t^\text{b}_n: forward and backward execution times,
  • m^\text{a}_n, m^\text{p}_n: activation and parameter memory requirements,
  • saved-tensor timing, determined via torch.autograd.graph.saved_tensors_hooks to precisely track activation lifespans.

This granular profiling informs both partition search and stage-wise memory optimization, enabling the placement of partition points at a much finer scale than previous frameworks.
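The per-node profile records can be sketched as plain data (a minimal illustration; the actual framework derives these measurements from torch.fx tracing and saved-tensor hooks, and all names and numbers here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class NodeProfile:
    # Per-node measurements gathered during profiling (names illustrative).
    t_fwd: float   # forward execution time t_n^f (ms)
    t_bwd: float   # backward execution time t_n^b (ms)
    m_act: float   # activation memory m_n^a (MB)
    m_par: float   # parameter memory m_n^p (MB)

    @property
    def compute(self):
        return self.t_fwd + self.t_bwd

# Hypothetical profile of a 4-node graph.
graph = [NodeProfile(1.0, 2.0, 110, 0), NodeProfile(0.5, 1.0, 50, 5),
         NodeProfile(2.0, 4.0, 200, 20), NodeProfile(1.5, 3.0, 80, 40)]

total_compute = sum(n.compute for n in graph)   # feeds T_x in the objective
total_memory = sum(n.m_act + n.m_par for n in graph)
print(total_compute, total_memory)
```

Prefix sums over such records are all the partition search below needs, which is why node-level granularity stays cheap despite the much larger cut space.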

3. Performance-Optimal Partitioning Theorem

DawnPiper formalizes the pipeline partitioning task as

\min \max_{1 \leq x \leq \ell} T_x \quad \text{subject to} \quad \text{peak memory per stage} \leq M_\text{GPU}.

A full search over partitionings is exponential in the number of nodes. DawnPiper leverages the empirical observation that over 90% of nodes consume \leq 150 MB each, so modest adjustments of partition points entail limited memory overhead.

Theorem 1 (Partition-Range Theorem):

Given a two-stage split, let \rho_\text{cb} denote the compute-balanced cut (equalizing total t^\text{f} + t^\text{b}) and \rho_\text{mb} the memory-balanced cut (equalizing stage peak memory under the execution schedule). If

  1. Cumulative compute and memory grow monotonically with node order,
  2. Inter-stage communication time is negligible,
  3. Memory optimization opportunities are roughly uniform, then the globally optimal cut must fall within [\rho_\text{mb}, \rho_\text{cb}]. Shifting the cut right of \rho_\text{cb} worsens both compute and memory balance, while moving it left of \rho_\text{mb} creates a new bottleneck stage.

4. Binary Pipeline Partitioning Algorithm

DawnPiper extends Theorem 1 recursively to pipelines with \ell > 2 stages using a binary (divide-and-conquer) search over the permissible cut range at each level of the hierarchy. The process is implemented as follows:

function AdjacentPartition(nodes, sid):
  (ρ_cb, ρ_mb) ← CompMemBalancedCut(nodes, sid)
  if both sides fit in GPU memory without optimization:
    return compute-balanced max-T
  candidates ← list of cuts between ρ_mb and ρ_cb
  best_T ← ∞
  for ρ in candidates:
    (mopt, Tmax) ← MemOptimizeForCut(nodes, sid, ρ)
    best_T ← min(best_T, Tmax)
    if mopt == false: break
  return best_T

function BiPar(nodes, L, R):
  if R − L == 1:
    return AdjacentPartition(nodes[L..R], L)
  mid ← floor((L + R) / 2)
  (ρ_cb, ρ_mb) ← CompMemBalancedCut(nodes, mid)
  candidates ← cuts in [ρ_mb, ρ_cb]
  best_T ← ∞
  for ρ in candidates:
    T_left ← BiPar(nodes[L..mid], L, mid)
    T_right ← BiPar(nodes[mid+1..R], mid+1, R)
    best_T ← min(best_T, max(T_left, T_right))
    if memory-fail(mid-side): break
  return best_T

BiPar(full_computation_graph, 1, ℓ)

Here, CompMemBalancedCut rapidly computes \rho_\text{cb} and \rho_\text{mb} using prefix sums. The search interval [\rho_\text{mb}, \rho_\text{cb}] reduces search complexity from exponential to approximately \varphi^{\log_2 \ell}, with empirical \varphi \approx 100 for models such as BERT (which has over 1,000 nodes).
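The prefix-sum search can be sketched as follows (a minimal illustration in the spirit of CompMemBalancedCut; the function name, data, and exact tie-breaking here are assumptions, not the paper's implementation):

```python
from itertools import accumulate
from bisect import bisect_left

def balanced_cut(costs):
    """Return the cut index (split after it) that best balances the two sides.

    Prefix sums plus bisection make the search O(log n) per query instead
    of re-summing both sides for every candidate cut."""
    prefix = list(accumulate(costs))
    total = prefix[-1]
    i = bisect_left(prefix, total / 2)            # first prefix sum >= half
    def max_side(j):
        return max(prefix[j], total - prefix[j])
    # The balanced cut is i or its left neighbor, whichever minimizes the
    # heavier side.
    return min((j for j in (i - 1, i) if 0 <= j < len(costs) - 1),
               key=max_side)

# Hypothetical per-node profiles: compute (ms) and peak memory (MB).
compute = [3, 3, 6, 6]
memory  = [300, 100, 60, 40]
rho_cb = balanced_cut(compute)    # compute-balanced cut
rho_mb = balanced_cut(memory)     # memory-balanced cut
print(rho_cb, rho_mb)             # candidates are then searched in [ρ_mb, ρ_cb]
```

BiPar would then evaluate only the cuts between the two returned indices at each recursion level, which is what collapses the exponential search space.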

5. Cost-Model-Based Memory Optimization

When a stage’s memory exceeds GPU limits after partitioning, DawnPiper integrates a tensor-level cost model based on Capuchin [Peng et al., ASPLOS '20] to select between swapping and recomputation for each tensor:

  • For tensor i:
    • \text{size}(i): freed memory,
    • t_\text{swap}(i) = \text{size}(i)/\text{BW}_\text{peer}: transfer time over the peer link,
    • t_\text{recomp}(i): incremental recomputation time,
    • \text{FreeTime}(i): swap duration that can be overlapped with compute,
    • \text{Overhead}_\text{swap}(i) = \max(0, t_\text{swap}(i) - \text{FreeTime}(i)),
    • \text{Overhead}_\text{recomp}(i) = t_\text{recomp}(i),
    • \text{MSPS}_\ast = \text{Overhead}_\ast/\text{size}(i) for both swap and recomputation.

The greedy optimization algorithm sorts tensors by \min(\text{MSPS}_\text{swap}, \text{MSPS}_\text{recomp}) and applies the cheaper action to the top candidates until the stage’s peak memory fits within M_\text{GPU}. This process yields near-optimal memory plans that minimize total execution time.
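The greedy selection can be sketched as follows (names and numbers are illustrative, not DawnPiper's code): per tensor, take the cheaper of swap vs. recompute, rank by MSPS = overhead / size, and evict until the stage fits the budget.

```python
def plan_memory(tensors, peak, budget):
    """tensors: (name, size_mb, swap_overhead_ms, recomp_overhead_ms)."""
    scored = []
    for name, size, ov_swap, ov_recomp in tensors:
        # Cheaper action per tensor: residual swap cost vs. recompute cost.
        action, overhead = min(("swap", ov_swap), ("recomp", ov_recomp),
                               key=lambda p: p[1])
        scored.append((overhead / size, name, action, size))
    plan = []
    for _msps, name, action, size in sorted(scored):  # cheapest savings first
        if peak <= budget:
            break
        plan.append((name, action))
        peak -= size                                  # memory freed
    return plan, peak

tensors = [("a", 200, 0.0, 5.0),   # swap fully overlaps with compute -> free
           ("b", 100, 4.0, 1.0),   # recomputation is cheaper than swapping
           ("c",  50, 6.0, 8.0)]   # costly either way; touched only if needed
plan, peak = plan_memory(tensors, peak=760, budget=500)
print(plan, peak)  # -> [('a', 'swap'), ('b', 'recomp')] 460
```

Tensor c is left resident because the budget is met first, which is the sense in which the greedy plan minimizes added execution time.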

6. Empirical Evaluation and Benchmark Analysis

Experiments were conducted on servers with 8× NVIDIA A100 40 GB GPUs and PCIe 4.0 ×16 interconnects, across models including BERT (340M), GPT-2 (770M), T5 (780M), and AmoebaNet (28M), with \ell = 4 and \ell = 8 pipeline stages. Baselines were ZeRO-2/3, torchgpipe (GPipe), PipeDream, and vPipe.

Maximum Trainable Batch Size

  • Synchronous regime (\ell = 4, 8): DawnPiper-S enables up to 4× larger batch size compared to vPipe-S, and up to 11× larger than PipeDream (asynchronous).
  • Asynchronous regime (1F1B): DawnPiper-AS supports 2.1× (\ell = 4) and 2.7× (\ell = 8) larger batches than vPipe-AS, and consistently 4.8–11× over PipeDream.

Training Throughput

  • Under identical batch sizes, DawnPiper delivers up to 1.5× speedup over vPipe synchronously, and 1.41× asynchronously once memory optimization becomes active.
  • On 4 GPUs: mean speedup over vPipe-AS is 1.20× (BERT), 1.12× (GPT-2), 1.32× (T5), and 1.24× (AmoebaNet).
  • On 8 GPUs (\ell = 8): asynchronous DawnPiper is 1.15× faster on GPT-2 and 1.34× faster on T5.

Per-Stage Balance Illustration

For T5 with \ell = 8:

  • max–min memory gap: 35.8% (vPipe) vs. 7.7% (DawnPiper),
  • GPU memory utilization: 17.8% higher under DawnPiper,
  • compute-time skew (longest vs. shortest stage): reduced from 36% (vPipe) to 8% (DawnPiper).

A plausible implication is that such fine-grained partitioning and optimization may enable training of substantially larger models on a fixed hardware budget.

7. Relationship to Prior Work and Significance

The DawnPiper framework extends beyond previous partitioning approaches such as GPipe [Huang et al., NeurIPS ’19], PipeDream [Narayanan et al., SOSP ’19], and vPipe [Zhao et al., TPDS ’22], which generally partition at the coarse, layer-wise level and do not jointly optimize for memory heterogeneity and peak GPU occupancy. Tensor-level memory cost modeling is adapted from Capuchin [Peng et al., ASPLOS ’20], but integrated here directly into the partitioning process.

DawnPiper’s advances lie in its combination of DL-compiled per-node profiling, theorem-constrained partitioning, and aggressive yet cost-aware tensor memory management, collectively yielding up to 4–11× larger supported batch sizes and significant speedup. This positions DawnPiper as a scalable, theoretically grounded alternative in large-model pipeline training (Peng et al., 9 May 2025).

| Framework  | Partitioning Granularity | Memory Balancing | Max Batch Size (relative to PipeDream) |
| GPipe      | Layer                    | No               | –                                      |
| PipeDream  | Layer                    | No               | 1× (baseline)                          |
| vPipe      | Layer                    | Partial          | 1.1–2×                                 |
| DawnPiper  | Node (fine-grained)      | Yes              | 4×–11×                                 |

These results highlight the substantial impact of node-level partitioning and cost-based memory optimization for pipeline-parallel scaling of modern deep models.
