GreedySnake: SSD-Offloaded LLM Training
- GreedySnake is an SSD-offloaded training system for large language models that uses vertical scheduling to minimize redundant I/O.
- It overlaps CPU/SSD-based optimizer computation with forward and backward passes, achieving 1.9–2.5× speedups over prior SSD-offloaded systems.
- Empirical evaluations demonstrate that GreedySnake maintains statistical equivalence with baselines while significantly reducing memory and bandwidth bottlenecks.
GreedySnake is an SSD-offloaded training system designed to accelerate LLM training by mitigating GPU memory constraints and addressing I/O bottlenecks through vertical scheduling and efficient optimizer step overlapping. By rethinking the scheduling of micro-batches within each layer and strategically overlapping CPU/SSD optimizer steps with forward computation, GreedySnake significantly improves training throughput on commodity hardware while retaining strict statistical equivalence with baseline methods (Yue et al., 19 Dec 2025).
1. Motivation and Problem Setting
The escalation in LLM parameter counts (tens to hundreds of billions) has shifted the primary training bottleneck from raw GPU compute to memory capacity, particularly for single-node deployments. Existing systems such as ZeRO-Offload and ZeRO-Infinity address this by swapping activations, parameters, and optimizer states between GPU DRAM, host DRAM, and SSD. Although this makes training feasible, it introduces distinct challenges:
- Excessive I/O Volume: Standard "horizontal" gradient-accumulation schedules necessitate multiple passes of loading low-precision weights, writing/re-reading activation checkpoints, and frequent swapping of partial gradients for every micro-batch. With $m$ micro-batches and $L$ layers, the incurred off-GPU traffic per iteration is dominated by terms on the order of $mLW$, $mLA$, and $mLG$ for weights, checkpoints, and gradients, respectively, where $W$, $A$, and $G$ denote the per-layer weight, per-micro-batch activation-checkpoint, and per-layer gradient sizes.
- Limited Overlap with Optimizer Step: The optimizer step cannot commence until the entire horizontal sequence of micro-batches concludes backpropagation, severely restricting the ability to hide disk/CPU operations behind GPU computation.
GreedySnake introduces a "vertical" scheduling paradigm, executing all micro-batches for a given layer before moving to the next. This approach reduces redundant I/O by nearly a factor of $m$ and dramatically widens the window of possible overlap between backward passes and optimizer updates.
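To make the scaling concrete, the following back-of-the-envelope Python sketch compares per-iteration off-GPU traffic under the two schedules with a simplified cost model; all sizes and the `traffic_gb` helper are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope comparison of per-iteration off-GPU traffic under
# horizontal vs. vertical scheduling. All sizes are illustrative placeholders.

def traffic_gb(num_layers, micro_batches, weight_gb, ckpt_gb, grad_gb, vertical):
    """Rough per-iteration off-GPU traffic (GB) under a simplified cost model.

    weight_gb : low-precision weights per layer
    ckpt_gb   : activation checkpoint per layer per micro-batch
    grad_gb   : accumulated gradient per layer
    """
    if vertical:
        # Weights are fetched once per layer and the gradient leaves the GPU
        # once per layer, independent of the number of micro-batches.
        weights = num_layers * weight_gb
        grads = num_layers * grad_gb
    else:
        # Horizontal accumulation re-loads weights and swaps partial gradients
        # out and back for every micro-batch.
        weights = num_layers * micro_batches * weight_gb
        grads = num_layers * micro_batches * 2 * grad_gb
    # Checkpoints are written during forward and re-read during backward for
    # every micro-batch under either schedule.
    ckpts = num_layers * micro_batches * 2 * ckpt_gb
    return weights + ckpts + grads

m = 8  # micro-batches per iteration (illustrative)
hor = traffic_gb(80, m, weight_gb=1.6, ckpt_gb=0.27, grad_gb=1.6, vertical=False)
ver = traffic_gb(80, m, weight_gb=1.6, ckpt_gb=0.27, grad_gb=1.6, vertical=True)
print(f"horizontal: {hor:.0f} GB  vertical: {ver:.0f} GB  savings: {hor/ver:.1f}x")
```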
2. System Architecture and Data Movement
GreedySnake is implemented atop PyTorch 2.x, extending ZeRO-Infinity's asynchronous cpu_adam and I/O handling infrastructure. The design partitions each Transformer layer's state into:
- Low-precision parameters (forward and backward passes)
- Full-precision optimizer states (Adam momenta, variances)
- Activation checkpoints
- Inter-layer gradients
Data are dynamically allocated and transferred across three memory tiers: GPU DRAM, host DRAM, and NVMe-SSD. GreedySnake’s runtime consists of three key asynchronous coordinators:
- Parameter Coordinator: Manages parameter prefetching/writing across tiers on a layer-and-micro-batch granularity.
- Inter-layer Tensor Coordinator: Pipelines the offload and prefetch of activation checkpoints and inter-layer gradients for all micro-batches per layer.
- Optimizer Step Coordinator: Gathers full-precision gradients post-accumulation, performs cpu_adam updates in chunked form, and schedules partial overlap of optimizer work with subsequent forward passes.
This architecture enables a two-dimensional pipelined schedule over time and resources (as depicted in Figures 3–5 of the primary source), supporting overlap of I/O, computation, and optimizer steps.
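The snippet below is a minimal sketch of the style of asynchronous prefetching these coordinators perform, using a side CUDA stream and pinned host buffers; the class name, methods, and double-buffering policy are illustrative assumptions rather than GreedySnake's actual interfaces.

```python
import torch

class ParameterPrefetcher:
    """Illustrative double-buffered prefetcher (not GreedySnake's actual API):
    weight copies for the next layer are issued on a side CUDA stream so they
    overlap with compute for the current layer."""

    def __init__(self, host_layers, device="cuda"):
        self.host_layers = host_layers            # pinned CPU tensors, one per layer
        self.device = device
        self.copy_stream = torch.cuda.Stream()
        self.inflight = {}                        # layer idx -> (gpu tensor, cuda event)

    def prefetch(self, idx):
        """Start an async host-to-GPU copy of layer idx's low-precision weights."""
        with torch.cuda.stream(self.copy_stream):
            gpu = self.host_layers[idx].to(self.device, non_blocking=True)
            evt = torch.cuda.Event()
            evt.record(self.copy_stream)
        self.inflight[idx] = (gpu, evt)

    def get(self, idx):
        """Make the compute stream wait until layer idx's weights have arrived."""
        gpu, evt = self.inflight.pop(idx)
        torch.cuda.current_stream().wait_event(evt)
        return gpu

# Usage pattern: prefetch layer i+1 while computing layer i.
# prefetcher.prefetch(0)
# for i in range(num_layers):
#     if i + 1 < num_layers:
#         prefetcher.prefetch(i + 1)
#     weights = prefetcher.get(i)
#     ...run all micro-batches of layer i with `weights`...
```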
3. Vertical Scheduling: Algorithm and Analysis
The cornerstone of GreedySnake’s performance is its "vertical" scheduling algorithm. For $L$ layers and $m$ micro-batches per iteration, the workflow is as follows (a schematic sketch appears after the list):
- Forward phase: for each layer $\ell = 1, \dots, L$, run the forward pass for all $m$ micro-batches before advancing to layer $\ell+1$, offloading each micro-batch's activation checkpoint as it is produced.
- Backward phase: for each layer $\ell = L, \dots, 1$, run the backward pass for all $m$ micro-batches, accumulating that layer's full-precision gradient in a single buffer.
- Optimizer step and parameter writeback: once a layer's gradient accumulation completes, perform its optimizer update and write the updated parameters back, overlapping this work with the backward passes of the remaining layers.
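The following schematic sketches the vertical schedule in plain Python; `fetch`, `offload`, `loss_grad`, and `optimizer_step` are hypothetical callables standing in for the coordinators described above, and layer objects are assumed to expose simple `forward`/`backward` methods.

```python
def train_iteration(layers, micro_batches, fetch, offload, loss_grad, optimizer_step):
    """Schematic vertical schedule (not GreedySnake's actual code)."""
    # Forward phase: each layer sees all micro-batches before the next layer,
    # so its low-precision weights are fetched from host/SSD exactly once.
    acts = list(micro_batches)
    for layer in layers:
        w = fetch(layer)                            # bring this layer's weights to GPU
        acts = [layer.forward(w, x) for x in acts]  # checkpoints offloaded per micro-batch
        offload(layer)                              # release GPU copies before the next layer

    # Backward phase: reverse layer order, again all micro-batches per layer.
    grads = [loss_grad(y) for y in acts]
    for layer in reversed(layers):
        w = fetch(layer)
        grad_accum = None
        for i in range(len(grads)):
            grads[i], g_layer = layer.backward(w, grads[i])
            grad_accum = g_layer if grad_accum is None else grad_accum + g_layer
        # Gradient accumulation for this layer is complete: run the CPU/SSD
        # optimizer step and write back updated parameters right away.
        optimizer_step(layer, grad_accum)
        offload(layer)
```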
Relative to horizontal scheduling, vertical scheduling reduces I/O-bound operations such as weight loads (from $m$ to $1$ per layer) and gradient-buffer round-trips (likewise from $m$ to $1$ per layer), albeit at a modest increase in activation-checkpoint I/O. The net effect is substantially lower off-GPU traffic, with bandwidth savings of up to 6× on models like GPT-65B, owing to the disproportionate scaling between per-layer weights ($O(H^2)$ in the hidden dimension $H$) and per-micro-batch activation tensors ($O(b\,s\,H)$ for micro-batch size $b$ and sequence length $s$).
4. Optimization Step Overlapping
Despite efficiencies from vertical scheduling, the Adam optimizer step for each layer remains a high-latency operation, particularly when offloaded to CPU/SSD. GreedySnake mitigates this by deferring a fraction $\alpha$ of each layer's optimizer step into the next iteration's forward pass (sketched after the list below):
- During the backward pass, only a $1-\alpha$ fraction of the optimizer state is updated immediately; the remaining gradients (the $\alpha$ fraction) are temporarily pinned in host memory.
- During the subsequent iteration's forward pass, the deferred update is applied before the layer's parameters are transferred to the GPU.
- This partitioning enables overlap of the optimizer’s SSD/CPU latency with both later backward passes of the current iteration and early forward passes of the next iteration.
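A minimal sketch of the split update follows; the chunked `adam_step` callable, the `pending` dictionary, and the flattening strategy are illustrative assumptions, not the paper's implementation.

```python
import torch  # gradients are assumed to be torch tensors

def split_optimizer_step(adam_step, layer_id, grad, alpha, pending):
    """Apply a (1 - alpha) fraction of this layer's update now (during backward)
    and stash the remaining alpha fraction, pinned in host memory, for later.
    `adam_step(layer_id, grad_chunk, offset)` is a hypothetical chunked CPU-Adam call."""
    flat = grad.flatten()
    cut = int(flat.numel() * (1.0 - alpha))
    adam_step(layer_id, flat[:cut], offset=0)                 # immediate chunk
    pending[layer_id] = (cut, flat[cut:].cpu().pin_memory())  # deferred chunk, pinned in host DRAM

def flush_deferred(adam_step, layer_id, pending):
    """Run at the start of the next iteration's forward pass for this layer,
    before its (now fully updated) parameters are moved back to the GPU."""
    if layer_id in pending:
        offset, rest = pending.pop(layer_id)
        adam_step(layer_id, rest, offset=offset)
```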
Empirical results demonstrate a 10–15% reduction in I/O-bound iteration time with this mechanism, bringing measured overlap towards the theoretical optimum defined by the roofline model.
5. Roofline Model: Throughput and Bottleneck Analysis
The roofline model is adapted to delineate I/O and compute limitations. Let $b$ denote the per-iteration batch size, $C(b)$ the per-iteration FLOPs, $D(b)$ the per-iteration off-GPU traffic, $\beta$ the aggregate SSD/host bandwidth, and $P$ the peak GPU throughput. Throughput as a function of batch size is then bounded by:
- I/O Roofline: $T_{\mathrm{IO}}(b) = C(b)\,\beta / D(b)$
- Compute Roofline: $T_{\mathrm{comp}} = P$
Ideal systems exhibit a sharp "knee" at the batch size $b^*$ where $T_{\mathrm{IO}}(b^*) = P$ and the system transitions from I/O- to compute-bound. Compared to prior single-pass (Ratel) and horizontal-accumulation (ZeRO-Infinity) schemes, GreedySnake’s vertical schedule and deferred optimizer step yield a much steeper, higher-throughput curve, with $b^*$ as much as 2–3× smaller than for the horizontal baselines. This trajectory brings system throughput close to the theoretical compute roofline.
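The toy calculation below illustrates the two rooflines and the knee; all constants are made-up illustrative values, not the paper's measured hardware parameters.

```python
# Toy roofline: throughput vs. batch size with illustrative constants.
PEAK_TFLOPS = 300.0        # compute roofline P (peak GPU throughput)
IO_BW_GBPS = 12.0          # aggregate SSD/host bandwidth beta
TFLOPS_PER_SEQ = 800.0     # forward+backward TFLOPs per training sequence
FIXED_TRAFFIC_GB = 500.0   # per-iteration traffic independent of batch size
PER_SEQ_TRAFFIC_GB = 2.7   # checkpoint/gradient traffic that grows with batch size

def throughput_tflops(b):
    compute_tflop = TFLOPS_PER_SEQ * b
    io_time_s = (FIXED_TRAFFIC_GB + PER_SEQ_TRAFFIC_GB * b) / IO_BW_GBPS
    io_roofline = compute_tflop / io_time_s   # throughput if purely I/O-bound
    return min(io_roofline, PEAK_TFLOPS)      # compute roofline caps it

for b in (1, 4, 16, 64, 256):
    print(f"batch {b:4d}: {throughput_tflops(b):6.1f} TFLOP/s")
# The knee b* is where the I/O line first reaches PEAK_TFLOPS; lowering
# per-iteration traffic (vertical scheduling, deferred updates) shrinks b*.
```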
6. Empirical Performance and Comparative Evaluation
Experiments span two hardware platforms: dual EPYC 7302 with 4×RTX A5000 and dual Xeon Platinum 8462Y+ with 4×A100, both using high-performance NVMe SSDs. Tested models include GPT-30B, GPT-65B, and GPT-175B. Baseline comparisons cover ZeRO-Infinity, Ratel, and TeraIO.
Key findings on the A100 platform:
- GPT-65B, 1 GPU: GreedySnake achieves ~63 TFLOP/sec, 1.96× faster than ZeRO-Infinity (~32 TFLOP/sec).
- GPT-65B, 4 GPUs: 62 vs. 32 TFLOP/sec/GPU (1.93×).
- GPT-175B, 1 GPU: 128 vs. 50 TFLOP/sec (2.53×).
Loss trajectories on the Pile dataset are statistically indistinguishable across methods, confirming no regression in convergence or model quality. Baselines such as Ratel and TeraIO achieve only 30–40% of GreedySnake’s throughput.
7. Implementation Details and Limitations
Key implementation elements:
- Coordinators: Asynchronous handlers for parameter prefetching (Parameter Coordinator), activation/gradient offload (Inter-layer Tensor Coordinator), and chunked, pipelined optimizer updates (Optimizer Step Coordinator).
- Granularity Management: SSD-to-host transfers move full per-layer tensors, while host-to-GPU transfers are partitioned per micro-batch, maximizing PCIe bandwidth utilization.
- Pinned-memory Allocation: PyTorch’s power-of-two alignment for pinned buffers is managed with a dynamic-programming knapsack solver to minimize memory waste (an illustrative sketch follows this list).
- Auto-tuning: A linear program selects the scheduling configuration that maximizes throughput within DRAM-capacity and iteration-time constraints (see Algorithm 1 in the source).
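As an illustration of the knapsack idea behind the pinned-buffer packing, the sketch below groups consecutive tensors into power-of-two-sized buffers with a simple dynamic program; the formulation and the `pack_pinned_buffers` helper are assumptions for exposition, not the solver described in the paper.

```python
def next_pow2(n):
    p = 1
    while p < n:
        p <<= 1
    return p

def pack_pinned_buffers(sizes, max_bucket):
    """Illustrative DP: group consecutive tensors into pinned buffers whose
    sizes round up to powers of two, choosing split points that minimize
    total rounding waste.

    sizes      : per-tensor byte counts, in allocation order
    max_bucket : upper bound on a single pinned buffer's (rounded) size
    Returns (total_waste, list of (start, end) index ranges per buffer).
    """
    n = len(sizes)
    INF = float("inf")
    best = [INF] * (n + 1)       # best[i] = min waste packing the first i tensors
    best[0] = 0
    choice = [0] * (n + 1)       # choice[i] = start index of the last buffer
    for i in range(1, n + 1):
        total = 0
        for j in range(i, 0, -1):            # buffer holds tensors j-1 .. i-1
            total += sizes[j - 1]
            rounded = next_pow2(total)
            if rounded > max_bucket:
                break
            waste = rounded - total
            if best[j - 1] + waste < best[i]:
                best[i] = best[j - 1] + waste
                choice[i] = j - 1
    # Reconstruct buffer boundaries from the DP choices.
    buckets, i = [], n
    while i > 0:
        buckets.append((choice[i], i))
        i = choice[i]
    return best[n], list(reversed(buckets))

# Example: pack four tensors (byte sizes) into pinned buffers of at most 1 MiB each.
waste, buckets = pack_pinned_buffers([300_000, 180_000, 700_000, 90_000], 1 << 20)
print(waste, buckets)
```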
Limitations and directions for further work:
- Results are restricted to single-node configurations; multi-node extension with RDMA or NVMe-over-Fabric remains an open challenge.
- SSD bandwidth is a persistent bottleneck; experimentation with PCIe 5.0 or in-storage compute (SmartInfinity) is ongoing.
- Layer size uniformity is currently assumed; highly imbalanced architectures (e.g., Mixture-of-Experts, LoRA) may require per-layer parameterization.
- Extremely small batch regimes necessitate multiple GPU-resident gradient buffers, forcing trade-offs between DRAM footprint and I/O frequency. Hybrid scheduling (combining horizontal and vertical paradigms) may be required for DRAM-constrained environments.
In summary, GreedySnake’s vertical scheduling and overlapping optimizer strategy substantially reduce SSD and host-bandwidth pressure during LLM training and support dense, compute-efficient scheduling. These mechanisms deliver empirical speedups of 1.9–2.5× relative to the leading SSD-offloaded trainers, closely approximating the performance predicted by the roofline model (Yue et al., 19 Dec 2025).