
SSD-Offloaded Training Systems

Updated 26 December 2025
  • SSD-offloaded training is a paradigm that utilizes SSDs as high-bandwidth, cost-efficient memory and compute extensions, mitigating GPU/DRAM constraints in LLMs, GNNs, and image classification.
  • The approach employs advanced scheduling techniques, such as vertical scheduling and inspector–executor separation, to optimize data transfer and minimize I/O stalls.
  • Efficient tensor offloading and caching algorithms significantly reduce memory usage and energy consumption, with reported DRAM savings exceeding 55% and near-zero I/O delays.

SSD-offloaded training is a paradigm in machine learning system design that leverages solid-state drives (SSDs) as high-bandwidth, cost-efficient extensions for both memory and compute pathways in large-scale neural network training. Numerous frameworks, enabled by advances in NVMe SSD bandwidth, GPUDirect Storage, computational storage devices (CSDs), and optimized scheduling, demonstrate that SSDs can mitigate GPU/DRAM constraints and increase training throughput across domains such as LLMs, GNNs, and image classification. This synthesis reviews architectures, pipeline transformations, scheduling strategies, caching algorithms, memory optimizations, and empirical performance reported in recent literature.

1. Architectures and Data Placement Strategies

SSD-offloaded training systems are structured in multi-tier memory hierarchies, commonly spanning GPU DRAM, CPU DRAM, and SSD storage. In classic LLM offloading frameworks such as ZeRO-Infinity (and its successors MemAscend (Liaw et al., 29 May 2025), TERAIO (Yuan et al., 6 Jun 2025), GreedySnake (Yue et al., 19 Dec 2025)), parameters, optimizer states, and checkpoints are dynamically staged between GPU memory, host memory, and SSDs. For GNN training on billion-scale graphs (Ginex (Park et al., 2022)), adjacency structure pointer arrays remain resident in DRAM, while feature tables and large indices are allocated contiguously on SSD as 4 KB-aligned raw files.
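
Ginex's exact on-disk format is not reproduced here, but the following minimal sketch illustrates the general idea of a 4 KB-aligned raw feature file gathered by node id; the path, feature width, and padding scheme are assumptions for illustration only.

```python
import numpy as np

BLOCK = 4096                                              # assumed SSD logical block size
FEAT_DIM = 1024                                           # hypothetical feature width (float32)
ROW_BYTES = ((FEAT_DIM * 4 + BLOCK - 1) // BLOCK) * BLOCK # pad each row out to a 4 KB boundary

def read_features(path, node_ids):
    """Gather feature rows for the given node ids from a raw, 4 KB-aligned file."""
    out = np.empty((len(node_ids), FEAT_DIM), dtype=np.float32)
    # A production system would use O_DIRECT with aligned user buffers;
    # plain unbuffered reads keep the sketch simple.
    with open(path, "rb", buffering=0) as f:
        for i, nid in enumerate(node_ids):
            f.seek(nid * ROW_BYTES)            # aligned offset: no read-modify-write on the device
            buf = f.read(FEAT_DIM * 4)
            out[i] = np.frombuffer(buf, dtype=np.float32)
    return out
```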

In computational storage device setups (STANNIS (HeydariGorji et al., 2020)), each SSD embeds an ARM-based engine capable of performing forward and backward propagation autonomously, storing all data locally to guarantee privacy and minimize host traffic.

For preprocessing-intensive pipelines (DDLP (Wei et al., 17 Apr 2024)), both standard SSDs and CSDs serve as primary data buffers, enabling the CPU and CSD to preprocess batches in parallel and route them to the GPU via direct PCIe paths, notably employing GPUDirect Storage for peer-to-peer DMA.

2. Pipeline Transformations and Scheduling

Strategies for pipeline reorganization are central to SSD-offloaded efficiency. GreedySnake (Yue et al., 19 Dec 2025) transitions from horizontal gradient accumulation scheduling—where each micro-batch is processed sequentially through all layers—to vertical scheduling, which fully executes a layer’s forward and backward pass across all micro-batches before proceeding. This reduces redundant parameter and checkpoint I/O, increases overlap between optimizer steps and compute, and brings practical throughput closer to ideal roofline predictions.
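
The scheduling difference can be made concrete with a small sketch. The helpers below emit only an abstract operation trace with hypothetical fetch/forward/evict labels and forward passes alone; GreedySnake's real scheduler additionally handles backward passes, activation checkpoints, and optimizer I/O.

```python
def horizontal_schedule(num_layers, num_micro_batches):
    """Classic gradient-accumulation order: each micro-batch traverses all layers,
    so every layer's parameters are fetched from SSD once per micro-batch."""
    ops = []
    for mb in range(num_micro_batches):
        for layer in range(num_layers):
            ops.append(("fetch_params", layer))      # repeated for every micro-batch
            ops.append(("forward", layer, mb))
    return ops

def vertical_schedule(num_layers, num_micro_batches):
    """Vertical order in the spirit of GreedySnake: a layer's parameters are
    fetched once and reused across all micro-batches before moving on."""
    ops = []
    for layer in range(num_layers):
        ops.append(("fetch_params", layer))          # fetched once per layer
        for mb in range(num_micro_batches):
            ops.append(("forward", layer, mb))
        ops.append(("evict_params", layer))
    return ops
```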

Ginex (Park et al., 2022) introduces the inspector–executor separation. Sampling ("inspector") is performed for a superbatch to profile node access patterns, followed by changeset precomputation, cache initialization, and a main execution loop ("executor") handling gather/transfer/compute. This decoupling allows for offline simulation and optimization of memory cache updates via Belady’s algorithm, drastically reducing SSD read misses.
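
A schematic rendering of that control flow, assuming hypothetical sampler, cache, feature-store, and model interfaces rather than Ginex's actual API, might look as follows:

```python
def train_superbatch(superbatch, sampler, cache, feature_store, model):
    """Sketch of inspector-executor separation over one superbatch (hypothetical helpers)."""
    # --- inspector: sampling only, no feature gathering ---
    sampled = [sampler.sample(seed_nodes) for seed_nodes in superbatch]
    access_trace = [batch.node_ids for batch in sampled]

    # --- offline planning: decide per-iteration cache contents (e.g. via Belady) ---
    changesets = cache.plan(access_trace)

    # --- executor: gather / transfer / compute ---
    for batch, changes in zip(sampled, changesets):
        cache.apply(changes)                                  # evictions + insertions
        feats = cache.gather(batch.node_ids, feature_store)   # hits from DRAM, misses from SSD
        loss = model(batch, feats)
        loss.backward()
```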

DDLP (Wei et al., 17 Apr 2024) implements dual-pronged preprocessing, orchestrating CPU and CSD workers to process data from opposite ends of the dataset (MTE, WRR modes), calibrating work allocation dynamically by measured throughput and adapting GPU batch consumption based on readiness.

3. Optimal Caching and Data Movement Algorithms

Efficient SSD-offloading depends critically on minimizing random I/O and maximizing cache hit rates. In Ginex (Park et al., 2022), the caching objective is $\min \sum_{v \in Y} t(v)$ over candidate sets $Y$, solved exactly via Belady’s clairvoyant eviction: for each superbatch, future access times for each feature vector are precomputed, Top-$M$ selections are performed each iteration, and memory-resident cache content is updated accordingly. This design produces near-minimal SSD read misses (<10%) compared to conventional page caching.
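
A simplified, self-contained version of that clairvoyant planning step is sketched below; it operates on the inspector's recorded access trace and is illustrative rather than Ginex's exact implementation.

```python
def belady_plan(access_trace, capacity):
    """Offline Belady (clairvoyant) cache planning for one superbatch.

    access_trace: list of per-iteration lists of node ids recorded by the inspector.
    Returns the per-iteration miss sets; under pressure, the resident entry whose
    next use lies farthest in the future is evicted (or the new entry is bypassed
    if it is itself the farthest one).
    """
    INF = float("inf")
    future = {}                                    # id -> ascending list of access times
    for t, ids in enumerate(access_trace):
        for nid in ids:
            future.setdefault(nid, []).append(t)

    cache, misses = {}, []                         # cache maps id -> its next access time
    for t, ids in enumerate(access_trace):
        miss = set()
        for nid in ids:
            uses = future[nid]
            uses.pop(0)                            # consume the current access
            nxt = uses[0] if uses else INF
            if nid in cache:
                cache[nid] = nxt                   # hit: refresh its next-use time
                continue
            miss.add(nid)                          # must be read from SSD this iteration
            if len(cache) < capacity:
                cache[nid] = nxt
            else:
                victim = max(cache, key=cache.get) # resident entry used farthest in the future
                if cache[victim] > nxt:            # keep whichever is needed sooner
                    del cache[victim]
                    cache[nid] = nxt
        misses.append(miss)
    return misses
```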

TERAIO (Yuan et al., 6 Jun 2025) applies tensor lifetime analysis: each tensor's period of inactivity is estimated by profiling PyTorch operator launches, and offload/prefetch windows are then scheduled to coincide with periods of peak memory pressure. A greedy benefit/cost heuristic selects tensor migration opportunities such that total iteration time is minimized under the available bandwidth and DRAM constraints. GPUDirect Storage enables bulk migration and prioritization of urgent tensors.
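
The flavor of such a heuristic can be sketched as follows; the tensor descriptors, field names, and scoring function are illustrative assumptions, not TERAIO's actual code.

```python
def select_offload_candidates(tensors, ssd_bw_gbps, io_budget_s):
    """Greedy benefit/cost tensor selection (illustrative sketch).

    Each entry: {"name": str, "size_gb": float, "idle_s": float}, where idle_s is
    the gap between the tensor's last forward use and first backward use, taken
    from operator-launch profiling. A tensor is worth migrating only if its
    offload and prefetch fit inside that idle window, and total extra I/O must
    stay within the per-iteration budget the SSD bandwidth can absorb.
    """
    def benefit_per_cost(t):
        transfer = 2 * t["size_gb"] / ssd_bw_gbps          # write out + read back
        return t["size_gb"] * t["idle_s"] / transfer, transfer

    plan, io_used = [], 0.0
    for t in sorted(tensors, key=lambda t: benefit_per_cost(t)[0], reverse=True):
        _, transfer = benefit_per_cost(t)
        if transfer > t["idle_s"]:
            continue                    # migration could not be hidden behind compute
        if io_used + transfer > io_budget_s:
            break                       # SSD bandwidth budget for this iteration exhausted
        plan.append(t["name"])
        io_used += transfer
    return plan
```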

MemAscend (Liaw et al., 29 May 2025) attacks system memory bottlenecks by partitioning buffer pools according to tensor shapes (embedding, feedforward, key/value, query/output), eliminating internal fragmentation, and introducing zero-overhead allocation routines (malloc + cudaHostRegister), which together yield >55% DRAM savings. A fused overflow-check kernel based on bitwise IEEE-754 inspection nearly eliminates the associated peak CPU memory spikes and latency.
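
A toy, PyTorch-level approximation of shape-partitioned pinned-buffer pooling is shown below; MemAscend itself operates at the malloc + cudaHostRegister level, and the class and its interface here are hypothetical.

```python
import torch

class ShapeBucketedPinnedPool:
    """Reuse pinned host buffers keyed by (shape, dtype), so a buffer is never
    handed to a request of a different size and no slot carries internal
    fragmentation. Requires a CUDA-enabled PyTorch build."""
    def __init__(self):
        self._free = {}                      # (shape, dtype) -> list of idle buffers

    def acquire(self, shape, dtype=torch.float16):
        key = (tuple(shape), dtype)
        bucket = self._free.setdefault(key, [])
        if bucket:
            return bucket.pop()
        return torch.empty(shape, dtype=dtype, pin_memory=True)

    def release(self, buf):
        self._free.setdefault((tuple(buf.shape), buf.dtype), []).append(buf)

# usage: stage a parameter shard through a pinned buffer before an SSD write
pool = ShapeBucketedPinnedPool()
staging = pool.acquire((4096, 4096))
# staging.copy_(gpu_shard, non_blocking=True)   # hypothetical GPU-resident source
pool.release(staging)
```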

SSDTrain (Wu et al., 19 Aug 2024) implements full overlap of activation offload/prefetch with GPU compute, using Python hooks in PyTorch’s autograd to asynchronously schedule SSD transfers—enabling near-zero I/O stalls with a 47% peak activation memory reduction.
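
A minimal illustration of this hook-based approach, using PyTorch's public saved-tensors hooks and offloading only to pinned host memory rather than all the way to SSD, might look as follows; the function name and usage are assumptions, not SSDTrain's API.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

def offload_activations(side_stream: torch.cuda.Stream):
    """Context manager sketching asynchronous activation offload via autograd hooks.

    Saved activations are copied to pinned host memory on a side CUDA stream during
    the forward pass (a real system would push them onward to SSD, e.g. via
    GPUDirect Storage) and copied back on demand in backward. Requires CUDA.
    """
    def pack(t):
        if not t.is_cuda:
            return t                                           # leave CPU tensors alone
        host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        side_stream.wait_stream(torch.cuda.current_stream())   # t is fully produced
        with torch.cuda.stream(side_stream):
            host.copy_(t, non_blocking=True)                   # overlaps with later layers
        t.record_stream(side_stream)                           # keep t's memory alive for the copy
        return host, t.device

    def unpack(packed):
        if isinstance(packed, torch.Tensor):
            return packed
        host, device = packed
        torch.cuda.current_stream().wait_stream(side_stream)   # offload copy has finished
        return host.to(device, non_blocking=True)

    return saved_tensors_hooks(pack, unpack)

# usage (hypothetical model and batch):
# with offload_activations(torch.cuda.Stream()):
#     loss = model(batch).sum()
# loss.backward()
```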

4. Integrated Optimizations: Scheduling, Memory, and I/O

Scheduler and memory placement are co-optimized to saturate throughput. GreedySnake (Yue et al., 19 Dec 2025) overlaps optimizer steps with forward passes, allowing a tunable delay ratio $\alpha$ such that a fraction of CPU-side updates completes during subsequent GPU computation, directly subtracting that work from the critical I/O path in each iteration.
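
The idea can be sketched with a thread-based toy model; the callables and the way parameter shards are split are hypothetical stand-ins for GreedySnake's actual CPU/GPU coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def iteration(gpu_forward_backward, cpu_optimizer_step, param_shards, alpha=0.5):
    """Overlap a fraction alpha of CPU optimizer work with the next GPU pass (sketch).

    Shards listed first are updated on the critical path; the remaining alpha
    fraction is handed to a background worker and joined only after the GPU work
    for this iteration has been launched.
    """
    cut = int(len(param_shards) * (1 - alpha))
    eager, deferred = param_shards[:cut], param_shards[cut:]

    for shard in eager:                      # must finish before the next forward begins
        cpu_optimizer_step(shard)

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = [pool.submit(cpu_optimizer_step, s) for s in deferred]
        gpu_forward_backward()               # runs while the deferred updates proceed
        for f in pending:                    # join before those layers are touched again
            f.result()
```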

MemAscend (Liaw et al., 29 May 2025) bypasses the ext4 + O_DIRECT path in favor of raw asynchronous NVMe I/O (AIO/io_uring) writes, stripes large tensors across multiple SSDs, and sizes buffer allocations to avoid over-alignment, improving write bandwidth by up to 20% and reducing latency by 4×.
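
The striping idea, stripped of the raw asynchronous I/O machinery, can be illustrated as follows; the file paths, stripe size, and use of plain buffered writes are simplifying assumptions.

```python
BLOCK = 4096          # assumed NVMe logical block size

def striped_write(buf: bytes, paths, stripe_bytes=1 << 20):
    """Stripe one serialized tensor across several SSDs in fixed-size chunks.

    Each device receives every len(paths)-th stripe, so a single large tensor
    write draws on the aggregate bandwidth of all drives. The payload is padded
    to the block size so offsets and lengths stay aligned.
    """
    buf = buf + b"\0" * ((-len(buf)) % BLOCK)
    files = [open(p, "wb", buffering=0) for p in paths]
    try:
        for i in range(0, len(buf), stripe_bytes):
            dev = (i // stripe_bytes) % len(files)   # round-robin over devices
            files[dev].write(buf[i:i + stripe_bytes])
    finally:
        for f in files:
            f.close()
```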

In DDLP (Wei et al., 17 Apr 2024), batch quotas and adaptive selection strategies are recalibrated each epoch to keep CPU and CSD throughput in step, while the GPU fetch logic dynamically chooses batches to maximize overlap. WRR scheduling achieves finer-grained gains than MTE and results in higher energy efficiency and compute utilization.
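
A throughput-proportional quota calculation in the spirit of such weighted allocation might look like this; it is a sketch, not DDLP's actual policy.

```python
def wrr_quotas(cpu_throughput, csd_throughput, batches_per_epoch):
    """Recompute per-epoch batch quotas for CPU and CSD preprocessing workers in
    proportion to their measured throughputs (batches/s)."""
    total = cpu_throughput + csd_throughput
    cpu_quota = round(batches_per_epoch * cpu_throughput / total)
    return cpu_quota, batches_per_epoch - cpu_quota

# usage: measured 900 and 300 batches/s -> quotas (750, 250) out of 1000 batches
print(wrr_quotas(900.0, 300.0, 1000))
```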

5. Empirical Performance, Resource Savings, and Trade-Offs

Reported results demonstrate substantial gains across key metrics:

  • Throughput: Ginex increases GNN training throughput by 2.11× over PyG+ and 1.23–1.57× over Ali+PG (Park et al., 2022). GreedySnake delivers 1.96–2.53× higher LLM training throughput than ZeRO-Infinity at batch sizes practical for A100 hardware (Yue et al., 19 Dec 2025). TERAIO achieves a 1.47× speedup over ZeRO-Offload/Infinity, reaching 80.7% of the ideal infinite-GPU-memory baseline (Yuan et al., 6 Jun 2025). DDLP exhibits up to 23.5% acceleration for ImageNet and 27.6% for CIFAR-10 (Wei et al., 17 Apr 2024).
  • Memory Efficiency: SSDTrain reduces activation peak memory usage by 47% (Wu et al., 19 Aug 2024); MemAscend cuts system DRAM footprint by 55.7% across diverse LLMs compared to ZeRO-Infinity, supporting contexts up to 131k tokens—8× higher than previously possible without extra hardware (Liaw et al., 29 May 2025).
  • Energy Consumption: STANNIS demonstrates 2.7× speedup and 69% energy reduction in federated DNN training relative to host-only operation, with strict privacy guarantees by keeping private data on-device (HeydariGorji et al., 2020).
  • Resource Utilization: DDLP reduces CPU/DRAM usage by up to 37.6% per batch (Wei et al., 17 Apr 2024).

Trade-offs arise from SSD cost, device placement and topology requirements, and dependencies on host/driver support. Overhead from small batch sizes or non-4 KB-aligned tensor layouts can prevent optimal performance. Full benefit is achieved when SSD bandwidth and DRAM are co-provisioned for the expected data migration and compute overlap.

6. Practical Guidelines and System Design Considerations

Best practices distilled from recent systems include:

  • Store SSD-resident tensors and feature tables as 4 KB-aligned raw files and use direct I/O to avoid page-cache overhead and misaligned accesses (Ginex, MemAscend).
  • Prefer GPUDirect Storage or raw asynchronous NVMe I/O for bulk transfers, and stripe large tensors across multiple SSDs to aggregate bandwidth (TERAIO, MemAscend).
  • Profile tensor lifetimes and access patterns ahead of execution, then schedule offload and prefetch so transfers overlap fully with GPU compute (Ginex, TERAIO, SSDTrain).
  • Co-provision SSD bandwidth, pinned DRAM staging buffers, and scheduling slack (e.g., optimizer-step overlap) for the expected migration volume, partitioning buffer pools by tensor shape to avoid fragmentation (GreedySnake, MemAscend).

7. Future Directions and Limitations

Key outstanding directions involve:

  • Extending SSD-offloading frameworks to remote/distributed scenarios (e.g., ZNS-sharded offload, multi-node activation checkpointing).
  • Integrating additional compute primitives (e.g., vector ALUs, systolic arrays) into CSDs for acceleration (STANNIS (HeydariGorji et al., 2020), ISP-ML (Choe et al., 2016)).
  • Joint optimization of precision scheduling, quantization, and tensor sparsity to further reduce I/O.
  • Deeper pipeline parallel scheduler integration—dynamically adapting micro-batch and offload cut-points based on model and hardware characteristics (SSDTrain (Wu et al., 19 Aug 2024)).
  • Structured co-design of host and SSD or CSD compute, enabling division of model layers between local and storage-resident execution (ISP-ML (Choe et al., 2016)).

Current limitations include strong hardware dependencies: GPUDirect Storage support, NVMe PCIe locality, SSD bandwidth saturation, non-uniform memory access, and device fragmentation in the legacy PyTorch CachingHostAllocator. Some systems remain simulation- or prototype-based; further standardization will be necessary for broad production deployment.

References

  • Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching (Park et al., 2022)
  • STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage (HeydariGorji et al., 2020)
  • Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage (Yuan et al., 6 Jun 2025)
  • SSDTrain: An Activation Offloading Framework to SSDs for Faster LLM Training (Wu et al., 19 Aug 2024)
  • MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning (Liaw et al., 29 May 2025)
  • GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping (Yue et al., 19 Dec 2025)
  • Dual-pronged deep learning preprocessing on heterogeneous platforms with CPU, GPU and CSD (Wei et al., 17 Apr 2024)
  • Near-Data Processing for Differentiable Machine Learning Models (Choe et al., 2016)
