Out-of-Core Training Strategies
- Out-of-Core Training is a set of techniques that enable training of models beyond on-device memory limits by judiciously offloading data, activations, and parameters.
- It employs methods like memory offloading, chunk-based state management, and pipelined I/O to balance compute and communication, achieving high throughput.
- These strategies are pivotal in training LLMs, GNNs, and other large-scale models, delivering near-baseline accuracy with significantly improved resource efficiency.
Out-of-core training strategies enable the training of machine learning and deep learning models whose data or parameter footprints exceed the available device memory by judiciously staging data, activations, or model states between fast (on-device) and slow (host, disk, or network) memory resources. Such strategies are now central to LLM pretraining, billion-scale graph neural network (GNN) learning, and high-dimensional optimization in computational science and kernel methods, underpinning hardware efficiency at the terabyte and 100B+ parameter scale.
1. Foundations of Out-of-Core Training
Out-of-core training (also: "extreme memory disaggregation") refers to techniques in which some constituent tensors (input data, model parameters, activations, gradients, optimizer states) are intentionally kept outside the "core" (typically GPU) device memory and migrated on demand according to a schedule designed to maximize system throughput while preserving model convergence (Sun et al., 2024, Yang et al., 2024, Sun et al., 2023, Waleffe et al., 2022, Hayakawa et al., 2020). Such strategies become necessary when the working set (e.g., model state, activations, or an epoch's worth of data) greatly exceeds GPU RAM. Principal drivers include 100B-parameter model sizes, large-batch learning for stability, and highly structured data (graphs, PDE simulators) whose storage grows with |V|, |E|, or the size of the simulated state.
Historically, out-of-core workflows were explored for SVM solvers and online large-scale optimization (Ramanan, 2013); their modern resurgence—coinciding with LLM and GNN scaling—integrates high-performance I/O pipelines and global caching, and incorporates adaptive scheduling to overlap compute and communications (Sun et al., 2024, Yang et al., 2024).
2. Memory Offloading and Model State Management
Key to all out-of-core systems is partitioning model state into core- and out-of-core-resident components. Representative approaches include:
- Centralized storage of parameters and optimizer states: LuWu stores all floating-point parameters and optimizer state (e.g., Adam moment estimates) on SSDs attached to a dedicated in-network optimizer node, with worker GPUs holding only microbatch shards and activations (Sun et al., 2024).
- Chunk-based model state offload: ProTrain divides model states (FP16 weights, FP32 masters, gradients, optimizer states) into uniform-size "chunks", designating a subset as GPU-resident (persistent) and staging others to host RAM. Before a chunk is needed for forward/backward, it is prefetched CPU→GPU, then offloaded after gradient reduction, with optimizations for overlap and buffer sizing (Yang et al., 2024).
- Optimizer execution off-GPU: By staging optimizer updates (e.g., Adam) entirely on a network node or CPU, both parameters and optimizer states are out-of-core, freeing compute device memory for activations and pipeline buffers alone (Sun et al., 2024).
In such a configuration, the on-device memory footprint is reduced to activations and pipeline buffers, while capacity for parameters and optimizer states is provisioned off-device.
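As a rough illustration of how much state these approaches move off the device, the sketch below tallies model-state bytes per parameter under mixed-precision Adam. The byte counts (2 for FP16 weights, 2 for FP16 gradients, 4 each for the FP32 master copy and the two Adam moments) are a common accounting assumption rather than figures reported by the systems above, and `model_state_bytes` is a hypothetical helper.

```python
# Assumed byte cost per parameter for each model-state component
# (common mixed-precision Adam accounting; not taken from the cited papers).
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}

def model_state_bytes(num_params, resident):
    """Return (on_device_bytes, off_device_bytes) for a residency choice.

    `resident` is the set of component names kept in device memory;
    every other component is assumed to live in host RAM or on SSD.
    """
    per_param_total = sum(BYTES_PER_PARAM.values())
    per_param_on_device = sum(
        size for name, size in BYTES_PER_PARAM.items() if name in resident
    )
    on_device = per_param_on_device * num_params
    return on_device, per_param_total * num_params - on_device

# Usage: a 175B-parameter model with every state component offloaded.
on_dev, off_dev = model_state_bytes(175_000_000_000, resident=set())
print(f"on device: {on_dev} bytes, off device: {off_dev / 1e12:.1f} TB")
```

Under these assumptions the offloaded state for a 175B-parameter model is on the order of terabytes, which is why SSD-backed or host-backed placement is needed at that scale.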
3. Data and Activation Management
Management of intermediate activations or dynamically-accessed data is the second principal axis:
- Pipelined swap, checkpointing, and recomputation: Out-of-core schedules (e.g., KARMA (Wahib et al., 2020), ProTrain (Yang et al., 2024), adaptive window-based scheduling (Hayakawa et al., 2020)) partition the computational DAG into "blocks" or "windows" of layers/functions. Swap-out/in decisions are based on block memory usage, and recomputation (checkpointing) trades compute for memory. Conducting swaps during executions of later blocks, and recomputation for select activations, maximizes device occupancy and saturates memory bandwidth.
- Virtual addressing to mitigate fragmentation: Frequent, asynchronous memory transfers can cause external fragmentation in the device allocator. OS-style virtual addressing maps large logical allocations to fixed-size physical chunks, eliminating external fragmentation and capping internal overhead (Hayakawa et al., 2020).
- Multi-level (GPU/CPU/SSD) caching: For GNN workloads, systems such as Helios and DiskGNN maintain a multi-level cache for vertex features. "Hot" nodes are kept in GPU DRAM, a mid-frequency cohort in pinned CPU memory, and the tail is staged on SSD in either disk caches or packed chunks to minimize read amplification (Sun et al., 2023, Liu et al., 2024).
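The trade-off among keeping, swapping, and recomputing activations can be sketched as a small search over per-block modes. This is a simplified illustration, not the ILP used by KARMA or the auto-tuner used by ProTrain: the block profiles are invented numbers, resident memory is approximated as the sum of kept blocks, and `plan_blocks` is a hypothetical name.

```python
from itertools import product

# Hypothetical per-block profile: activation memory (MB), swap time that is
# not hidden behind compute (ms), and recomputation time (ms).
BLOCKS = [
    {"mem": 400, "swap_ms": 6.0, "recompute_ms": 3.5},
    {"mem": 300, "swap_ms": 2.0, "recompute_ms": 4.0},
    {"mem": 200, "swap_ms": 1.0, "recompute_ms": 5.0},
]
MODES = ("keep", "swap", "recompute")

def plan_blocks(blocks, budget_mb):
    """Exhaustively choose one mode per block.

    Returns (overhead_ms, modes) for the cheapest plan whose kept
    activations fit within budget_mb.
    """
    best = None
    for modes in product(MODES, repeat=len(blocks)):
        resident = sum(b["mem"] for b, m in zip(blocks, modes) if m == "keep")
        if resident > budget_mb:
            continue  # infeasible: kept activations exceed the device budget
        overhead = sum(
            b["swap_ms"] if m == "swap"
            else b["recompute_ms"] if m == "recompute"
            else 0.0
            for b, m in zip(blocks, modes)
        )
        if best is None or overhead < best[0]:
            best = (overhead, modes)
    return best

print(plan_blocks(BLOCKS, budget_mb=500))
```

Exhaustive enumeration grows as 3^n in the number of blocks, so it is only practical for coarse block partitions; real schedulers prune the search or use a solver.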
4. Scheduling and Overlap Strategies
Performance of out-of-core training is fundamentally tied to the scheduling of memory transfers and overlap with computation. State-of-the-art systems adopt:
- NIC/Switch offload for collective operations: In-network aggregation (LuWu (Sun et al., 2024)) moves gradient summation to network switches, bypassing per-GPU kernel collectives and allowing GEMM/convolution kernels to run undisturbed. Parameter or gradient movement is managed directly by programmable NICs (GPUDirect DMA), and traffic is reduced by a factor of 2 compared to naive ring-allreduce, lowering communication time.
- Block-based pipelining: ProTrain (Yang et al., 2024) and adaptive window-based (Hayakawa et al., 2020) strategies both compute block-wise execution orders and select an interleaving mode (swap, checkpoint, or none) for each block, estimating overlap from measured per-block costs. A search loop enumerates feasible configurations, discards any that exceed the device memory budget, and selects the combination that minimizes iteration time.
- Online data generation: For simulation-based learning (deep surrogates for PDEs), training is integrated with ongoing parallel simulation processes. Data is streamed via message passing (e.g., ZeroMQ), buffered in RAM, and sampled for mini-batch updates without ever being written to disk (Meyer et al., 2023).
- Prefetching and pipelined I/O: Many systems exploit producer–consumer pipelines: prefetching partitions or features into RAM while computation proceeds, overlapping SSD/host I/O and GPU compute (Waleffe et al., 2022, Liu et al., 2024). GNN systems such as MariusGNN schedule partition loads and edge bucket assignments according to an adaptive buffer policy (COMET), deferring bucket assignment to balance I/O with randomness for SGD (Waleffe et al., 2022).
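The producer-consumer pattern behind prefetching and pipelined I/O can be sketched with a bounded queue and a background thread. This is a minimal, generic illustration rather than the pipeline of any system named above; `prefetching_loader`, `load_partition`, and `depth` are hypothetical names, and the loader in the usage lines is a stand-in for a real SSD or host read.

```python
import queue
import threading

def prefetching_loader(load_partition, partition_ids, depth=2):
    """Yield loaded partitions while a background thread fetches the next ones."""
    buf = queue.Queue(maxsize=depth)  # bounded buffer caps host memory use
    done = object()                   # sentinel marking the end of the stream

    def producer():
        try:
            for pid in partition_ids:
                buf.put(load_partition(pid))  # blocks while the buffer is full
            buf.put(done)
        except Exception as exc:              # forward load errors to the consumer
            buf.put(exc)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item

# Usage: the stand-in loader doubles the id; a real one would read from disk.
for part in prefetching_loader(lambda pid: pid * 2, [1, 2, 3]):
    print(part)  # work on `part` overlaps with loading of later partitions
```

The `depth` parameter bounds how far I/O may run ahead of compute, which is the knob that trades host memory for overlap.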
5. System Architectures and Case Studies
A variety of architectures are exploited in out-of-core system design, exemplified as follows:
| System | Memory Disaggregation | Staging/Offload Location | Notable Scheduling |
|---|---|---|---|
| LuWu (Sun et al., 2024) | GPU ↔ SSD via AggNIC/SmartSwitch | Optimizer node (SSD), in-switch aggregation | Full overlap of compute & collective via network |
| ProTrain (Yang et al., 2024) | GPU ↔ CPU (host) | Chunked model/gradients in RAM, activation swap | Memory-aware auto-tuner, block/checkpoint interleaving |
| Helios (Sun et al., 2023) | GPU ↔ CPU ↔ SSD | Multi-level vertex-feature cache | GPU-initiated async disk IO, overlapped pipelines |
| DiskGNN (Liu et al., 2024) | GPU ↔ CPU ↔ SSD | Four-level features: GPU cache, CPU cache, disk cache, packed chunks | Offline sampling, batched caching, pipelined execution |
| KARMA (Wahib et al., 2020) | GPU ↔ CPU (host) | Layer blocks, host-resident optimizer | ILP-guided scheduling, swap/recompute, multi-node extension |
| MariusGNN (Waleffe et al., 2022) | GPU ↔ CPU ↔ SSD | Partitioned node/edge buckets | Pipelined prefetch, COMET adaptive scheduling |
Each system achieves high utilization by minimizing time lost to serial I/O, balancing RAM/SSD provisioning, and adapting its algorithms to preserve convergence properties under pipelined or recomputation policies.
6. Algorithmic and Statistical Considerations
Out-of-core training introduces challenges for preserving convergence rates, randomness, and statistical properties:
- SVMs and dual coordinate solvers: For large-scale SVMs, only a small cache of "active" dual constraints (support vectors) need be retained in memory at any time. Streaming exploration discovers new margin-violating samples, while a batch solver re-optimizes the cache (Ramanan, 2013).
- Subspace clustering memory banks: In mini-batch deep subspace clustering (DSC), out-of-core training is enabled by a "memory bank" that tracks detached latent codes; matrix multiplications reconstruct outputs for the current batch, and learning rates are damped to counteract stale bank entries (Jiang et al., 2025).
- Sampling, caching, and partitioning: For GNNs, sampling is often performed offline to precompute node access frequencies, which drive multilevel cache allocation and feature packing (Liu et al., 2024). Deferred randomization and reassignment of partitions/buckets can restore statistical properties typically lost in strict partitioned orders (Waleffe et al., 2022).
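The idea of using offline sampling to drive multilevel cache allocation can be sketched as a frequency ranking over pre-sampled mini-batches. This is an assumption-laden toy, not DiskGNN's actual placement algorithm: `assign_tiers`, the slot counts, and the three tier labels are hypothetical, and nodes that never appear in the samples are simply left out of the mapping (they would remain on SSD).

```python
from collections import Counter

def assign_tiers(sampled_batches, gpu_slots, cpu_slots):
    """Map each sampled node id to 'gpu', 'cpu', or 'ssd' by access count."""
    counts = Counter(node for batch in sampled_batches for node in batch)
    tiers = {}
    # most_common() orders nodes from most to least frequently accessed.
    for rank, (node, _) in enumerate(counts.most_common()):
        if rank < gpu_slots:
            tiers[node] = "gpu"   # hottest nodes stay in device memory
        elif rank < gpu_slots + cpu_slots:
            tiers[node] = "cpu"   # mid-frequency nodes go to pinned host RAM
        else:
            tiers[node] = "ssd"   # the long tail is read from disk
    return tiers

# Usage: node ids accessed by four offline-sampled mini-batches.
batches = [[1, 2, 3], [1, 2, 4], [1, 5, 2], [1, 3]]
print(assign_tiers(batches, gpu_slots=1, cpu_slots=2))
```

Because placement is fixed before training starts, the sampling order can still be shuffled across epochs without invalidating the cache layout.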
7. Empirical Performance and Scalability
Experimental evaluations consistently demonstrate the throughput and scalability of out-of-core strategies:
- LLM training: LuWu achieves 3.98× speedup over state-of-the-art on a 175B parameter model across an 8-worker (A100 GPUs) cluster, sustaining full overlap and pointer-only activations on device (Sun et al., 2024). ProTrain attains 1.43–2.71× throughput improvement vs. ZeRO-Offload, permitting training of 70B models on 4 × A100 with no parameter code modifications (Yang et al., 2024).
- GNNs: Helios sustains high GPU SM utilization on terabyte-scale graphs and outpaces both GIDS and CPU-based baselines (Sun et al., 2023). DiskGNN matches in-memory accuracy while delivering higher epoch throughput than Ginex and MariusGNN (Liu et al., 2024).
- Other domains: Out-of-core window-based scheduling trains ResNet-50 at a scale 7.5× beyond physical GPU RAM while retaining a fraction of baseline speed, improving both memory efficiency and trainable model size over prior LMS systems (Hayakawa et al., 2020). XGBoost's out-of-core boosting sustains accuracy on datasets that exceed GPU memory, with negligible impact on AUC or convergence (Ou, 2020). For PDE surrogates, streaming/online generation improves generalization, reduces I/O, and exposes models to more data (Meyer et al., 2023).
Empirical studies consistently observe little degradation in model accuracy versus in-core or fully in-memory baselines, with well-tuned pipelining yielding substantial gains in resource efficiency or wall-clock time (Sun et al., 2024, Sun et al., 2023, Waleffe et al., 2022).
References
- LuWu: In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training (Sun et al., 2024)
- ProTrain: Efficient LLM Training via Memory-Aware Techniques (Yang et al., 2024)
- Helios: Out-of-core GNN Training on Terabyte-Scale Graphs (Sun et al., 2023)
- DiskGNN: I/O Efficiency and Accuracy for Out-of-Core GNN Training (Liu et al., 2024)
- KARMA: Scaling Distributed Deep Learning Workloads beyond the Memory Capacity (Wahib et al., 2020)
- Out-of-core Training for Extremely Large-Scale Neural Networks (Hayakawa et al., 2020)
- Out-of-Core GPU Gradient Boosting (Ou, 2020)
- MariusGNN: Out-of-Core Training of Graph Neural Networks (Waleffe et al., 2022)
- Dual coordinate solvers for large-scale structural SVMs (Ramanan, 2013)
- Mini-Batch Training for Deep Subspace Clustering Networks (Jiang et al., 2025)
- Training Deep Surrogate Models with Large Scale Online Learning (Meyer et al., 2023)