Large-scale fine-tuning typically demands clusters of expensive GPUs because the optimizer states and activations of 100 B-parameter models exceed the 80 GB ceiling of today’s biggest devices. “Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU” introduces Fuyou, a PyTorch-based training framework that turns an off-the-shelf workstation (e.g., one RTX 4090, ≤768 GB DRAM, a handful of NVMe SSDs) into a platform capable of high-throughput fine-tuning of models as large as GPT-3-175 B—something state-of-the-art engines such as ZeRO-Infinity or Colossal-AI cannot do on similar hardware.
Key problems with existing off-loading systems
- CPU-memory bottleneck – ZeRO-Infinity keeps activations in DRAM; with ≤512 GB host RAM it cannot even fit 65 B parameters.
- Idle GPU time – ZeRO-Infinity performs out-of-core Adam after backward, leaving the GPU idle for 40 – 70 % of each step and achieving <30 % utilisation on a single A100.
Fuyou: system overview
The central idea is to treat SSD ⇆ CPU traffic as a first-class optimisation axis and co-design swapping, prefetching and computation so that all five resources—GPU compute, CPU compute, GPU↔CPU PCIe, CPU↔SSD bandwidth, SSD I/O—remain simultaneously busy.
Component pipeline
- Profiling phase
- Runs one step with everything swapped to SSD; records per-layer FLOPs, activation/parameter sizes, GPU/CPU times and link bandwidths.
- GPU-aware FIFO prefetcher
- Uses remaining GPU memory (after model/mini-batch allocations) as a sliding-window buffer; pulls the next parameters/activations as soon as space frees up, enabling continuous overlap of PCIe transfers, SSD I/O and GPU kernels (see the prefetcher sketch after this list).
- Synchronous out-of-core CPU Adam overlapped with backward
- Gradients streamed from the GPU are immediately consumed by a separate CPU process; weight updates stay synchronous (no delayed update, so convergence is unchanged) yet are hidden behind the back-propagation of subsequent layers (see the overlapped-Adam sketch after this list).
- Delayed write-back: reading group i+1 from SSD overlaps writing group i, maximising SSD duplex bandwidth.
- GPU-CPU-SSD three-level activation swapping
- Activations initially leave GPU to DRAM; if DRAM pressure arises they are flushed to SSDs in a fully pipelined fashion.
- Automatic activation-swap scheduler
- Cost model predicts the iteration time T_iter = T_fwd + T_back+opt (backward and the optimizer form one overlapped phase), where each term is max(compute, PCIe, SSD) time.
- Search space: swap coefficient D_f (bytes of activations written per step).
- Upper bound: limited by (a) free GPU memory; (b) overlap window T_max = T_bcomp − max(T_PCIe, T_SSD).
- Priority order: layers ranked by “swap-benefit factor” SBF = FLOPs / SwapTime; linear_4h_to_h gets highest priority.
- Iteratively increases D_f until the predicted T_iter stops decreasing (a toy version of this search follows the list).
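The sliding-window prefetcher can be pictured with a short PyTorch sketch. This is not Fuyou's code: pinned CPU tensors stand in for data that Fuyou would stage from SSD, `budget_bytes` stands in for the leftover GPU memory used as the window, and the class and variable names are invented for illustration.

```python
import collections
import torch

class FIFOPrefetcher:
    """Keeps a FIFO window of tensors in flight on a dedicated copy stream."""

    def __init__(self, cpu_tensors, budget_bytes, device="cuda"):
        self.cpu_tensors = cpu_tensors            # pinned CPU tensors, in layer order
        self.budget = budget_bytes                # free GPU memory usable as the window
        self.device = device
        self.copy_stream = torch.cuda.Stream(device)
        self.window = collections.deque()         # FIFO of (gpu_tensor, ready_event, nbytes)
        self.in_flight = 0
        self.next_idx = 0

    def _fill(self):
        # Issue async H2D copies on the side stream while the window has room.
        while self.next_idx < len(self.cpu_tensors):
            src = self.cpu_tensors[self.next_idx]
            nbytes = src.numel() * src.element_size()
            if self.in_flight + nbytes > self.budget and self.window:
                break                             # window full; wait for a slot to free up
            with torch.cuda.stream(self.copy_stream):
                gpu = src.to(self.device, non_blocking=True)
                ready = torch.cuda.Event()
                ready.record(self.copy_stream)
            self.window.append((gpu, ready, nbytes))
            self.in_flight += nbytes
            self.next_idx += 1

    def next(self):
        # Pop the oldest prefetched tensor; the compute stream blocks only if its
        # copy has not finished, so PCIe traffic hides behind earlier layers' kernels.
        self._fill()
        gpu, ready, nbytes = self.window.popleft()
        torch.cuda.current_stream().wait_event(ready)
        gpu.record_stream(torch.cuda.current_stream())
        self.in_flight -= nbytes
        self._fill()                              # immediately reuse the freed slot
        return gpu

if __name__ == "__main__":                        # requires a CUDA device
    layers = [torch.randn(1024, 1024).pin_memory() for _ in range(8)]
    pf = FIFOPrefetcher(layers, budget_bytes=2 * 1024 * 1024 * 4)  # room for ~2 tensors
    x = torch.randn(1024, 1024, device="cuda")
    for _ in layers:
        x = x @ pf.next()                         # stand-in for one layer's forward kernel
    torch.cuda.synchronize()
```

The point of the window is that the copy stream keeps the PCIe link (and, in Fuyou, the SSDs behind it) busy while the default stream runs kernels; compute stalls only when it reaches a tensor whose copy is still in flight.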
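The overlap between backward and the out-of-core optimizer can be sketched in a similar spirit. Again this is only an illustration, not Fuyou's implementation: a Python worker thread stands in for Fuyou's separate CPU process, per-tensor `register_hook` callbacks stand in for its gradient streaming, Adam bias correction is omitted, and write-back of the updated weights is left to the next iteration's parameter upload.

```python
import queue
import threading
import torch

class OverlappedCPUAdam:
    """Streams each gradient to pinned CPU memory as soon as backward produces it,
    then a CPU worker applies a synchronous Adam update while the GPU keeps
    back-propagating earlier layers."""

    def __init__(self, gpu_params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        self.gpu_params = list(gpu_params)
        # FP32 master weights and Adam states stay on the CPU (SSD-backed in Fuyou).
        self.master = [p.detach().float().cpu() for p in self.gpu_params]
        self.m = [torch.zeros_like(w) for w in self.master]
        self.v = [torch.zeros_like(w) for w in self.master]
        self.lr, self.betas, self.eps = lr, betas, eps
        self.d2h_stream = torch.cuda.Stream()
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()
        for i, p in enumerate(self.gpu_params):
            p.register_hook(lambda g, i=i: self._offload(i, g))

    def _offload(self, i, grad):
        buf = torch.empty(grad.shape, dtype=grad.dtype, pin_memory=True)
        with torch.cuda.stream(self.d2h_stream):
            self.d2h_stream.wait_stream(torch.cuda.current_stream())
            grad.record_stream(self.d2h_stream)   # keep grad alive until the copy reads it
            buf.copy_(grad, non_blocking=True)
            done = torch.cuda.Event()
            done.record(self.d2h_stream)
        self.q.put((i, buf, done))                # backward on the GPU continues immediately

    def _worker(self):
        while True:
            i, grad, done = self.q.get()
            done.synchronize()                    # wait only for this gradient's D2H copy
            b1, b2 = self.betas
            g = grad.float()
            self.m[i].mul_(b1).add_(g, alpha=1 - b1)
            self.v[i].mul_(b2).addcmul_(g, g, value=1 - b2)
            self.master[i].addcdiv_(self.m[i], self.v[i].sqrt().add_(self.eps), value=-self.lr)
            # The refreshed master weight is what gets uploaded for the next iteration.

# Usage (requires a CUDA device):
#   model = MyModel().cuda()
#   opt = OverlappedCPUAdam(model.parameters())
#   model(batch).loss.backward()   # CPU Adam runs while backward is still in progress
```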
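Finally, the cost-model-driven choice of D_f can be approximated by a toy greedy loop. The simplifications are deliberate and loud: each layer's activations are either swapped or recomputed, the GPU-memory and T_max bounds are ignored, and `LayerProfile` plus every bandwidth argument are illustrative stand-ins for the numbers the profiling phase records.

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    act_bytes: float       # activation bytes this layer would swap
    fwd_compute_s: float   # forward time, also the recompute cost if not swapped
    bwd_compute_s: float   # backward time

def predicted_iter_time(layers, swapped, pcie_bw, ssd_bw, opt_io_bytes):
    """T_iter = T_fwd + T_back+opt; each phase is max(compute, PCIe, SSD)."""
    d_f = sum(l.act_bytes for l in layers if l.name in swapped)       # swap coefficient
    recompute = sum(l.fwd_compute_s for l in layers if l.name not in swapped)
    t_fwd = max(sum(l.fwd_compute_s for l in layers), d_f / pcie_bw, d_f / ssd_bw)
    t_back_opt = max(sum(l.bwd_compute_s for l in layers) + recompute,
                     d_f / pcie_bw,                                   # activations read back
                     (d_f + opt_io_bytes) / ssd_bw)                   # plus optimizer traffic
    return t_fwd + t_back_opt

def choose_swapped_layers(layers, pcie_bw, ssd_bw, opt_io_bytes):
    # Swap-benefit factor: recompute cost avoided per second of swap I/O.
    def swap_time(l):
        return l.act_bytes / min(pcie_bw, ssd_bw)
    ranked = sorted(layers, key=lambda l: l.fwd_compute_s / swap_time(l), reverse=True)
    swapped = set()
    best = predicted_iter_time(layers, swapped, pcie_bw, ssd_bw, opt_io_bytes)
    for l in ranked:                   # grow D_f while the predicted time still drops
        t = predicted_iter_time(layers, swapped | {l.name}, pcie_bw, ssd_bw, opt_io_bytes)
        if t >= best:
            break
        swapped, best = swapped | {l.name}, t
    return swapped, best

if __name__ == "__main__":
    layers = [LayerProfile(f"block{i}.linear_4h_to_h", 2e9, 0.12, 0.20) for i in range(4)]
    print(choose_swapped_layers(layers, pcie_bw=25e9, ssd_bw=40e9, opt_io_bytes=2e9))
```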
Implementation details
- Written in pure PyTorch 2.0; uses CUDA events for cross-process sync.
- Tested on 12× Intel P5510 SSDs (PCIe 4.0) but works with as few as 2; RAID not required.
- Requires no GPUDirect Storage; runs on consumer boards.
Experimental results
Hardware: 1× A100-80 GB or 1× RTX 4090 (24 GB), 768 GB DDR4, up to 12 NVMe SSDs.
- Maximum trainable size (batch = 1)
- With 768 GB DRAM, Fuyou trains GPT-3-805 B on A100 and 276 B on RTX 4090.
- ZeRO-Infinity tops out at 135 B on both the A100 and the 4090 with the same DRAM; it fails at 65 B if DRAM < 512 GB.
- Throughput (TFLOPS, higher is better)
- GPT-3-175 B, batch = 16 → 172 TFLOPS on A100 (86 % of peak), 87 TFLOPS on RTX 4090.
- GPT-3-13 B, batch = 32 → 202 TFLOPS (A100) vs 59 TFLOPS (ZeRO-Offload), 45 TFLOPS (ZeRO-Infinity), 30 TFLOPS (Colossal-AI).
- 3.4 × speed-up over ZeRO-Infinity on RTX 4090 (156 TFLOPS vs 45).
- Ablations
- Removing backward/optimizer overlap cuts throughput by up to 38 %.
- Disabling pipeline prefetch makes Fuyou only ~1.2–1.3× faster than ZeRO-Infinity; full pipeline raises this to 1.7–2.3×.
- Auto-swap scheduler selects near-optimal D_f for batch sizes {32, 64, 80}, matching the empirical minimum iteration time.
- Cost-effectiveness (tokens / s / $)
- Counting only compute + SSD hardware, Fuyou on 1 × 4090 + 6 SSDs delivers 1.7 × the tokens/s per dollar of a DGX-2 running Megatron-LM.
- Whole-server cost (incl. CPU/motherboard) still reaches 75 % of DGX-2 cost-efficiency, despite using a single GPU.
Take-aways and limitations
- NVMe SSDs are fast enough (~3–7 GB/s each) to act as an additional memory tier for fine-tuning, provided their traffic is meticulously overlapped with compute (see the back-of-envelope bandwidth check after this list).
- Keeping the out-of-core optimizer synchronous while overlapping it with backward avoids the convergence issues of asynchronous (delayed) updates and still keeps the GPU busy.
- The bottleneck shifts from DRAM to GPU memory once a single layer's activations exceed what 24 GB-class GPUs can hold; future work includes tensor slicing or unified-memory techniques to push beyond 276 B on consumer cards.
- Multi-GPU extension (pipeline parallel + Fuyou’s off-load) is left for future research.
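A back-of-envelope check makes the first take-away concrete. The numbers below are assumptions for illustration: the low end (~3.5 GB/s) of the per-drive range quoted above, the six-drive configuration from the cost-effectiveness comparison, and the nominal ~32 GB/s of a PCIe 4.0 x16 GPU link.

```python
per_drive_gb_s = 3.5     # low end of the ~3-7 GB/s quoted for PCIe 4.0 NVMe drives
num_drives = 6           # as in the 1x 4090 + 6 SSD cost-effectiveness setup
pcie4_x16_gb_s = 32      # nominal one-direction bandwidth of the GPU's own link

aggregate_gb_s = per_drive_gb_s * num_drives
print(f"aggregate SSD bandwidth ~= {aggregate_gb_s:.0f} GB/s, "
      f"GPU PCIe 4.0 x16 link ~= {pcie4_x16_gb_s} GB/s")
# ~21 GB/s from the SSD tier is the same order of magnitude as the GPU link itself,
# which is why SSD traffic can hide behind compute once it is fully overlapped.
```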
Fuyou therefore demonstrates that fine-tuning at the 100 B-parameter scale is no longer exclusive to data-centre hardware; with careful system design, a single reasonably priced workstation can train models an order of magnitude larger than previously possible.