Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU (2403.06504v1)

Published 11 Mar 2024 in cs.DC

Abstract: Recent advances in LLMs have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they utilize. However, even the GPUs with the highest memory capacities, currently peaking at 80GB, are far from sufficient to accommodate these vast parameters and their associated optimizer states when conducting stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs. However, this approach introduces prohibitive costs for most academic researchers, who always have a limited budget for many high-end GPU servers. In this paper, we focus on huge model fine-tuning on a single, even low-end, GPU in a commodity server, which is accessible to most AI researchers. In such a scenario, the state-of-the-art work ZeRO-Infinity suffers from two severe issues when running in a commodity server: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient 100B huge model fine-tuning on a low-end server with a low-end GPU and limited CPU memory capacity. The key idea is to add the SSD-CPU communication as an optimization dimension and thus carefully co-optimize computation and data swapping from a systematic approach to maximize GPU utilization. The experimental results show that 1) Fuyou is able to fine-tune 175B GPT-3 on a consumer GPU RTX 4090 with high GPU utilization, while ZeRO-Infinity fails to fine-tune; and 2) when training a small GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS.

Large-scale fine-tuning typically demands clusters of expensive GPUs because the optimizer states and activations of 100 B-parameter models exceed the 80 GB ceiling of today’s biggest devices. “Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU” introduces Fuyou, a PyTorch-based training framework that turns an off-the-shelf workstation (e.g., one RTX 4090, ≤768 GB DRAM, a handful of NVMe SSDs) into a platform capable of high-throughput fine-tuning of models as large as GPT-3-175 B—something state-of-the-art engines such as ZeRO-Infinity or Colossal-AI cannot do on similar hardware.
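To make the memory gap concrete, a quick back-of-the-envelope estimate is useful. The numbers below are illustrative rather than taken from the paper, and assume the usual mixed-precision Adam bookkeeping of roughly 16 bytes of weight, gradient, and optimizer state per parameter:

```python
# Rough, illustrative estimate of resident training state under mixed-precision Adam.
# Assumes ~16 bytes/parameter: fp16 weights (2) + fp16 gradients (2) +
# fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
# This follows common ZeRO-style accounting; it is not a figure from the paper.

def training_state_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Approximate size of weights + gradients + Adam states, in GiB."""
    return num_params * bytes_per_param / 2**30

for name, n in [("GPT-3 13B", 13e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: ~{training_state_gib(n):,.0f} GiB of training state")
# GPT-3 13B:  ~194 GiB   (more than twice an 80 GB GPU, before any activations)
# GPT-3 175B: ~2,608 GiB (far beyond GPU memory, and beyond most servers' DRAM)
```

Even before counting activations, the weights, gradients, and optimizer states alone force model state off the GPU and, once DRAM runs out, onto NVMe SSDs.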


Key problems with existing off-loading systems

  1. CPU-memory bottleneck – ZeRO-Infinity keeps activations in DRAM; with ≤512 GB of host RAM it cannot even fit 65 B parameters.
  2. Idle GPU time – ZeRO-Infinity runs its out-of-core Adam step only after the backward pass finishes, leaving the GPU idle for 40–70% of each step and achieving under 30% utilisation on a single A100.

Fuyou: system overview

The central idea is to treat SSD ⇆ CPU traffic as a first-class optimisation axis and co-design swapping, prefetching and computation so that all five resources—GPU compute, CPU compute, GPU↔CPU PCIe, CPU↔SSD bandwidth, SSD I/O—remain simultaneously busy.

Component pipeline

  1. Profiling phase
    • Runs one step with everything swapped to SSD; records per-layer FLOPs, activation/parameter sizes, GPU/CPU times and link bandwidths.
  2. GPU-aware FIFO prefetcher
    • Uses remaining GPU memory (after model/mini-batch allocations) as a sliding window buffer; pulls next parameters/activations as soon as space frees up, enabling continuous overlap of PCIe, SSD I/O and kernels.
  3. Synchronous out-of-core CPU Adam overlapped with backward
    • Gradients streamed from GPU are immediately consumed by a separate CPU process; weight updates are synchronous (no delayed update, so convergence unchanged) yet hidden behind subsequent layer back-props.
    • Delayed write-back: reading group i+1 from SSD overlaps writing group i, maximising SSD duplex bandwidth.
  4. GPU-CPU-SSD three-level activation swapping
    • Activations initially leave GPU to DRAM; if DRAM pressure arises they are flushed to SSDs in a fully pipelined fashion.
  5. Automatic activation-swap scheduler
    • Cost model predicts the iteration time as T_iter = T_fwd + T_bwd+opt, where each phase time is the maximum of its compute, PCIe, and SSD-I/O terms.
    • Search space: swap coefficient D_f (bytes of activations written per step).
    • Upper bound: limited by (a) free GPU memory; (b) overlap window T_max = T_bcomp − max(T_PCIe, T_SSD).
    • Priority order: layers ranked by “swap-benefit factor” SBF = FLOPs / SwapTime; linear_4h_to_h gets highest priority.
    • Iteratively increases D_f until the predicted T_iter stops decreasing (a sketch of this search loop follows this list).
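A minimal sketch of how such a cost-model-driven search could look is given below. All names are hypothetical, the per-phase costs are collapsed into single aggregate terms, and it adopts the SBF-style assumption that every swapped activation byte saves its recomputation time during the backward pass; the paper's scheduler additionally bounds D_f by free GPU memory and the overlap window, and spends the swap budget in SBF order:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Per-iteration figures from the profiling phase (field names are hypothetical)."""
    t_fwd_compute: float         # forward GPU compute time (s)
    t_bwd_opt_compute: float     # backward + overlapped CPU-Adam time, no swapping (s)
    pcie_bw: float               # effective GPU<->CPU PCIe bandwidth (bytes/s)
    ssd_bw: float                # aggregate CPU<->SSD bandwidth (bytes/s)
    base_pcie_bytes: float       # parameter/gradient traffic that moves regardless of D_f
    base_ssd_bytes: float
    recompute_s_per_byte: float  # backward recompute time avoided per swapped byte

def predicted_iter_time(p: Profile, d_f: float) -> float:
    """Cost model in the spirit of T_iter = T_fwd + T_bwd+opt, with each phase
    bounded by the slowest of its compute, PCIe, and SSD components."""
    t_fwd = max(p.t_fwd_compute,
                (p.base_pcie_bytes + d_f) / p.pcie_bw,   # activations written out
                (p.base_ssd_bytes + d_f) / p.ssd_bw)
    t_bwd = max(p.t_bwd_opt_compute - p.recompute_s_per_byte * d_f,  # less recomputation
                (p.base_pcie_bytes + d_f) / p.pcie_bw,   # activations read back
                (p.base_ssd_bytes + d_f) / p.ssd_bw)
    return t_fwd + t_bwd

def choose_swap_coefficient(p: Profile, d_f_cap: float, step: float) -> float:
    """Greedily grow D_f while the predicted iteration time keeps improving.
    d_f_cap stands in for the paper's bounds (free GPU memory, overlap window)."""
    best_d_f, best_t = 0.0, predicted_iter_time(p, 0.0)
    d_f = step
    while d_f <= d_f_cap:
        t = predicted_iter_time(p, d_f)
        if t >= best_t:
            break
        best_d_f, best_t = d_f, t
        d_f += step
    return best_d_f
```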

Implementation details

  • Written in pure PyTorch 2.0; uses CUDA events for cross-process sync.
  • Tested on 12× Intel P5510 SSDs (PCIe 4.0) but works with as few as 2; RAID not required.
  • Requires no GPUDirect-Storage; runs on consumer boards.
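The copy/compute overlap that the bullets above rely on can be expressed with standard PyTorch primitives. The sketch below is purely illustrative (hypothetical shard sizes and loop structure, not Fuyou's code): it stages parameter shards in pinned host memory and uses a side CUDA stream plus events so the next shard's host-to-device copy runs while the current shard is being consumed:

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream dedicated to host->device copies

def prefetch(pinned_cpu: torch.Tensor):
    """Launch an async H2D copy on the side stream; returns the GPU tensor and an
    event that fires once the copy completes. The source must live in pinned memory."""
    with torch.cuda.stream(copy_stream):
        on_gpu = pinned_cpu.to(device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    return on_gpu, done

# Hypothetical per-layer parameter shards staged in pinned host memory.
cpu_shards = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]

x = torch.randn(4096, 4096, device=device)
shard, ready = prefetch(cpu_shards[0])
for i in range(len(cpu_shards)):
    # Kick off the next copy before touching the current shard.
    if i + 1 < len(cpu_shards):
        nxt, nxt_ready = prefetch(cpu_shards[i + 1])
    # The compute (default) stream waits only for this shard's copy.
    torch.cuda.current_stream().wait_event(ready)
    x = x @ shard                   # stand-in for the layer's actual compute
    if i + 1 < len(cpu_shards):
        shard, ready = nxt, nxt_ready
torch.cuda.synchronize()
```

Fuyou layers the SSD-to-CPU leg and a separate out-of-core Adam process on top of this kind of pattern so that all of the links listed earlier stay busy at once.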

Experimental results

Hardware: 1× A100-80 GB or 1× RTX 4090 (24 GB), 768 GB DDR4, up to 12 NVMe SSDs.

  1. Maximum trainable size (batch = 1)
    • With 768 GB DRAM, Fuyou trains GPT-3-805 B on A100 and 276 B on RTX 4090.
    • ZeRO-Infinity tops out at 135 B on both the A100 and the RTX 4090 with the same DRAM, and fails at 65 B if DRAM < 512 GB.
  2. Throughput (TFLOPS, higher is better)
    • GPT-3-175 B, batch = 16 → 172 TFLOPS on A100 (86 % of peak), 87 TFLOPS on RTX 4090.
    • GPT-3-13 B, batch = 32 → 202 TFLOPS (A100) vs 59 TFLOPS (ZeRO-Offload), 45 TFLOPS (ZeRO-Infinity), 30 TFLOPS (Colossal-AI).
    • 3.4 × speed-up over ZeRO-Infinity on RTX 4090 (156 TFLOPS vs 45).
  3. Ablations
    • Removing backward/optimizer overlap cuts throughput by up to 38 %.
    • Disabling pipeline prefetch makes Fuyou only ~1.2–1.3× faster than ZeRO-Infinity; full pipeline raises this to 1.7–2.3×.
    • Auto-swap scheduler selects near-optimal D_f for batch sizes {32, 64, 80}, matching the empirical minimum iteration time.
  4. Cost-effectiveness (tokens / s / $)
    • Counting only compute + SSD hardware, Fuyou on 1 × 4090 + 6 SSDs delivers 1.7 × the tokens/s per dollar of a DGX-2 running Megatron-LM.
    • Whole-server cost (incl. CPU/motherboard) still reaches 75 % of DGX-2 cost-efficiency, despite using a single GPU.

Take-aways and limitations

  • NVMe SSDs are fast enough (~3–7 GB/s ea.) to act as an additional memory tier for fine-tuning if traffic is meticulously overlapped with compute.
  • Synchronising but overlapping an out-of-core optimizer avoids convergence issues of asynchronous updates while keeping the GPU busy.
  • The bottleneck shifts from DRAM to GPU memory once per-layer activations exceed what a 24 GB-class GPU can hold; future work includes tensor-slicing or unified-memory tricks to push beyond 276 B on consumer cards.
  • Multi-GPU extension (pipeline parallel + Fuyou’s off-load) is left for future research.

Fuyou therefore demonstrates that fine-tuning at the 100 B-parameter scale is no longer exclusive to data-centre hardware; with careful system design, a single reasonably priced workstation can train models an order of magnitude larger than previously possible on one GPU.

Authors (7)
  1. Changyue Liao
  2. Mo Sun
  3. Zihan Yang
  4. Kaiqi Chen
  5. Binhang Yuan
  6. Fei Wu
  7. Zeke Wang