Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU (2403.06504v1)

Published 11 Mar 2024 in cs.DC

Abstract: Recent advances in LLMs have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they utilize. However, even the GPUs with the highest memory capacities, currently peaking at 80GB, are far from sufficient to accommodate these vast parameters and their associated optimizer states when conducting stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs. However, this approach introduces prohibitive costs for most academic researchers, who always have a limited budget for many high-end GPU servers. In this paper, we focus on huge model fine-tuning on a single, even low-end, GPU in a commodity server, which is accessible to most AI researchers. In such a scenario, the state-of-the-art work ZeRO-Infinity suffers from two severe issues when running in a commodity server: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient 100B huge model fine-tuning on a low-end server with a low-end GPU and limited CPU memory capacity. The key idea is to add the SSD-CPU communication as an optimization dimension and thus carefully co-optimize computation and data swapping from a systematic approach to maximize GPU utilization. The experimental results show that 1) Fuyou is able to fine-tune 175B GPT-3 on a consumer GPU RTX 4090 with high GPU utilization, while ZeRO-Infinity fails to fine-tune; and 2) when training a small GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS.

Large-scale fine-tuning typically demands clusters of expensive GPUs because the optimizer states and activations of 100 B-parameter models exceed the 80 GB ceiling of today’s biggest devices. “Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU” introduces Fuyou, a PyTorch-based training framework that turns an off-the-shelf workstation (e.g., one RTX 4090, ≤768 GB DRAM, a handful of NVMe SSDs) into a platform capable of high-throughput fine-tuning of models as large as GPT-3-175 B—something state-of-the-art engines such as ZeRO-Infinity or Colossal-AI cannot do on similar hardware.
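To make the memory gap concrete, a quick back-of-the-envelope estimate is useful. The numbers below are illustrative rather than taken from the paper, and assume the usual mixed-precision Adam bookkeeping of roughly 16 bytes of weight, gradient, and optimizer state per parameter:

```python
# Rough, illustrative estimate of resident training state under mixed-precision Adam.
# Assumes ~16 bytes/parameter: fp16 weights (2) + fp16 gradients (2) +
# fp32 master weights (4) + fp32 momentum (4) + fp32 variance (4).
# This follows common ZeRO-style accounting; it is not a figure from the paper.

def training_state_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Approximate size of weights + gradients + Adam states, in GiB."""
    return num_params * bytes_per_param / 2**30

for name, n in [("GPT-3 13B", 13e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: ~{training_state_gib(n):,.0f} GiB of training state")
# GPT-3 13B:  ~194 GiB   (more than twice an 80 GB GPU, before any activations)
# GPT-3 175B: ~2,608 GiB (far beyond GPU memory, and beyond most servers' DRAM)
```

Even before counting activations, the weights, gradients, and optimizer states alone force model state off the GPU and, once DRAM runs out, onto NVMe SSDs.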


Key problems with existing off-loading systems

  1. CPU-memory bottleneck – ZeRO-Infinity keeps activations in DRAM; with ≤512 GB of host RAM it cannot even fit 65 B parameters.
  2. Idle GPU time – ZeRO-Infinity runs its out-of-core Adam step only after the backward pass finishes, leaving the GPU idle for 40–70% of each step and achieving under 30% utilisation on a single A100.

Fuyou: system overview

The central idea is to treat SSD ⇆ CPU traffic as a first-class optimisation axis and co-design swapping, prefetching and computation so that all five resources—GPU compute, CPU compute, GPU↔CPU PCIe, CPU↔SSD bandwidth, SSD I/O—remain simultaneously busy.

Component pipeline

  1. Profiling phase
    • Runs one step with everything swapped to SSD; records per-layer FLOPs, activation/parameter sizes, GPU/CPU times and link bandwidths.
  2. GPU-aware FIFO prefetcher
    • Uses remaining GPU memory (after model/mini-batch allocations) as a sliding window buffer; pulls next parameters/activations as soon as space frees up, enabling continuous overlap of PCIe, SSD I/O and kernels.
  3. Synchronous out-of-core CPU Adam overlapped with backward
    • Gradients streamed from GPU are immediately consumed by a separate CPU process; weight updates are synchronous (no delayed update, so convergence unchanged) yet hidden behind subsequent layer back-props.
    • Delayed write-back: reading group i+1 from SSD overlaps writing group i, maximising SSD duplex bandwidth.
  4. GPU-CPU-SSD three-level activation swapping
    • Activations initially leave GPU to DRAM; if DRAM pressure arises they are flushed to SSDs in a fully pipelined fashion.
  5. Automatic activation-swap scheduler
    • Cost model predicts the iteration time as T_iter = T_fwd + T_bwd+opt, where each phase time is the maximum of its compute, PCIe, and SSD-I/O terms.
    • Search space: swap coefficient D_f (bytes of activations written per step).
    • Upper bound: limited by (a) free GPU memory; (b) overlap window T_max = T_bcomp − max(T_PCIe, T_SSD).
    • Priority order: layers ranked by “swap-benefit factor” SBF = FLOPs / SwapTime; linear_4h_to_h gets highest priority.
    • Iteratively increases D_f until the predicted T_iter stops decreasing (a sketch of this search loop follows this list).
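A minimal sketch of how such a cost-model-driven search could look is given below. All names are hypothetical, the per-phase costs are collapsed into single aggregate terms, and it adopts the SBF-style assumption that every swapped activation byte saves its recomputation time during the backward pass; the paper's scheduler additionally bounds D_f by free GPU memory and the overlap window, and spends the swap budget in SBF order:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Per-iteration figures from the profiling phase (field names are hypothetical)."""
    t_fwd_compute: float         # forward GPU compute time (s)
    t_bwd_opt_compute: float     # backward + overlapped CPU-Adam time, no swapping (s)
    pcie_bw: float               # effective GPU<->CPU PCIe bandwidth (bytes/s)
    ssd_bw: float                # aggregate CPU<->SSD bandwidth (bytes/s)
    base_pcie_bytes: float       # parameter/gradient traffic that moves regardless of D_f
    base_ssd_bytes: float
    recompute_s_per_byte: float  # backward recompute time avoided per swapped byte

def predicted_iter_time(p: Profile, d_f: float) -> float:
    """Cost model in the spirit of T_iter = T_fwd + T_bwd+opt, with each phase
    bounded by the slowest of its compute, PCIe, and SSD components."""
    t_fwd = max(p.t_fwd_compute,
                (p.base_pcie_bytes + d_f) / p.pcie_bw,   # activations written out
                (p.base_ssd_bytes + d_f) / p.ssd_bw)
    t_bwd = max(p.t_bwd_opt_compute - p.recompute_s_per_byte * d_f,  # less recomputation
                (p.base_pcie_bytes + d_f) / p.pcie_bw,   # activations read back
                (p.base_ssd_bytes + d_f) / p.ssd_bw)
    return t_fwd + t_bwd

def choose_swap_coefficient(p: Profile, d_f_cap: float, step: float) -> float:
    """Greedily grow D_f while the predicted iteration time keeps improving.
    d_f_cap stands in for the paper's bounds (free GPU memory, overlap window)."""
    best_d_f, best_t = 0.0, predicted_iter_time(p, 0.0)
    d_f = step
    while d_f <= d_f_cap:
        t = predicted_iter_time(p, d_f)
        if t >= best_t:
            break
        best_d_f, best_t = d_f, t
        d_f += step
    return best_d_f
```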

Implementation details

  • Written in pure PyTorch 2.0; uses CUDA events for cross-process sync.
  • Tested on 12× Intel P5510 SSDs (PCIe 4.0) but works with as few as 2; RAID not required.
  • Requires no GPUDirect-Storage; runs on consumer boards.
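The copy/compute overlap that the bullets above rely on can be expressed with standard PyTorch primitives. The sketch below is purely illustrative (hypothetical shard sizes and loop structure, not Fuyou's code): it stages parameter shards in pinned host memory and uses a side CUDA stream plus events so the next shard's host-to-device copy runs while the current shard is being consumed:

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream dedicated to host->device copies

def prefetch(pinned_cpu: torch.Tensor):
    """Launch an async H2D copy on the side stream; returns the GPU tensor and an
    event that fires once the copy completes. The source must live in pinned memory."""
    with torch.cuda.stream(copy_stream):
        on_gpu = pinned_cpu.to(device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    return on_gpu, done

# Hypothetical per-layer parameter shards staged in pinned host memory.
cpu_shards = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]

x = torch.randn(4096, 4096, device=device)
shard, ready = prefetch(cpu_shards[0])
for i in range(len(cpu_shards)):
    # Kick off the next copy before touching the current shard.
    if i + 1 < len(cpu_shards):
        nxt, nxt_ready = prefetch(cpu_shards[i + 1])
    # The compute (default) stream waits only for this shard's copy.
    torch.cuda.current_stream().wait_event(ready)
    x = x @ shard                   # stand-in for the layer's actual compute
    if i + 1 < len(cpu_shards):
        shard, ready = nxt, nxt_ready
torch.cuda.synchronize()
```

Fuyou layers the SSD-to-CPU leg and a separate out-of-core Adam process on top of this kind of pattern so that all of the links listed earlier stay busy at once.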

Experimental results

Hardware: 1× A100-80 GB or 1× RTX 4090 (24 GB), 768 GB DDR4, up to 12 NVMe SSDs.

  1. Maximum trainable size (batch = 1)
    • With 768 GB DRAM, Fuyou trains GPT-3-805 B on A100 and 276 B on RTX 4090.
    • ZeRO-Infinity tops out at 135 B on both the A100 and the RTX 4090 with the same DRAM, and fails at 65 B if DRAM < 512 GB.
  2. Throughput (TFLOPS, higher is better)
    • GPT-3-175 B, batch = 16 → 172 TFLOPS on A100 (86 % of peak), 87 TFLOPS on RTX 4090.
    • GPT-3-13 B, batch = 32 → 202 TFLOPS (A100) vs 59 TFLOPS (ZeRO-Offload), 45 TFLOPS (ZeRO-Infinity), 30 TFLOPS (Colossal-AI).
    • 3.4 × speed-up over ZeRO-Infinity on RTX 4090 (156 TFLOPS vs 45).
  3. Ablations
    • Removing backward/optimizer overlap cuts throughput by up to 38 %.
    • Disabling pipeline prefetch makes Fuyou only ~1.2–1.3× faster than ZeRO-Infinity; full pipeline raises this to 1.7–2.3×.
    • Auto-swap scheduler selects near-optimal D_f for batch sizes {32, 64, 80}, matching the empirical minimum iteration time.
  4. Cost-effectiveness (tokens / s / $)
    • Counting only compute + SSD hardware, Fuyou on 1 × 4090 + 6 SSDs delivers 1.7 × the tokens/s per dollar of a DGX-2 running Megatron-LM.
    • Whole-server cost (incl. CPU/motherboard) still reaches 75 % of DGX-2 cost-efficiency, despite using a single GPU.

Take-aways and limitations

  • NVMe SSDs are fast enough (~3–7 GB/s ea.) to act as an additional memory tier for fine-tuning if traffic is meticulously overlapped with compute.
  • Synchronising but overlapping an out-of-core optimizer avoids convergence issues of asynchronous updates while keeping the GPU busy.
  • The bottleneck shifts from DRAM to GPU memory once per-layer activations exceed what a 24 GB-class GPU can hold; future work includes tensor-slicing or unified-memory tricks to push beyond 276 B on consumer cards.
  • Multi-GPU extension (pipeline parallel + Fuyou’s off-load) is left for future research.

Fuyou therefore demonstrates that fine-tuning at the 100 B-parameter scale is no longer exclusive to data-centre hardware; with careful system design, a single reasonably priced workstation can train models an order of magnitude larger than previously possible on one GPU.

Authors (7)
  1. Changyue Liao
  2. Mo Sun
  3. Zihan Yang
  4. Kaiqi Chen
  5. Binhang Yuan
  6. Fei Wu
  7. Zeke Wang