LoHan: Low-Cost High-Performance Framework to Fine-Tune 100B Model on a Consumer GPU

Published 11 Mar 2024 in cs.DC (arXiv:2403.06504v2)

Abstract: AI researchers are increasingly interested in fine-tuning pre-trained LLMs, whose sizes have grown to over 100B parameters, for their downstream tasks. One approach to fine-tuning such huge models is to aggregate device memory from many GPUs, but this introduces prohibitive costs for most data scientists with a limited budget for high-end GPU servers. In this paper, we focus on LLM fine-tuning on a single consumer-grade GPU in a commodity server with limited main memory capacity, a setting accessible to most AI researchers. In this scenario, existing offloading-based methods fail to fine-tune an LLM efficiently because they lack holistic intra-server tensor movement management. To this end, we present LoHan, a low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server with a consumer-grade GPU and limited main memory capacity. The key idea is to treat holistic offloading traffic as an optimization dimension, via 1) active gradient offloading and 2) a holistic traffic-aware activation swapping mechanism. The experimental results show that 1) LoHan is the first to fine-tune a 175B model on an RTX 4090 with 256 GB main memory, 2) LoHan achieves 2.32x higher throughput than state-of-the-art baselines when fine-tuning a smaller 13B model, and 3) LoHan makes a cheap, low-end consumer GPU more cost-effective than a DGX-A100 cluster when fine-tuning a 175B model.


Summary

  • The paper introduces Fuyou (the framework named LoHan in the paper's current revision), a PyTorch framework that uses NVMe SSDs as an extra memory tier to fine-tune 100B-scale models on a single GPU.
  • The paper demonstrates a novel method overlapping GPU compute, PCIe, and SSD I/O that achieves up to 1.7× speed improvements and high TFLOPS performance.
  • The paper shows that careful resource orchestration enables cost-effective, large-scale model training on affordable hardware, challenging data-center exclusivity.

Large-scale fine-tuning typically demands clusters of expensive GPUs because the optimizer states and activations of 100B-parameter models far exceed the 80 GB memory ceiling of today's largest accelerators. The paper, earlier titled "Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU", introduces Fuyou, a PyTorch-based training framework that turns an off-the-shelf workstation (e.g., one RTX 4090, ≤768 GB DRAM, and a handful of NVMe SSDs) into a platform for high-throughput fine-tuning of models as large as GPT-3-175B, something state-of-the-art engines such as ZeRO-Infinity or Colossal-AI cannot do on comparable hardware.


Key problems with existing off-loading systems

  1. CPU-memory bottleneck – ZeRO-Infinity keeps activations in DRAM; with ≤512 GB host RAM it cannot even fit 65B parameters.
  2. Idle GPU time – ZeRO-Infinity performs out-of-core Adam after the backward pass, leaving the GPU idle for 40–70% of each step and achieving <30% utilisation on a single A100.

Fuyou: system overview

The central idea is to treat SSD ⇆ CPU traffic as a first-class optimisation axis and co-design swapping, prefetching and computation so that all five resources—GPU compute, CPU compute, GPU↔CPU PCIe, CPU↔SSD bandwidth, SSD I/O—remain simultaneously busy.

Component pipeline

  1. Profiling phase
    • Runs one step with everything swapped to SSD; records per-layer FLOPs, activation/parameter sizes, GPU/CPU times and link bandwidths.
  2. GPU-aware FIFO prefetcher
    • Uses remaining GPU memory (after model/mini-batch allocations) as a sliding-window buffer; pulls the next parameters/activations as soon as space frees up, enabling continuous overlap of PCIe, SSD I/O and kernels (a generic PyTorch sketch of this overlap pattern appears after the implementation notes below).
  3. Synchronous out-of-core CPU Adam overlapped with backward
    • Gradients streamed from GPU are immediately consumed by a separate CPU process; weight updates are synchronous (no delayed update, so convergence unchanged) yet hidden behind subsequent layer back-props.
    • Delayed write-back: reading group i+1 from SSD overlaps writing group i back, maximising SSD duplex bandwidth (a simplified out-of-core Adam pipeline is sketched after this list).
  4. GPU-CPU-SSD three-level activation swapping
    • Activations initially leave GPU to DRAM; if DRAM pressure arises they are flushed to SSDs in a fully pipelined fashion.
  5. Automatic activation-swap scheduler
    • A cost model predicts the iteration time T_iter = T_fwd + T_bwd+opt, where each term is the maximum of its compute, PCIe and SSD times.
    • Search space: swap coefficient D_f (bytes of activations written per step).
    • Upper bound: limited by (a) free GPU memory; (b) overlap window T_max = T_bcomp − max(T_PCIe, T_SSD).
    • Priority order: layers ranked by “swap-benefit factor” SBF = FLOPs / SwapTime; linear_4h_to_h gets highest priority.
    • Iteratively increases D_f until the predicted T_iter stops decreasing (a toy version of this greedy search is also sketched after this list).
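
To make step 3 concrete, here is a minimal sketch (not Fuyou's actual code) of a synchronous out-of-core Adam pass that pipelines the work over parameter groups with a small thread pool: while group i is updated on the CPU, group i+1 is read from disk and group i-1's write-back finishes in the background. The file layout, group sizes, hyperparameters, and the temporary directory standing in for an NVMe SSD are all illustrative assumptions.

```python
# Illustrative sketch only: a synchronous out-of-core Adam step that overlaps
# SSD reads, CPU updates, and SSD write-backs across parameter groups.
# Group files, sizes, and hyperparameters are assumptions, not Fuyou's code.
import os, tempfile
from concurrent.futures import ThreadPoolExecutor
import torch

ssd_dir = tempfile.mkdtemp()          # stands in for an NVMe SSD mount
num_groups, group_size = 4, 1 << 20   # toy sizes
lr, beta1, beta2, eps = 1e-4, 0.9, 0.999, 1e-8

# Create toy parameter/optimizer-state shards on the "SSD".
for g in range(num_groups):
    torch.save({"p": torch.randn(group_size),
                "m": torch.zeros(group_size),
                "v": torch.zeros(group_size)},
               os.path.join(ssd_dir, f"group{g}.pt"))

def load_group(g):                    # SSD -> DRAM
    return torch.load(os.path.join(ssd_dir, f"group{g}.pt"))

def store_group(g, state):            # DRAM -> SSD (delayed write-back)
    torch.save(state, os.path.join(ssd_dir, f"group{g}.pt"))

def adam_update(state, grad):         # synchronous CPU Adam on one group
    state["m"].mul_(beta1).add_(grad, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    state["p"].addcdiv_(state["m"], state["v"].sqrt().add_(eps), value=-lr)
    return state

# In reality these gradients are streamed from the GPU during the backward pass.
grads = [torch.randn(group_size) for _ in range(num_groups)]

with ThreadPoolExecutor(max_workers=2) as io:
    next_read = io.submit(load_group, 0)
    pending_write = None
    for g in range(num_groups):
        state = next_read.result()                        # wait for group g
        if g + 1 < num_groups:
            next_read = io.submit(load_group, g + 1)      # read g+1 while updating g
        state = adam_update(state, grads[g])
        if pending_write is not None:
            pending_write.result()                        # previous write-back done
        pending_write = io.submit(store_group, g, state)  # write back g in background
    pending_write.result()
```

In Fuyou the gradients arrive streamed from the GPU during the backward pass rather than being pre-materialised, which is what lets the optimizer hide behind back-propagation while keeping the update synchronous.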
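
Step 5's greedy search can be illustrated with a toy cost model: layers are ranked by a swap-benefit factor (recompute FLOPs saved per second of swap time), and the swapped set, i.e. the swap coefficient D_f, grows while the predicted T_iter keeps improving. All per-layer FLOPs, activation sizes, and bandwidths below are invented; this is a sketch of the idea, not the paper's scheduler.

```python
# Toy sketch of the swap-coefficient search described in step 5.
# Layer FLOPs, activation sizes, and bandwidths are invented assumptions.
from dataclasses import dataclass

GPU_FLOPS = 80e12    # sustained GPU throughput (FLOP/s), assumed
PCIE_BW   = 24e9     # GPU<->CPU bandwidth (B/s), assumed
SSD_BW    = 12e9     # aggregate SSD bandwidth (B/s), assumed

@dataclass
class Layer:
    name: str
    fwd_flops: float   # FLOPs to (re)compute this layer's activations
    act_bytes: float   # bytes of activations it produces

layers = [Layer("linear_4h_to_h", 8e12, 2e9),
          Layer("linear_h_to_4h", 8e12, 8e9),
          Layer("attention",      4e12, 6e9)]

def predicted_iter_time(swapped):
    """T_iter = T_fwd + T_bwd+opt, each the max of compute, PCIe and SSD time."""
    recompute  = sum(l.fwd_flops for l in layers if l not in swapped)
    swap_bytes = sum(l.act_bytes for l in swapped)
    t_fwd = max(sum(l.fwd_flops for l in layers) / GPU_FLOPS,
                swap_bytes / PCIE_BW, swap_bytes / SSD_BW)
    t_bwd = max((2 * sum(l.fwd_flops for l in layers) + recompute) / GPU_FLOPS,
                swap_bytes / PCIE_BW, swap_bytes / SSD_BW)
    return t_fwd + t_bwd

# Rank layers by swap-benefit factor: recompute FLOPs saved per second of swap time.
by_sbf = sorted(layers,
                key=lambda l: l.fwd_flops / (l.act_bytes / min(PCIE_BW, SSD_BW)),
                reverse=True)

# Greedily grow the swapped set (i.e. D_f) while the predicted T_iter improves.
chosen, best = [], predicted_iter_time([])
for layer in by_sbf:
    t = predicted_iter_time(chosen + [layer])
    if t >= best:
        break
    chosen, best = chosen + [layer], t

print("swap:", [l.name for l in chosen], f"predicted T_iter = {best:.3f}s")
```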

Implementation details

  • Written in pure PyTorch 2.0; uses CUDA events for cross-process sync.
  • Tested on 12× Intel P5510 SSDs (PCIe 4.0) but works with as few as 2; RAID not required.
  • Requires no GPUDirect-Storage; runs on consumer boards.
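
The sliding-window prefetcher (step 2 above) and the "pure PyTorch plus CUDA events" implementation notes boil down to a standard overlap idiom: stage the next layer's tensors in pinned host memory, copy them to the GPU on a side stream while the current layer computes, and order the hand-off with events. The sketch below shows that generic pattern with toy weight matrices; it is a minimal illustration, not Fuyou's implementation.

```python
# Generic PyTorch overlap pattern (illustrative, not Fuyou's code):
# copy the next layer's weights on a side stream while the current layer runs.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Toy "model": weight matrices kept in pinned host memory (stand-in for the DRAM/SSD tier).
host_weights = [torch.randn(4096, 4096).pin_memory() for _ in range(8)]
x = torch.randn(16, 4096, device=device)

ready = [torch.cuda.Event() for _ in host_weights]
gpu_weights = [None] * len(host_weights)

# Stage layer 0 before the loop.
with torch.cuda.stream(copy_stream):
    gpu_weights[0] = host_weights[0].to(device, non_blocking=True)
    ready[0].record(copy_stream)

for i in range(len(host_weights)):
    # Prefetch layer i+1 on the side stream while layer i computes on the default stream.
    if i + 1 < len(host_weights):
        with torch.cuda.stream(copy_stream):
            gpu_weights[i + 1] = host_weights[i + 1].to(device, non_blocking=True)
            ready[i + 1].record(copy_stream)
    torch.cuda.current_stream().wait_event(ready[i])  # compute waits only for its own weights
    x = x @ gpu_weights[i]
    gpu_weights[i].record_stream(torch.cuda.current_stream())
    gpu_weights[i] = None                             # free GPU memory for the sliding window

torch.cuda.synchronize()
print(x.shape)
```

Because the copies originate from pinned memory and run on a dedicated stream, they proceed concurrently with kernels on the default stream, and the per-layer events keep compute from racing ahead of weights that have not arrived yet.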

Experimental results

Hardware: 1× A100-80 GB or 1× RTX 4090 (24 GB), 768 GB DDR4, up to 12 NVMe SSDs.

  1. Maximum trainable size (batch = 1)
    • With 768 GB DRAM, Fuyou trains GPT-3-805B on an A100 and 276B on an RTX 4090.
    • ZeRO-Infinity tops out at 135B (A100) and 135B (4090) with the same DRAM; it fails at 65B if DRAM < 512 GB.
  2. Throughput (TFLOPS, higher is better)
    • GPT-3-175B, batch = 16 → 172 TFLOPS on A100 (86% of peak), 87 TFLOPS on RTX 4090.
    • GPT-3-13B, batch = 32 → 202 TFLOPS (A100) vs 59 TFLOPS (ZeRO-Offload), 45 TFLOPS (ZeRO-Infinity), 30 TFLOPS (Colossal-AI).
    • 3.4× speed-up over ZeRO-Infinity on RTX 4090 (156 TFLOPS vs 45).
  3. Ablations
    • Removing backward/optimizer overlap cuts throughput by up to 38 %.
    • Disabling pipeline prefetch makes Fuyou only ~1.2–1.3× faster than ZeRO-Infinity; full pipeline raises this to 1.7–2.3×.
    • Auto-swap scheduler selects near-optimal D_f for batch sizes {32, 64, 80}, matching the empirical minimum iteration time.
  4. Cost-effectiveness (tokens / s / $)
    • Counting only compute + SSD hardware, Fuyou on 1× 4090 + 6 SSDs delivers 1.7× the tokens/s per dollar of a DGX-2 running Megatron-LM (the metric itself is illustrated after this list).
    • Whole-server cost (incl. CPU/motherboard) still reaches 75% of DGX-2 cost-efficiency, despite using a single GPU.
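
For reference, the cost-effectiveness metric is simply sustained throughput divided by hardware price. The snippet below illustrates the arithmetic with entirely hypothetical throughputs and prices; these are not the paper's measurements or real market figures.

```python
# Hypothetical numbers purely to illustrate the tokens/s/$ metric; they are not
# the paper's measurements or real prices.
def tokens_per_second_per_dollar(tokens_per_s: float, hardware_cost_usd: float) -> float:
    return tokens_per_s / hardware_cost_usd

workstation = tokens_per_second_per_dollar(tokens_per_s=50.0,    hardware_cost_usd=5_000.0)
dgx_cluster = tokens_per_second_per_dollar(tokens_per_s=2_000.0, hardware_cost_usd=400_000.0)
print(f"workstation: {workstation:.4f}, DGX: {dgx_cluster:.4f} tokens/s/$")
```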

Take-aways and limitations

  • NVMe SSDs are fast enough (~3–7 GB/s each) to act as an additional memory tier for fine-tuning, provided their traffic is meticulously overlapped with compute.
  • Synchronising but overlapping an out-of-core optimizer avoids convergence issues of asynchronous updates while keeping the GPU busy.
  • The bottleneck shifts from DRAM capacity to GPU memory once a single layer's activations no longer fit in a 24 GB-class GPU; future work includes tensor slicing or unified-memory techniques to push beyond 276B on consumer cards.
  • Multi-GPU extension (pipeline parallel + Fuyou’s off-load) is left for future research.

Fuyou therefore demonstrates that 100B-scale fine-tuning is no longer exclusive to data-centre hardware; with careful system design, a single reasonably priced workstation can train models an order of magnitude larger than previously possible.
