LoHan: Low-Cost High-Performance Framework to Fine-Tune 100B Model on a Consumer GPU
Abstract: AI researchers are increasingly interested in fine-tuning pre-trained LLMs, whose sizes have grown beyond 100B parameters, for their downstream tasks. One approach to fine-tuning such huge models is to aggregate device memory across many GPUs, but this is prohibitively expensive for most data scientists with a limited budget for high-end GPU servers. In this paper, we focus on LLM fine-tuning on a single consumer-grade GPU in a commodity server with limited main memory capacity, a setup accessible to most AI researchers. In this scenario, existing offloading-based methods fail to fine-tune an LLM efficiently because they lack holistic management of intra-server tensor movement. To this end, we present LoHan, a low-cost, high-performance deep learning training framework that enables efficient 100B-scale model fine-tuning on a commodity server with a consumer-grade GPU and limited main memory. The key idea is to treat holistic offloading traffic as an optimization dimension, realized through 1) active gradient offloading and 2) a holistic traffic-aware activation swapping mechanism. Experimental results show that 1) LoHan is the first framework to fine-tune a 175B model on an RTX 4090 with 256 GB of main memory, 2) LoHan achieves 2.32x higher throughput than state-of-the-art baselines when fine-tuning a smaller 13B model, and 3) LoHan makes a cheap low-end consumer GPU more cost-effective than a DGX-A100 cluster when fine-tuning a 175B model.
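To make the offloading idea concrete, the sketch below shows the generic form of activation swapping in PyTorch using the stock `torch.autograd.graph.save_on_cpu` hook, which spills forward activations to pinned host memory and reloads them during the backward pass. This is only an illustration of the baseline technique under assumed placeholder model shapes, not LoHan's implementation; LoHan additionally co-schedules this activation traffic with gradient and optimizer-state offloading in a traffic-aware manner.

```python
# Minimal sketch of generic GPU-to-CPU activation swapping in PyTorch.
# NOT LoHan's mechanism: LoHan jointly schedules activation, gradient, and
# optimizer-state traffic, which this stock hook does not do.
# Assumes a CUDA-capable GPU; model size and batch size are arbitrary.
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# Keep saved activations in pinned CPU memory between forward and backward,
# trading GPU memory for PCIe transfer traffic.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()

loss.backward()  # activations are copied back to the GPU on demand
```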