Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading (2410.21316v1)
Abstract: Transformers and large language models (LLMs) have seen rapid adoption across all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, training transformers is very expensive and often hits a "memory wall": even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlap between data movements and computations, which leads to missed opportunities to simultaneously leverage the interconnect bandwidth and the computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation: the interleaving of the forward, backward, and update phases generates fluctuations in GPU memory utilization that can be exploited to dynamically move part of the optimizer state between host and GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique that splits the LLM into subgroups whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model, which captures the trade-off between data-movement cost, acceleration on GPUs vs. CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate, through extensive experiments, iterations up to 2.5× faster than state-of-the-art approaches.
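To make the placement idea in the abstract concrete, below is a minimal sketch (in Python) of how a per-subgroup CPU-vs-GPU decision for the optimizer update could look: each subgroup is moved to the GPU only if its optimizer state fits in the memory temporarily freed by the forward/backward fluctuations and the GPU update plus the host-to-device transfer is faster than updating on the CPU. All names, cost terms, and the greedy policy are illustrative assumptions, not the paper's actual performance model or implementation.

```python
# Hypothetical sketch of interleaved-offloading placement: decide, per subgroup,
# whether its optimizer update runs on the CPU (state stays in host memory) or on
# the GPU (state is shipped over the interconnect into temporarily free GPU memory).
from dataclasses import dataclass

@dataclass
class Subgroup:
    name: str
    state_bytes: int        # size of optimizer state (fp32 params, momentum, variance)
    cpu_update_s: float     # measured CPU update time for this subgroup
    gpu_update_s: float     # measured GPU update time for this subgroup

def schedule_updates(subgroups, free_gpu_bytes, h2d_bw_bytes_per_s):
    """Greedily place subgroups on the GPU while the transfer pays off and the
    freed GPU memory budget can hold their optimizer state."""
    placement = {}
    budget = free_gpu_bytes
    # Prefer subgroups with the largest net time saving per byte moved.
    ranked = sorted(
        subgroups,
        key=lambda g: (g.cpu_update_s - g.gpu_update_s) / max(g.state_bytes, 1),
        reverse=True,
    )
    for g in ranked:
        transfer_s = g.state_bytes / h2d_bw_bytes_per_s
        gpu_total_s = g.gpu_update_s + transfer_s  # move-in cost; move-out assumed overlapped
        if g.state_bytes <= budget and gpu_total_s < g.cpu_update_s:
            placement[g.name] = "gpu"
            budget -= g.state_bytes
        else:
            placement[g.name] = "cpu"
    return placement

# Example: 4 GB of transient free GPU memory, ~20 GB/s effective H2D bandwidth.
groups = [
    Subgroup("layers_0_7", 3 * 2**30, cpu_update_s=0.40, gpu_update_s=0.05),
    Subgroup("layers_8_15", 3 * 2**30, cpu_update_s=0.40, gpu_update_s=0.05),
]
print(schedule_updates(groups, free_gpu_bytes=4 * 2**30, h2d_bw_bytes_per_s=20e9))
```

In this toy run only the first subgroup fits in the 4 GB budget and is promoted to the GPU; the second stays on the CPU, which mirrors the partial, per-iteration movement of optimizer state the abstract describes.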