ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment (2403.10504v1)

Published 15 Mar 2024 in cs.DC and cs.SE

Abstract: The advent of the Transformer architecture has propelled the growth of NLP models, leading to remarkable achievements in numerous NLP tasks. Yet, the absence of specialized hardware such as expansive GPU memory and high-speed interconnects poses challenges for training large-scale models, making it daunting for many users to experiment with pre-training and fine-tuning LLMs. In this study, we introduce Atom, a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting using cost-effective hardware, including consumer-grade GPUs and Ethernet. Unlike conventional model partitioning methods that distribute sub-models across GPUs, Atom aims to accommodate a complete LLM on one host (peer) through seamless model swapping, and concurrently trains multiple copies across various peers to optimize training throughput. Through static analysis, Atom identifies the best model partitioning strategy and overlaps model execution with swapping. Key benefits of Atom include avoiding the central point of failure found in pipeline parallelism methods, and delivering superior performance and scalability compared to tightly coupled pipeline parallelism in slower networks. Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, Atom can improve training efficiency by up to 20× compared with state-of-the-art decentralized pipeline parallelism approaches.

Asynchronous Training of Massive Models in Decentralized Environments with Atom

Introduction to Atom

The continual growth of LLMs like GPT-3 necessitates an evolution in training methodologies, especially for entities lacking specialized hardware. Conventional distributed training approaches, while effective, demand substantial hardware resources and well-provisioned networks, limiting access for a broader user base. Atom sidesteps these restrictions by enabling the training of large models in decentralized settings on cost-effective hardware. Unlike standard partitioning that distributes a model across GPUs, Atom adopts a design in which each host (peer) accommodates a complete LLM through model swapping, while multiple replicas are trained concurrently across peers to increase aggregate throughput.

Challenges in LLM Training

The introduction of Transformer models has markedly advanced the capabilities of deep neural networks, enabling groundbreaking successes in NLP. However, training these models, given their sheer size, requires computational resources that outpace the growth of conventional hardware. Pre-training from scratch compounds the challenge further, motivating methodologies that make LLM training feasible without resorting to massive accelerator farms.

Atom's Approach to Distributed Training

Atom's infrastructure diverges from existing model and pipeline parallelism by housing a complete model within a server's host memory. This approach, novel in the context of distributed LLM training, leverages memory swapping to execute the model on a single GPU. Atom's design prioritizes keeping the GPU busy, carefully managing the trade-off between computation and memory swapping, as sketched below.
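
A minimal sketch of this general idea in PyTorch (illustrative only, not Atom's actual code): sub-models resident in host memory are prefetched to the GPU on a separate CUDA stream while the current sub-model computes, so swapping overlaps with execution. Names such as `partitions`, `prefetch`, and `run_partition` are hypothetical.

```python
# Illustrative sketch only: run a model larger than GPU memory by prefetching
# the next sub-model while the current one computes. True overlap additionally
# requires pinned host memory; error handling and backward pass are omitted.
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for host-to-GPU copies

def prefetch(partition):
    """Move a list of layers to the GPU asynchronously on the copy stream."""
    with torch.cuda.stream(copy_stream):
        return [layer.to("cuda", non_blocking=True) for layer in partition]

def run_partition(layers, x):
    for layer in layers:
        x = layer(x)
    return x

def swapped_forward(partitions, x):
    """Forward pass over CPU-resident partitions, overlapping compute and swap.

    `x` is assumed to already be a CUDA tensor.
    """
    current = prefetch(partitions[0])
    torch.cuda.current_stream().wait_stream(copy_stream)
    for i in range(len(partitions)):
        nxt = prefetch(partitions[i + 1]) if i + 1 < len(partitions) else None
        x = run_partition(current, x)                         # compute on the default stream
        torch.cuda.current_stream().wait_stream(copy_stream)  # make sure the prefetch finished
        for layer in current:                                 # evict the finished sub-model
            layer.to("cpu")
        current = nxt
    return x
```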

Characterization of GPT-3 for Atom

Critical to Atom's approach is a detailed profiling of the GPT-3 model to understand its memory and execution demands. Through profiling, Atom determines that even the most memory-intensive layers of GPT-3 fit within a single consumer-grade GPU. This finding underpins Atom's strategy of fitting individual operators/layers within GPU memory, avoiding the need for extensive model partitioning; a simplified version of such per-layer profiling is sketched below.
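
As a rough illustration of this kind of characterization (a sketch, not the paper's profiler), one can measure the execution time and peak GPU memory of a single Transformer block to check that it fits on a consumer-grade GPU. The block dimensions and batch shape below are illustrative, not GPT-3's.

```python
# Illustrative per-layer profiling: time and peak GPU memory for one block.
import time
import torch
import torch.nn as nn

def profile_layer(layer: nn.Module, sample: torch.Tensor):
    layer, sample = layer.cuda(), sample.cuda()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = layer(sample)
    out.sum().backward()          # include backward-pass activations and gradients
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return elapsed, peak_mib

# Example: one GPT-style encoder block with illustrative dimensions.
block = nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True)
t, mem = profile_layer(block, torch.randn(4, 512, 2048))
print(f"forward+backward: {t:.3f}s, peak GPU memory: {mem:.0f} MiB")
```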

Streamlining Memory Swapping

Atom addresses the traditional overhead of memory swapping by computing a schedule that aligns model execution with swapping. This involves extending the forward propagation phase, via gradient accumulation, so that it matches sub-model loading times. Particularly notable is Atom's handling of the embedding layer, a large but computationally light component, which is scheduled so that it is used efficiently without impeding performance.
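
The intuition behind this overlap can be captured with simple arithmetic (an assumed back-of-the-envelope rule, not the paper's static analysis): pick enough gradient-accumulation micro-batches that computing the current sub-model fully hides loading the next one over the host-to-GPU link.

```python
# Assumed illustration of the overlap condition, not Atom's scheduler.
import math

def microbatches_to_hide_swap(compute_s_per_microbatch: float,
                              next_submodel_bytes: float,
                              link_bytes_per_s: float) -> int:
    """Smallest number of accumulation steps whose compute time hides the swap."""
    swap_s = next_submodel_bytes / link_bytes_per_s
    return max(1, math.ceil(swap_s / compute_s_per_microbatch))

# Example with illustrative numbers: a 6 GiB sub-model over a ~12 GiB/s link
# and 0.4 s of compute per micro-batch needs at least 2 accumulation steps.
print(microbatches_to_hide_swap(0.4, 6 * 2**30, 12 * 2**30))  # -> 2
```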

Implementation Insights

Implemented in PyTorch and leveraging Hivemind for decentralized coordination, Atom encapsulates model tracing, partitioning, and compilation into a streamlined process. This process divides the model into sub-models for independent training across peers, with replicas synchronized through periodic allreduce communication.
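
The periodic synchronization step can be sketched generically as parameter averaging across peers. The sketch below uses torch.distributed purely to show the pattern; Atom itself coordinates peers through Hivemind, and the function name and sync interval are assumptions.

```python
# Generic illustration of periodic parameter averaging between replicas;
# not Atom's Hivemind-based implementation.
import torch
import torch.distributed as dist

def maybe_average_parameters(model: torch.nn.Module, step: int, sync_every: int = 100):
    """Every `sync_every` local steps, allreduce-average parameters across peers."""
    if step % sync_every != 0:
        return
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            dist.all_reduce(param, op=dist.ReduceOp.SUM)  # sum parameters across peers
            param.div_(world_size)                        # then average in place
```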

Evaluation and Findings

Empirical assessments underscore Atom's advantage in scenarios constrained by suboptimal network conditions, showing up to a 20× improvement in training throughput over decentralized pipeline-parallelism methods. The evaluations also confirm Atom's scalability and its ability to maintain convergence under dynamic changes such as node failures and varying network conditions.

Concluding Remarks

Atom emerges as a robust framework for the asynchronous training of large-scale models in decentralized environments, mitigating the steep hardware requirements traditionally associated with such tasks. It demonstrates practical scalability and efficiency while keeping training effectiveness intact, paving the way for broader access to high-quality AI model training.

Authors: Xiaofeng Wu, Jia Rao, Wei Chen