HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models (2405.16256v2)

Published 25 May 2024 in cs.DC and cs.AI

Abstract: Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs. Building a large-scale cluster with a single type of GPU-accelerator is challenging, and using multiple types of GPU-accelerators to construct one is an effective way to address the shortage of homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models support only homogeneous GPU-accelerators, not heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models that supports heterogeneous clusters, including AMD, Nvidia, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize communication between heterogeneous GPU-accelerators, a distributed performance predictor, and an automatic parallel planner to develop and train models efficiently on heterogeneous GPU-accelerators. Compared with distributed training systems restricted to homogeneous GPU-accelerators, our system supports six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD and 640 of GPU-accelerator A). The experimental results show that the optimal performance of our system in the heterogeneous cluster reaches up to 97.49% of the theoretical upper-bound performance.
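
The "distributed unified communicator" is the piece that lets accelerators from different vendors exchange gradients even though their native collective libraries (NCCL on Nvidia, RCCL on AMD) do not interoperate. As a rough illustration of the idea, and not the authors' implementation, the sketch below builds a hierarchical all-reduce in PyTorch: each vendor "island" reduces with its native backend, and one leader rank per island crosses the vendor boundary over a CPU-side gloo group. The class name, the vendor_of_rank mapping, and the leader-based bridging scheme are assumptions made for illustration.

```python
# Minimal sketch (not HETHUB's code) of a unified communicator that bridges
# vendor-specific collective libraries. Intra-vendor groups use the native
# backend (on ROCm builds of PyTorch the "nccl" backend is backed by RCCL),
# while one leader per vendor joins a CPU-side gloo group for the cross-vendor hop.
import torch.distributed as dist


class UnifiedCommunicator:
    def __init__(self, vendor_of_rank):
        # vendor_of_rank: list mapping global rank -> vendor tag, e.g. ["nvidia", "amd", ...]
        # Assumes dist.init_process_group(...) has already been called by every rank.
        self.rank = dist.get_rank()
        self.vendor = vendor_of_rank[self.rank]
        vendors = sorted(set(vendor_of_rank))

        # One process group per vendor "island", backed by that vendor's collectives.
        # new_group must be called collectively by every rank, even non-members.
        island_groups = {}
        for v in vendors:
            ranks = [r for r, tag in enumerate(vendor_of_rank) if tag == v]
            island_groups[v] = dist.new_group(ranks=ranks, backend="nccl")
        self.island_group = island_groups[self.vendor]
        self.island_leader = min(
            r for r, tag in enumerate(vendor_of_rank) if tag == self.vendor
        )

        # One leader rank per vendor forms a small cross-vendor "bridge" group on
        # gloo (CPU tensors over TCP/IP), since NCCL and RCCL cannot talk to each other.
        leaders = [min(r for r, tag in enumerate(vendor_of_rank) if tag == v) for v in vendors]
        self.is_leader = self.rank == self.island_leader
        self.bridge_group = dist.new_group(ranks=leaders, backend="gloo")

    def all_reduce(self, tensor):
        # 1) Sum within the local vendor island using its native collective library.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=self.island_group)

        # 2) Island leaders exchange partial sums through the CPU-side bridge.
        if self.is_leader:
            cpu_buf = tensor.detach().cpu()
            dist.all_reduce(cpu_buf, op=dist.ReduceOp.SUM, group=self.bridge_group)
            tensor.copy_(cpu_buf.to(tensor.device))

        # 3) Each leader broadcasts the global sum back to its island.
        dist.broadcast(tensor, src=self.island_leader, group=self.island_group)
        return tensor
```

In a real deployment the cross-vendor hop would more likely use a higher-bandwidth transport such as RDMA or IPoIB, and the communicator would also need point-to-point sends for pipeline parallelism; the gloo bridge above is just the simplest portable stand-in for the concept.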

Authors (14)
  1. Si Xu (1 paper)
  2. Zixiao Huang (7 papers)
  3. Yan Zeng (46 papers)
  4. Shengen Yan (26 papers)
  5. Xuefei Ning (52 papers)
  6. Haolin Ye (7 papers)
  7. Sipei Gu (1 paper)
  8. Chunsheng Shui (1 paper)
  9. Zhezheng Lin (1 paper)
  10. Hao Zhang (948 papers)
  11. Sheng Wang (239 papers)
  12. Guohao Dai (51 papers)
  13. Yu Wang (939 papers)
  14. Quanlu Zhang (14 papers)