HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models (2405.16256v2)
Abstract: Training large-scale models relies on a vast amount of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs. It is challenging to build a large-scale cluster with a single type of GPU-accelerator, and using multiple types of GPU-accelerators to construct a cluster is an effective way to address the shortage of homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models only support homogeneous GPU-accelerators and do not support heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models that supports heterogeneous clusters, including AMD, Nvidia, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize communication between heterogeneous GPU-accelerators, a distributed performance predictor, and an automatic parallel planner to develop and train models efficiently on heterogeneous GPU-accelerators. Compared with distributed training systems for homogeneous GPU-accelerators, our system supports six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD GPUs and 640 GPU-accelerators of type A). The experimental results show that the optimal performance of our system on the heterogeneous cluster reaches up to 97.49% of the theoretical upper-bound performance.
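To make the interplay between the performance predictor and the automatic parallel planner more concrete, below is a minimal sketch, not taken from HETHUB's code, of how predicted per-layer times on different accelerator types could be turned into a pipeline-stage layer split. All names, numbers, and the proportional-allocation heuristic are illustrative assumptions.

```python
"""Sketch: balance pipeline stages across heterogeneous GPU-accelerators.

Given per-layer compute times predicted for each device type (the role a
distributed performance predictor would play), assign transformer layers
to stages inversely proportional to those times so that all stages finish
at roughly the same moment. Hypothetical example, not HETHUB's planner.
"""

from dataclasses import dataclass


@dataclass
class Stage:
    name: str             # device-type label, e.g. "AMD" or "accelerator-A" (hypothetical)
    layer_time_ms: float  # predicted time per transformer layer on this device type


def plan_layer_split(stages: list[Stage], num_layers: int) -> dict[str, int]:
    """Allocate layers proportionally to each stage's predicted speed."""
    speeds = [1.0 / s.layer_time_ms for s in stages]
    total_speed = sum(speeds)

    # Proportional allocation, then hand the rounding remainder to the
    # stages with the largest fractional share.
    raw = [num_layers * v / total_speed for v in speeds]
    alloc = [int(x) for x in raw]
    remainder = num_layers - sum(alloc)
    order = sorted(range(len(stages)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1

    return {s.name: n for s, n in zip(stages, alloc)}


if __name__ == "__main__":
    # Assumed predictor output: the AMD stage is ~1.4x slower per layer.
    stages = [Stage("AMD", layer_time_ms=7.0),
              Stage("accelerator-A", layer_time_ms=5.0)]
    print(plan_layer_split(stages, num_layers=80))
    # -> {'AMD': 33, 'accelerator-A': 47}
```

In practice a planner would also weigh inter-stage communication cost (which the unified communicator incurs across vendor boundaries) rather than compute time alone; the sketch only illustrates the load-balancing idea.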
- Si Xu
- Zixiao Huang
- Yan Zeng
- Shengen Yan
- Xuefei Ning
- Haolin Ye
- Sipei Gu
- Chunsheng Shui
- Zhezheng Lin
- Hao Zhang
- Sheng Wang
- Guohao Dai
- Yu Wang
- Quanlu Zhang