HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models (2405.16256v2)
Abstract: Training large-scale models relies on a vast amount of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs. It is challenging to build a large-scale cluster with a single type of GPU-accelerator, and using multiple types of GPU-accelerators to construct a cluster is an effective way to address the shortage of homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models only support homogeneous GPU-accelerators and do not support heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models that supports heterogeneous clusters, including AMD, Nvidia, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize communication between heterogeneous GPU-accelerators, a distributed performance predictor, and an automatic parallel planner to develop and train models efficiently on heterogeneous GPU-accelerators. Compared with distributed training systems for homogeneous GPU-accelerators, our system supports six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD GPUs and 640 GPU-accelerators of type A). The experimental results show that the optimal performance of our system on the heterogeneous cluster reaches up to 97.49% of the theoretical upper-bound performance.
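To make the interplay between the performance predictor and the automatic parallel planner more concrete, below is a minimal sketch, not taken from HETHUB's code, of how predicted per-layer times on different accelerator types could be turned into a pipeline-stage layer split. All names, numbers, and the proportional-allocation heuristic are illustrative assumptions.

```python
"""Sketch: balance pipeline stages across heterogeneous GPU-accelerators.

Given per-layer compute times predicted for each device type (the role a
distributed performance predictor would play), assign transformer layers
to stages inversely proportional to those times so that all stages finish
at roughly the same moment. Hypothetical example, not HETHUB's planner.
"""

from dataclasses import dataclass


@dataclass
class Stage:
    name: str             # device-type label, e.g. "AMD" or "accelerator-A" (hypothetical)
    layer_time_ms: float  # predicted time per transformer layer on this device type


def plan_layer_split(stages: list[Stage], num_layers: int) -> dict[str, int]:
    """Allocate layers proportionally to each stage's predicted speed."""
    speeds = [1.0 / s.layer_time_ms for s in stages]
    total_speed = sum(speeds)

    # Proportional allocation, then hand the rounding remainder to the
    # stages with the largest fractional share.
    raw = [num_layers * v / total_speed for v in speeds]
    alloc = [int(x) for x in raw]
    remainder = num_layers - sum(alloc)
    order = sorted(range(len(stages)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1

    return {s.name: n for s, n in zip(stages, alloc)}


if __name__ == "__main__":
    # Assumed predictor output: the AMD stage is ~1.4x slower per layer.
    stages = [Stage("AMD", layer_time_ms=7.0),
              Stage("accelerator-A", layer_time_ms=5.0)]
    print(plan_layer_split(stages, num_layers=80))
    # -> {'AMD': 33, 'accelerator-A': 47}
```

In practice a planner would also weigh inter-stage communication cost (which the unified communicator incurs across vendor boundaries) rather than compute time alone; the sketch only illustrates the load-balancing idea.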
- Si Xu
- Zixiao Huang
- Yan Zeng
- Shengen Yan
- Xuefei Ning
- Haolin Ye
- Sipei Gu
- Chunsheng Shui
- Zhezheng Lin
- Hao Zhang
- Sheng Wang
- Guohao Dai
- Yu Wang
- Quanlu Zhang