Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters (2010.10458v1)
Abstract: Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, where the interconnection bandwidth between instances is only moderate, traditional state-of-the-art distributed training systems do not scale well when training large models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer models. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
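To make the top-k sparsification idea in the abstract concrete, the sketch below shows a generic sparsified gradient exchange: each worker keeps only its k largest-magnitude gradient entries and exchanges (value, index) pairs instead of the dense tensor. This is a minimal illustration under stated assumptions, not the paper's actual library or protocol; the helper names (`topk_sparsify`, `topk_allgather_step`) and the `density` parameter are hypothetical, and it assumes `torch.distributed` has already been initialized.

```python
import torch
import torch.distributed as dist


def topk_sparsify(grad: torch.Tensor, density: float = 0.01):
    """Keep only the k largest-magnitude entries of a gradient tensor.

    Hypothetical helper for illustration; returns the selected values and
    their flat indices, which is all a worker would need to transmit.
    """
    k = max(1, int(grad.numel() * density))
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices


def topk_allgather_step(grad: torch.Tensor, density: float = 0.01) -> torch.Tensor:
    """Exchange sparsified gradients with all-gather and rebuild a dense average.

    A generic sketch of top-k sparsified communication; the paper's system
    uses its own optimized communication library, not this exact scheme.
    """
    values, indices = topk_sparsify(grad, density)
    world_size = dist.get_world_size()

    # k is identical on every rank (same model, same density), so fixed-size
    # all-gather buffers are sufficient here.
    gathered_vals = [torch.empty_like(values) for _ in range(world_size)]
    gathered_idx = [torch.empty_like(indices) for _ in range(world_size)]
    dist.all_gather(gathered_vals, values)
    dist.all_gather(gathered_idx, indices)

    # Scatter the received sparse contributions back into a dense tensor
    # and average across workers.
    dense = torch.zeros_like(grad.flatten())
    for v, i in zip(gathered_vals, gathered_idx):
        dense[i] += v
    return (dense / world_size).view_as(grad)
```

With a density of 0.01, each worker communicates roughly 1% of the gradient entries (plus their indices) per step, which is the source of the bandwidth savings on clusters with only moderate inter-node bandwidth.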
- Shaohuai Shi
- Xianhao Zhou
- Shutao Song
- Xingyao Wang
- Zilin Zhu
- Xue Huang
- Xinan Jiang
- Feihu Zhou
- Zhenyu Guo
- Liqiang Xie
- Rui Lan
- Xianbin Ouyang
- Yan Zhang
- Jieqian Wei
- Jing Gong
- Weiliang Lin
- Ping Gao
- Peng Meng
- Xiaomin Xu
- Chenyang Guo