
Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters (2010.10458v1)

Published 20 Oct 2020 in cs.DC and cs.AI

Abstract: Distributed training techniques have been widely deployed in training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, due to the moderate interconnection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench by training ResNet-50 to 93% top-5 accuracy on ImageNet.
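
The abstract's core idea, communicating only the largest-magnitude gradient entries, can be illustrated with a minimal sketch. This is not the paper's actual library; the function name `topk_sparsify`, the `density` parameter, and the error-feedback residual handling are illustrative assumptions, shown here with standard PyTorch operations.

```python
import torch

def topk_sparsify(grad: torch.Tensor, density: float = 0.01):
    """Keep only the largest-magnitude `density` fraction of gradient entries.

    Returns the selected values and their flat indices. In typical top-k
    schemes, the dropped entries are accumulated locally as a residual
    (error feedback) and added back before the next selection.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * density))
    # Select the k entries with the largest absolute value.
    _, idx = torch.topk(flat.abs(), k, sorted=False)
    values = flat[idx]
    return values, idx

# Example: sparsify a dummy gradient before exchanging (values, indices)
# among workers instead of the full dense tensor.
g = torch.randn(1024, 1024)
vals, idx = topk_sparsify(g, density=0.01)
residual = g.clone()
residual.view(-1)[idx] = 0.0  # local residual kept for error feedback
```

With 1% density, each worker exchanges roughly 1% of the gradient values plus their indices, which is the bandwidth saving that makes such schemes attractive on the moderate inter-instance links of public cloud clusters.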

Authors (24)
  1. Shaohuai Shi (47 papers)
  2. Xianhao Zhou (2 papers)
  3. Shutao Song (2 papers)
  4. Xingyao Wang (29 papers)
  5. Zilin Zhu (4 papers)
  6. Xue Huang (2 papers)
  7. Xinan Jiang (1 paper)
  8. Feihu Zhou (4 papers)
  9. Zhenyu Guo (21 papers)
  10. Liqiang Xie (2 papers)
  11. Rui Lan (3 papers)
  12. Xianbin Ouyang (1 paper)
  13. Yan Zhang (954 papers)
  14. Jieqian Wei (1 paper)
  15. Jing Gong (17 papers)
  16. Weiliang Lin (2 papers)
  17. Ping Gao (19 papers)
  18. Peng Meng (5 papers)
  19. Xiaomin Xu (4 papers)
  20. Chenyang Guo (12 papers)
Citations (53)