
Evaluation and Optimization of Gradient Compression for Distributed Deep Learning (2306.08881v1)

Published 15 Jun 2023 in cs.LG and cs.DC

Abstract: To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank approximation with Power-SGD) on a 32-GPU cluster. The results show that they cannot always outperform well-optimized S-SGD, and can even be slower, because they are incompatible with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) used in S-SGD. To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also enjoys the three system optimizations like S-SGD. Compared with Power-SGD, the optimized ACP-SGD largely reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06x over S-SGD and 1.43x over Power-SGD, and it consistently outperforms other baselines across different setups (from 8 GPUs to 64 GPUs and from 1Gb/s Ethernet to 100Gb/s InfiniBand).
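
The abstract describes ACP-SGD only at a high level, so the sketch below is our own minimal, single-tensor illustration of "alternately compresses and communicates low-rank matrices", assuming a Power-SGD-style factorization grad ≈ P Qᵀ with a small rank r. The function name acp_sgd_step, the averaging point, and the orthogonalization step are our assumptions, not the authors' implementation; error feedback, tensor fusion, and pipelining from the paper are omitted.

```python
# Minimal sketch (ours, not the authors' code) of alternating low-rank
# gradient compression in the spirit of ACP-SGD, assuming a Power-SGD-style
# factorization grad ~= P @ Q.T with a small rank r.
import torch
import torch.distributed as dist

def acp_sgd_step(grad: torch.Tensor, P: torch.Tensor, Q: torch.Tensor, step: int):
    """Compress and all-reduce one 2-D gradient, updating only one factor per step.

    grad: (m, n) local gradient; P: (m, r) factor; Q: (n, r) factor.
    Returns (approximate averaged gradient, P, Q).
    """
    world_size = dist.get_world_size()
    if step % 2 == 0:
        # Even step: recompute and communicate P while Q stays fixed.
        P = grad @ Q                               # (m, r) projection
        dist.all_reduce(P, op=dist.ReduceOp.SUM)   # dense all-reduce of the small factor only
        P = P / world_size                         # average across workers
        P, _ = torch.linalg.qr(P)                  # orthonormalize for numerical stability
    else:
        # Odd step: recompute and communicate Q while P stays fixed.
        Q = grad.t() @ P                           # (n, r) projection
        dist.all_reduce(Q, op=dist.ReduceOp.SUM)
        Q = Q / world_size
    return P @ Q.t(), P, Q                         # low-rank estimate of the averaged gradient
```

The point of the sketch is that each iteration moves only one small (m x r) or (n x r) factor with a standard dense all-reduce, which is consistent with the abstract's claim that ACP-SGD retains the all-reduce, pipelining, and tensor-fusion optimizations of S-SGD while reducing communication volume relative to Power-SGD; the exact schedule and error-feedback handling are specified in the paper, not here.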

Authors (5)
  1. Lin Zhang (342 papers)
  2. Longteng Zhang (4 papers)
  3. Shaohuai Shi (47 papers)
  4. Xiaowen Chu (108 papers)
  5. Bo Li (1107 papers)
Citations (5)
