Evaluation and Optimization of Gradient Compression for Distributed Deep Learning (2306.08881v1)
Abstract: To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank compression with Power-SGD) on a 32-GPU cluster. The results show that these methods cannot always outperform well-optimized S-SGD, and can even perform worse, because they are incompatible with three key system optimization techniques used in S-SGD: all-reduce, pipelining, and tensor fusion. To address this issue, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates the low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also benefits from the three system optimizations, just as S-SGD does. Compared with Power-SGD, the optimized ACP-SGD substantially reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06x over S-SGD and 1.43x over Power-SGD, and it consistently outperforms other baselines across different setups (from 8 to 64 GPUs and from 1Gb/s Ethernet to 100Gb/s InfiniBand).
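To make the alternating idea concrete, below is a minimal sketch of how one might alternately compute and all-reduce the two factors of a Power-SGD-style rank-r approximation M ≈ P Qᵀ of a 2-D gradient. This is an illustration under stated assumptions, not the paper's implementation: the class name, the fixed rank, and the plain `torch.distributed.all_reduce` call are hypothetical, and error feedback, pipelining with backpropagation, and tensor fusion (which the paper optimizes) are omitted.

```python
# Minimal sketch of alternating low-rank gradient compression (Power-SGD style),
# assuming a 2-D gradient matrix and an initialized torch.distributed process group.
import torch
import torch.distributed as dist


class AlternatingLowRankCompressor:
    """Alternately updates and all-reduces the P (n x r) and Q (m x r)
    factors of a rank-r approximation M ~= P Q^T of a gradient M (n x m)."""

    def __init__(self, grad_shape, rank_r=4, device="cpu"):
        n, m = grad_shape
        self.p = torch.zeros(n, rank_r, device=device)
        # Random init for Q; it is orthogonalized before first use.
        self.q = torch.randn(m, rank_r, device=device)
        self.step = 0

    @staticmethod
    def _orthogonalize(x):
        # QR-based orthonormalization of the columns of x.
        q, _ = torch.linalg.qr(x)
        return q

    def compress_and_allreduce(self, grad):
        """grad: 2-D tensor (n x m). Returns the decompressed (approximate) gradient."""
        if self.step % 2 == 0:
            # Even step: fix Q, compute and all-reduce only P = M Q.
            q = self._orthogonalize(self.q)
            self.p = grad @ q
            if dist.is_initialized():
                dist.all_reduce(self.p)           # sum of per-worker P
                self.p /= dist.get_world_size()   # average
            approx = self.p @ q.t()
        else:
            # Odd step: fix P, compute and all-reduce only Q = M^T P.
            p = self._orthogonalize(self.p)
            self.q = grad.t() @ p
            if dist.is_initialized():
                dist.all_reduce(self.q)
                self.q /= dist.get_world_size()
            approx = p @ self.q.t()
        self.step += 1
        return approx
```

In this sketch, each iteration communicates only one dense n×r or m×r factor, which keeps the transferred tensors all-reduce-friendly and roughly halves the per-step volume relative to sending both Power-SGD factors every step.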
Authors: Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li