Global Momentum Compression for Sparse Communication in Distributed Learning (1905.12948v3)

Published 30 May 2019 in stat.ML and cs.LG

Abstract: With the rapid growth of data, distributed momentum stochastic gradient descent (DMSGD) has been widely used in distributed learning, especially for training large-scale deep models. Due to the latency and limited bandwidth of the network, communication has become the bottleneck of distributed learning. Communication compression with sparsified gradients, abbreviated as sparse communication, has been widely employed to reduce communication cost. All existing works on sparse communication in DMSGD employ local momentum, in which the momentum only accumulates stochastic gradients computed locally by each worker. In this paper, we propose a novel method, called global momentum compression (GMC), for sparse communication. Different from existing works that utilize local momentum, GMC utilizes global momentum. Furthermore, to enhance convergence performance when using more aggressive sparsification compressors (e.g., RBGS), we extend GMC to GMC+. We theoretically prove the convergence of GMC and GMC+. To the best of our knowledge, this is the first work that introduces global momentum for sparse communication in distributed learning. Empirical results demonstrate that, compared with their local momentum counterparts, GMC and GMC+ achieve higher test accuracy and faster convergence, especially under non-IID data distributions.
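
The abstract's contrast between local and global momentum under sparse communication can be illustrated with a small sketch. The code below is an assumption-laden illustration, not the paper's exact GMC/GMC+ update: it assumes a top-k sparsifier, an error-feedback memory term, and simulated parameter-server aggregation, and it approximates the global-momentum term by the previous global model movement `(w_prev - w) / lr`; all function and variable names are hypothetical.

```python
# Illustrative sketch only: contrasts local-momentum sparse communication with a
# global-momentum variant. Not the paper's exact GMC/GMC+ algorithm.
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def local_momentum_step(grads, states, w, lr=0.1, beta=0.9, k=10):
    """Local momentum: each worker accumulates momentum over ITS OWN stochastic
    gradients, then sparsifies (with error-feedback memory) before sending."""
    msgs = []
    for g, st in zip(grads, states):
        st["m"] = beta * st["m"] + g          # momentum over local gradients only
        v = st["e"] + st["m"]                 # add unsent residual (error feedback)
        c = topk(v, k)                        # sparse message
        st["e"] = v - c                       # remember what was not sent
        msgs.append(c)
    return w - lr * np.mean(msgs, axis=0)     # server-side aggregation

def global_momentum_step(grads, states, w, w_prev, lr=0.1, beta=0.9, k=10):
    """Global-momentum flavour: the momentum term is driven by the previous
    GLOBAL model movement, which all workers share, rather than each worker's
    private gradient history."""
    global_dir = (w_prev - w) / lr            # aggregated direction of the last update
    msgs = []
    for g, st in zip(grads, states):
        v = st["e"] + g + beta * global_dir   # global momentum term
        c = topk(v, k)
        st["e"] = v - c
        msgs.append(c)
    return w - lr * np.mean(msgs, axis=0)
```

The key design difference the sketch tries to convey: in the local variant, each worker's momentum buffer drifts with its own (possibly non-IID) data, while in the global variant the shared direction `(w_prev - w) / lr` reflects the aggregated update of all workers, which is the intuition behind using global momentum for heterogeneous data.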

Authors (5)
  1. Chang-Wei Shi (4 papers)
  2. Shen-Yi Zhao (13 papers)
  3. Yin-Peng Xie (3 papers)
  4. Hao Gao (59 papers)
  5. Wu-Jun Li (57 papers)
Citations (1)