
AdaGossip: Adaptive Consensus Step-size for Decentralized Deep Learning with Communication Compression (2404.05919v1)

Published 9 Apr 2024 in cs.LG

Abstract: Decentralized learning is crucial for supporting on-device learning over large distributed datasets, eliminating the need for a central server. However, communication overhead remains a major bottleneck for the practical realization of such decentralized setups. To tackle this issue, several algorithms for decentralized training with compressed communication have been proposed in the literature. Most of these algorithms introduce an additional hyper-parameter, referred to as the consensus step-size, which is tuned based on the compression ratio at the beginning of training. In this work, we propose AdaGossip, a novel technique that adaptively adjusts the consensus step-size based on the compressed model differences between neighboring agents. We demonstrate the effectiveness of the proposed method through an extensive set of experiments on various computer vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, Imagenette, and ImageNet), model architectures, and network topologies. Our experiments show that the proposed method achieves superior performance ($0$-$2\%$ improvement in test accuracy) compared to the current state-of-the-art method for decentralized learning with communication compression.
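The abstract describes AdaGossip as replacing the hand-tuned consensus step-size with one adapted from the compressed model differences exchanged between neighboring agents. The following minimal Python/NumPy sketch illustrates that general idea only; it is not the paper's exact update rule. The top-k compressor, the running accumulator `v`, and the hyper-parameters `gamma0`, `beta`, and `eps` are illustrative assumptions.

```python
# Illustrative sketch of an adaptive consensus step-size gossip update.
# NOTE: this is NOT the exact AdaGossip rule from the paper; it only shows
# the general idea of scaling the consensus step-size by a running statistic
# of the compressed neighbor differences (hypothetical hyper-parameters:
# gamma0, beta, eps; top-k sparsification chosen as an example compressor).
import numpy as np

rng = np.random.default_rng(0)

def top_k_compress(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def adaptive_gossip_step(x_i, x_neighbors, v, gamma0=1.0, beta=0.9, eps=1e-8, k=2):
    """One consensus step for agent i.

    x_i         : current parameters of agent i
    x_neighbors : list of neighbor parameter vectors
    v           : running second-moment estimate of compressed differences
    Returns the updated (x_i, v).
    """
    # Compressed disagreement with each neighbor.
    diffs = [top_k_compress(x_j - x_i, k) for x_j in x_neighbors]
    avg_diff = np.mean(diffs, axis=0)

    # Track the magnitude of the compressed differences
    # (exponential moving average of their squares).
    v = beta * v + (1.0 - beta) * avg_diff ** 2

    # Adapt the consensus step-size element-wise instead of hand-tuning it,
    # keeping the effective mixing weight in [0, 1].
    gamma = gamma0 / (np.sqrt(v) + eps)
    x_i = x_i + np.clip(gamma, 0.0, 1.0) * avg_diff
    return x_i, v

# Toy usage: one agent gossiping a 5-dimensional vector with two neighbors.
xs = [rng.normal(size=5) for _ in range(3)]
v = np.zeros(5)
for _ in range(50):
    xs[0], v = adaptive_gossip_step(xs[0], xs[1:], v)
print(xs[0])
```

The design intent mirrored here is that, when compression makes the observed neighbor differences small or noisy, the accumulator shrinks or grows the effective mixing weight automatically, so the consensus step-size no longer needs to be retuned for each compression ratio.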

