Correlated Quantization for Faster Nonconvex Distributed Optimization (2401.05518v1)
Abstract: Quantization (Alistarh et al., 2017) is an important (stochastic) compression technique that reduces the volume of transmitted bits during each communication round in distributed model training. Suresh et al. (2022) introduce correlated quantizers and show their advantages over independent counterparts by analyzing the communication complexity of distributed SGD. We analyze the state-of-the-art distributed non-convex optimization algorithm MARINA (Gorbunov et al., 2022) equipped with these correlated quantizers and show that it outperforms both the original MARINA and the distributed SGD of Suresh et al. (2022) in terms of communication complexity. Using the weighted Hessian variance (Tyurin et al., 2022), we significantly refine the original analysis of MARINA without any additional assumptions, and we then expand the theoretical framework of MARINA to accommodate a substantially broader range of potentially correlated and biased compressors, thus extending the applicability of the method beyond the conventional setup of independent unbiased compressors. Extensive experimental results corroborate our theoretical findings.
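To make the abstract's core idea concrete, below is a minimal sketch of one-bit correlated quantization for distributed mean estimation in the spirit of Suresh et al. (2022). It is an illustrative simulation, not the paper's implementation: the function names and experiment parameters are ours. Each of n clients holds one coordinate in [0, 1] and sends a single bit; sharing a random permutation and a common shift stratifies the dither levels across clients, so each client's quantizer remains unbiased while the errors partially cancel in the average.

```python
import numpy as np

def independent_quantize(x, rng):
    # Standard one-bit dithered rounding: client i sends 1 with
    # probability x_i, using its own independent uniform u_i.
    # Unbiased per client, but errors across clients are independent.
    u = rng.random(x.shape)
    return (u < x).astype(float)

def correlated_quantize(x, rng):
    # Correlated variant (illustrative): a shared random permutation
    # plus a shared shift makes the dither levels a stratified sample
    # of [0, 1). Each u_i is still marginally Uniform[0, 1), so every
    # client stays unbiased, but errors are negatively correlated.
    n = x.shape[0]
    u = (rng.permutation(n) + rng.random()) / n
    return (u < x).astype(float)

rng = np.random.default_rng(0)
n, trials = 64, 10_000
x = rng.random(n)  # one coordinate per client, each in [0, 1]
err_ind = [independent_quantize(x, rng).mean() - x.mean() for _ in range(trials)]
err_cor = [correlated_quantize(x, rng).mean() - x.mean() for _ in range(trials)]
print(f"MSE of averaged estimate, independent: {np.mean(np.square(err_ind)):.2e}")
print(f"MSE of averaged estimate, correlated:  {np.mean(np.square(err_cor)):.2e}")
```

On this toy setup the correlated quantizer yields a markedly smaller mean-squared error for the averaged estimate, which mirrors the variance reduction that the paper exploits inside MARINA's compressed gradient differences.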
- D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.
- A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
- T. Brown et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204, San Diego, California, USA, 09–12 May 2015. PMLR.
- L. Condat, I. Agarský, G. Malinovsky, and P. Richtárik. TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation. 2023. URL https://api.semanticscholar.org/CorpusID:258887349.
- Y. Demidovich, G. Malinovsky, I. Sokolov, and P. Richtárik. A guide through the zoo of biased SGD. arXiv preprint arXiv:2305.16296, 2023.
- E. Gorbunov, K. Burlachenko, Z. Li, and P. Richtárik. MARINA: Faster non-convex distributed learning with compression, 2022.
- M. Grudzień, G. Malinovsky, and P. Richtárik. Improving accelerated federated learning with compression and importance sampling. arXiv preprint arXiv:2306.03240, 2023.
- S. Horváth, C.-Y. Ho, Ľ. Horváth, A. N. Sahu, M. Canini, and P. Richtárik. Natural compression for distributed deep learning. In Proceedings of MSML’22, Aug 2022.
- S. Horváth, D. Kovalev, K. Mishchenko, P. Richtárik, and S. Stich. Stochastic distributed learning with gradient quantization and double-variance reduction. Optimization Methods and Software, 38(1):91–106, 2023.
- P. Kairouz et al. Advances and open problems in federated learning, 2019.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
- D. Kovalev, S. Horváth, and P. Richtárik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In International Conference on Algorithmic Learning Theory, 2019.
- G. Lan, Z. Li, and Y. Zhou. A unified variance-reduced accelerated gradient method for convex optimization. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. A. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 10462–10472, 2019.
- Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
- Z. Li, D. Kovalev, X. Qian, and P. Richtárik. Acceleration for compressed gradient descent in distributed and federated optimization. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5895–5904. PMLR, 13–18 Jul 2020.
- Z. Li, H. Bao, X. Zhang, and P. Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6286–6295. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/li21a.html.
- H. Lim, D. G. Andersen, and M. Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389, 2018.
- Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
- P. Mayekar and H. Tyagi. RATQ: A universal fixed-length quantizer for stochastic optimization, 2019.
- B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR, 20–22 Apr 2017.
- K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
- K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 15750–15769. PMLR, 17–23 Jul 2022.
- P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. SparkNet: Training deep networks in Spark. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
- Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Doklady AN USSR, 269:543–547, 1983.
- Y. Nesterov. Introductory lectures on convex optimization: a basic course (Applied Optimization). Kluwer Academic Publishers, 2004.
- D. Povey, X. Zhang, and S. Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. arXiv preprint, 2014.
- P. Richtárik, I. Sokolov, and I. Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. arXiv preprint arXiv:2106.05203, 2021.
- M. Safaryan, E. Shulgin, and P. Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 11(2):557–580, 2021. doi: 10.1093/imaiai/iaab006. URL https://doi.org/10.1093/imaiai/iaab006.
- F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech, 2014.
- A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 3329–3337. JMLR.org, 2017.
- A. T. Suresh, Z. Sun, J. H. Ro, and F. Yu. Correlated quantization for distributed mean estimation and optimization, 2022.
- R. Szlendak, A. Tyurin, and P. Richtárik. Permutation compressors for provably faster distributed nonconvex optimization, 2021.
- A. Tyurin, L. Sun, K. Burlachenko, and P. Richtárik. Sharper rates and flexible framework for nonconvex SGD with client and data sampling. arXiv preprint arXiv:2206.02275, 2022.
- S. Vargaftik, R. Ben Basat, A. Portnoy, G. Mendelson, M. Mitzenmacher, and Y. Ben-Itzhak. DRIVE: One-bit distributed mean estimation, 2021.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. ATOMO: Communication-efficient learning via atomic sparsification. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/33b3214d792caf311e1f00fd22b392c5-Paper.pdf.
- J. Wang et al. CocktailSGD: Fine-tuning foundation models over 500Mbps networks. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 36058–36076. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/wang23t.html.
- W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- T. Yang et al. A survey of distributed optimization. Annual Reviews in Control, 47:278–305, 2019. ISSN 1367-5788.
- H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/zhang17e.html.
- Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Symposium on the Theory of Computing, 2017.