Momentum Benefits Non-IID Federated Learning Simply and Provably (2306.16504v3)

Published 28 Jun 2023 in cs.LG and math.OC

Abstract: Federated learning is a powerful paradigm for large-scale machine learning, but it faces significant challenges due to unreliable network connections, slow communication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two prominent algorithms to address these challenges. In particular, FedAvg employs multiple local updates before communicating with a central server, while SCAFFOLD maintains a control variable on each client to compensate for "client drift" in its local updates. Various methods have been proposed to enhance the convergence of these two algorithms, but they either make impractical adjustments to the algorithmic structure or rely on the assumption of bounded data heterogeneity. This paper explores the use of momentum to enhance the performance of FedAvg and SCAFFOLD. When all clients participate in the training process, we demonstrate that incorporating momentum allows FedAvg to converge without the assumption of bounded data heterogeneity, even when using a constant local learning rate. This is novel and fairly surprising, as existing analyses of FedAvg require bounded data heterogeneity even with diminishing local learning rates. Under partial client participation, we show that momentum enables SCAFFOLD to converge provably faster without imposing any additional assumptions. Furthermore, we use momentum to develop new variance-reduced extensions of FedAvg and SCAFFOLD, which exhibit state-of-the-art convergence rates. Our experimental results support all theoretical findings.
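
To make the algorithmic setup concrete, below is a minimal, self-contained sketch of FedAvg with client-level momentum on a toy heterogeneous quadratic problem. It is an illustrative reconstruction only, not the paper's exact algorithm or analysis setting; the function name fedavg_momentum and hyperparameters such as local_steps and beta are assumptions chosen for this example.

    # Illustrative sketch (assumed names/hyperparameters), not the paper's exact method.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, num_clients = 5, 10

    # Heterogeneous (non-IID) client objectives: f_i(x) = 0.5 * ||x - b_i||^2,
    # with distinct minimizers b_i, so the global minimizer is mean(b_i).
    client_targets = [rng.normal(size=dim) * (i + 1) for i in range(num_clients)]

    def local_grad(x, i):
        """Stochastic gradient of client i's objective (noise mimics mini-batching)."""
        return (x - client_targets[i]) + 0.1 * rng.normal(size=dim)

    def fedavg_momentum(rounds=200, local_steps=5, lr=0.1, beta=0.9):
        x = np.zeros(dim)                                 # global model held by the server
        m = [np.zeros(dim) for _ in range(num_clients)]   # per-client momentum buffers
        for _ in range(rounds):
            local_models = []
            for i in range(num_clients):
                y = x.copy()                              # each client starts from the global model
                for _ in range(local_steps):
                    m[i] = beta * m[i] + (1 - beta) * local_grad(y, i)  # momentum update
                    y -= lr * m[i]                        # local step along the momentum direction
                local_models.append(y)
            x = np.mean(local_models, axis=0)             # server averages the local models
        return x

    x_star = np.mean(client_targets, axis=0)              # minimizer of the average objective
    x_hat = fedavg_momentum()
    print("distance to global optimum:", np.linalg.norm(x_hat - x_star))

On this toy problem the global objective (the average of the clients' quadratics) is minimized at the mean of the client targets, so the printed distance gives a rough sense of how close the averaged model gets despite the non-IID local objectives.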
