The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication (2405.11667v1)

Published 19 May 2024 in cs.LG, cs.DC, math.OC, and stat.ML

Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.
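
To make the comparison in the abstract concrete, the following is a minimal sketch of local SGD with intermittent communication next to mini-batch SGD under the same communication budget. It is an illustration of the two algorithms being compared, not the paper's analysis or experiments; the synthetic least-squares clients, dimensions, step size, and number of local steps are all illustrative assumptions.

```python
import numpy as np

# Sketch: local SGD vs. mini-batch SGD on synthetic heterogeneous
# least-squares clients. All problem sizes below are assumptions.
rng = np.random.default_rng(0)
d, M, R, K, lr = 10, 4, 50, 8, 0.05  # dim, clients, rounds, local steps, step size

# Client m holds its own quadratic objective f_m(x) = 0.5 * ||A_m x - b_m||^2,
# so the clients' data distributions differ (data heterogeneity).
A = [rng.normal(size=(20, d)) for _ in range(M)]
b = [rng.normal(size=20) for _ in range(M)]

def stochastic_grad(m, x):
    # One-sample stochastic gradient of f_m at x.
    i = rng.integers(len(b[m]))
    return A[m][i] * (A[m][i] @ x - b[m][i])

def local_sgd():
    x = np.zeros(d)
    for _ in range(R):                       # R communication rounds
        client_iterates = []
        for m in range(M):                   # each client starts from the shared iterate
            xm = x.copy()
            for _ in range(K):               # K local steps without communication
                xm -= lr * stochastic_grad(m, xm)
            client_iterates.append(xm)
        x = np.mean(client_iterates, axis=0) # server averages client iterates
    return x

def minibatch_sgd():
    x = np.zeros(d)
    for _ in range(R):                       # same communication budget
        # All M*K stochastic gradients are evaluated at the shared iterate.
        g = np.mean([stochastic_grad(m, x) for m in range(M) for _ in range(K)], axis=0)
        x -= lr * g
    return x

def objective(x):
    return 0.5 * np.mean([np.sum((A[m] @ x - b[m]) ** 2) for m in range(M)])

print("local SGD  :", objective(local_sgd()))
print("mini-batch :", objective(minibatch_sgd()))
```

The structural difference the paper studies is visible here: local SGD takes K gradient steps per round on each client's own (possibly heterogeneous) objective before averaging, while mini-batch SGD spends the same K gradients per client on a single, better-averaged step at the shared iterate.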

Authors (8)
  1. Kumar Kshitij Patel (11 papers)
  2. Margalit Glasgow (15 papers)
  3. Ali Zindari (3 papers)
  4. Lingxiao Wang (74 papers)
  5. Sebastian U. Stich (66 papers)
  6. Ziheng Cheng (16 papers)
  7. Nirmit Joshi (8 papers)
  8. Nathan Srebro (145 papers)
Citations (1)