Stochastic Controlled Averaging for Federated Learning with Communication Compression (2308.08165v2)

Published 16 Aug 2023 in math.OC, cs.DC, cs.LG, and stat.ML

Abstract: Communication compression, a technique that reduces the volume of information transmitted over the air, has gained great interest in Federated Learning (FL) for its potential to alleviate communication overhead. However, communication compression brings new challenges to FL due to the interplay between compression-incurred information distortion and inherent characteristics of FL such as partial participation and data heterogeneity. Despite recent developments, the potential of compressed FL approaches has not been fully exploited. Existing approaches either cannot accommodate arbitrary data heterogeneity or partial participation, or require stringent conditions on compression. In this paper, we revisit the seminal stochastic controlled averaging method by proposing an equivalent but more efficient and simplified formulation with halved uplink communication costs. Building upon this implementation, we propose two compressed FL algorithms, SCALLION and SCAFCOM, to support unbiased and biased compression, respectively. Both proposed methods outperform existing compressed FL methods in terms of communication and computation complexities. Moreover, SCALLION and SCAFCOM accommodate arbitrary data heterogeneity and do not make any additional assumptions on compression errors. Experiments show that SCALLION and SCAFCOM match the performance of corresponding full-precision FL approaches with substantially reduced uplink communication and outperform recent compressed FL methods under the same communication budget.


Summary

Stochastic Controlled Averaging for Federated Learning with Communication Compression

Federated Learning (FL) has emerged as a powerful paradigm for training machine learning models across decentralized data sources such as mobile devices and remote sensors. The approach promotes data privacy by having local devices transmit model updates, rather than raw data, to a central server. Despite its advantages, FL faces challenges from communication overhead, data heterogeneity, and partial client participation. Communication compression, aimed at reducing this overhead, introduces additional complexity through compression-induced information distortion.

The paper, "Stochastic Controlled Averaging for Federated Learning with Communication Compression," revisits a seminal approach in FL by presenting a more communication-efficient variant. It proposes two algorithms designed to accommodate unbiased and biased compression methods.

Key Contributions

  • Simplified Controlled Averaging: The authors introduce a refined formulation of stochastic controlled averaging that halves uplink communication. Instead of sending both the local model update and the control-variable update, each client transmits only one compressed vector per round (see the sketch after this list).
  • Robust to Data Heterogeneity and Partial Participation: Both proposed algorithms accommodate arbitrary data heterogeneity without additional assumptions and retain strong empirical performance even when only a small fraction of clients participate in each round.
  • Superior Convergence Rates: The algorithms, namely SCALLION and SCAFCOM, are supported by rigorous theoretical analyses showing state-of-the-art convergence rates comparable to full-precision counterparts.
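The single-upload idea can be illustrated with a short sketch. This is not the paper's exact SCALLION/SCAFCOM recursion; it is a simplified, SCAFFOLD/DIANA-style round in which every client uploads one compressed vector, and the server derives both its model step and its control-variate update from that vector. The names (fl_round, random_k, lr_l, lr_g, alpha) and the specific update rules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_k(v, k):
    """Unbiased random-k sparsification: keep k random coordinates, rescale by d/k."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = (d / k) * v[idx]
    return out

def fl_round(x, c, client_controls, grad_fns, lr_l=0.05, lr_g=1.0, steps=5, k=10):
    """One round: each client uploads a single compressed vector; the server uses it
    both to update the global model and to track the average local progress."""
    alpha = k / x.size                            # damping = 1/(omega + 1) for random-k
    uploads = []
    for c_i, grad in zip(client_controls, grad_fns):
        y = x.copy()
        for _ in range(steps):                    # control-variate-corrected local SGD
            y = y - lr_l * (grad(y) + c - c_i)
        d_i = (x - y) / (steps * lr_l)            # average local progress this round
        m_i = random_k(d_i - c_i, k)              # the ONE compressed uplink message
        c_i += alpha * m_i                        # in-place update of the local control variate
        uploads.append(m_i)
    avg = np.mean(uploads, axis=0)
    x = x - lr_g * steps * lr_l * (c + avg)       # unbiased estimate of the average progress
    c = c + alpha * avg                           # server control variate stays the mean of c_i
    return x, c

# Toy usage: quadratics f_i(w) = 0.5 * ||w - t_i||^2 with heterogeneous targets t_i.
d, n = 100, 8
targets = [rng.standard_normal(d) for _ in range(n)]
grads = [lambda w, t=t: w - t for t in targets]
x, c = np.zeros(d), np.zeros(d)
controls = [np.zeros(d) for _ in range(n)]
for _ in range(300):
    x, c = fl_round(x, c, controls, grads)
print(np.linalg.norm(x - np.mean(targets, axis=0)))  # should shrink toward zero
```

Per client per round, only the compressed vector m_i travels uplink, which is the halving of communication relative to sending both a model update and a control-variable update.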

Numerical Results and Implications

Experiments on widely used datasets (MNIST, Fashion-MNIST) with both biased and unbiased compressors underscore the effectiveness of SCALLION and SCAFCOM. The proposed methods achieve results close to full-precision FL with significantly reduced communication costs, reaching up to 100x uplink compression in certain setups.
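As background on the two compressor families mentioned above, the snippet below shows standard examples: an unbiased random-k sparsifier (the family SCALLION targets) and a biased top-k sparsifier (the family SCAFCOM targets). These are common choices in the compressed-FL literature; the paper's exact compressors and compression ratios may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_k(v, k):
    """Unbiased random-k sparsification: E[rand_k(v)] = v thanks to the d/k rescaling."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = (d / k) * v[idx]
    return out

def top_k(v, k):
    """Biased top-k sparsification: keep the k largest-magnitude entries, no rescaling."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

v = rng.standard_normal(1000)
# Transmitting 10 of 1000 coordinates is roughly a 100x reduction in the number of
# values sent per upload (index overhead ignored).
print(np.count_nonzero(rand_k(v, 10)), np.count_nonzero(top_k(v, 10)))
```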

  1. Experimental Validation: With proper tuning, SCALLION and SCAFCOM match or outperform existing compressed FL methods under the same communication budget. Their robustness to client drift and compression distortion positions them as practical solutions for real-world deployments.
  2. Compression Efficiency: By reducing uplink communication without sacrificing accuracy, these techniques offer a practical path to scaling FL to larger client populations and more resource-constrained networks.

Future Prospects

This research invites further exploration into adaptive compression schemes, privacy-preserving protocols, and hybrid models that might integrate FL with other distributed learning frameworks. Given the ever-increasing demand for privacy and resource-efficient machine learning solutions, these contributions present promising avenues for future advancements in AI systems.

In conclusion, the paper advances federated learning by mitigating critical bottlenecks in communication and client variability, laying the groundwork for more efficient and robust machine learning systems distributed across diverse and decentralized environments.
