Towards a Better Theoretical Understanding of Independent Subnetwork Training (2306.16484v2)

Published 28 Jun 2023 in cs.LG, cs.DC, and math.OC

Abstract: Modern advancements in large-scale machine learning would be impossible without the paradigm of data-parallel distributed computing. Since distributed computing with large-scale models places excessive pressure on communication channels, significant recent research has been directed toward co-designing communication compression strategies and training algorithms with the goal of reducing communication costs. While pure data parallelism allows better data scaling, it suffers from poor model scaling properties: compute nodes are severely limited by memory constraints, preventing further increases in model size. For this reason, the latest achievements in training giant neural network models also rely on some form of model parallelism. In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for addressing the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model.
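
To make the setting concrete, below is a minimal sketch of an IST-style update on a quadratic objective f(x) = 0.5 xᵀAx − bᵀx: a server splits the coordinates into disjoint blocks, each worker performs a few gradient steps only on its own block (its "independent subnetwork"), and the server then reassembles the full model. The partitioning scheme, step size, and local step count are illustrative assumptions, not the exact algorithm or the rates analyzed in the paper.

```python
# A minimal, assumption-laden sketch of Independent Subnetwork Training (IST)
# on a quadratic objective f(x) = 0.5 * x^T A x - b^T x. The block sampling,
# step size, and aggregation rule are illustrative choices, not the precise
# method or analysis from the paper.
import numpy as np

rng = np.random.default_rng(0)

d, n_workers = 20, 4                 # model dimension and number of compute nodes
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)              # symmetric positive definite quadratic
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)       # exact minimizer, used only for monitoring

def grad(x):
    """Full gradient of f(x) = 0.5 * x^T A x - b^T x."""
    return A @ x - b

x = np.zeros(d)
lr, local_steps, rounds = 0.01, 5, 200

for r in range(rounds):
    # The server samples a random permutation and splits the coordinates into
    # disjoint blocks; each block plays the role of one worker's subnetwork.
    perm = rng.permutation(d)
    blocks = np.array_split(perm, n_workers)

    new_x = x.copy()
    for blk in blocks:
        # Each worker updates only its own coordinates; the remaining
        # coordinates stay frozen at the values broadcast by the server.
        local = x.copy()
        for _ in range(local_steps):
            g = grad(local)
            local[blk] -= lr * g[blk]  # masked (subnetwork) gradient step
        new_x[blk] = local[blk]        # the worker returns only its block

    # The server reassembles the full model from the disjoint blocks.
    x = new_x

    if r % 50 == 0:
        print(f"round {r:3d}  distance to minimizer: {np.linalg.norm(x - x_star):.4f}")
```

In this sketch, each round sends every coordinate to exactly one worker and receives it back once, so per-round communication and per-worker memory scale with the submodel size rather than the full model size; this is the model-scaling benefit the abstract attributes to model-parallel approaches such as IST.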

Authors (2)
  1. Egor Shulgin
  2. Peter Richtárik
