Coordinating Distributed Example Orders for Provably Accelerated Training (2302.00845v5)

Published 2 Feb 2023 in cs.LG, cs.DC, and math.OC

Abstract: Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
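
To make the ordering idea in the abstract concrete, below is a minimal Python sketch of the kind of greedy sign-balancing reorder that underlies centralized GraB, which CD-GraB coordinates across parallel workers. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name balance_reorder, the synthetic gradients, and the specific greedy sign rule are chosen here for exposition, and the distributed pair-balancing step based on kernel thinning is omitted.

import numpy as np

def balance_reorder(stale_grads, prev_order):
    # Reorder examples using stale per-example gradients recorded in the
    # previous epoch. Greedy sign balancing keeps the running sum of
    # centered gradients small; examples assigned +1 go to the front in
    # visit order, examples assigned -1 go to the back in reverse order.
    d = stale_grads.shape[1]
    centered = stale_grads - stale_grads.mean(axis=0)
    running = np.zeros(d)
    front, back = [], []
    for idx in prev_order:
        g = centered[idx]
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(idx)
        else:
            running -= g
            back.append(idx)
    return np.array(front + back[::-1])

# Toy usage: in practice the gradients would be recorded during training.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))
order = np.arange(8)
for epoch in range(3):
    order = balance_reorder(grads, order)

The point of the front/back split is that prefix sums of the reordered, centered gradients stay small from epoch to epoch, which is the herding-style discrepancy control that drives the faster-than-random-reshuffling convergence guarantees discussed above.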
