Coordinating Distributed Example Orders for Provably Accelerated Training (2302.00845v5)
Abstract: Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
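To make the balancing idea in the abstract concrete, the sketch below illustrates how stale per-example gradients can drive a reordering: each gradient is greedily assigned a +/- sign so that the running signed sum stays small, and the next epoch visits "+" examples first and "-" examples last. This is a minimal NumPy illustration of the general herding/balancing intuition, not the paper's exact GraB or CD-GraB algorithm; the function name `balance_reorder` and the greedy sign rule are assumptions made for exposition.

```python
# Minimal sketch (illustrative, not the paper's exact algorithm) of
# balancing-based example ordering from stale per-example gradients.
import numpy as np


def balance_reorder(stale_grads: np.ndarray) -> np.ndarray:
    """Return a permutation of example indices built from stale gradients.

    stale_grads: array of shape (n_examples, dim) -- gradients recorded
                 during the previous epoch, centered so they roughly sum to zero.
    """
    n = stale_grads.shape[0]
    centered = stale_grads - stale_grads.mean(axis=0, keepdims=True)

    running = np.zeros(stale_grads.shape[1])
    front, back = [], []
    for i in range(n):
        g = centered[i]
        # Greedy sign choice: keep whichever running signed sum is smaller.
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(i)   # "+" examples are visited early next epoch
        else:
            running -= g
            back.append(i)    # "-" examples are visited late next epoch
    return np.array(front + back[::-1])


# Toy usage: reorder 8 examples with 4-dimensional stale gradients.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(8, 4))
    print(balance_reorder(grads))
```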
- Collective communications library with various primitives for multi-machine training, 2023. URL https://github.com/facebookincubator/gloo.
- Discrepancy minimization via a self-balancing walk. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 14–20, 2021.
- Targeted separation and convergence with kernel discrepancies. arXiv preprint arXiv:2209.12835, 2022.
- Dimitri P. Bertsekas. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. In Optimization for Machine Learning. The MIT Press, 2011.
- Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
- Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Is My Prediction Arbitrary? Measuring Self-Consistency in Fair Classification. arXiv preprint arXiv:2301.11562, 2023.
- Christopher De Sa. Random Reshuffling is Not Always Better. In Advances in Neural Information Processing Systems, 2020.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- Kernel thinning. arXiv preprint arXiv:2105.05842, 2021.
- Generalized Kernel Thinning. In Tenth International Conference on Learning Representations, 2022.
- Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320. PMLR, 2017.
- Convergence Rate of Incremental Gradient and Incremental Newton Methods. SIAM Journal on Optimization, 29(4):2542–2565, 2019.
- Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186(1):49–84, 2021.
- Random Shuffling Beats SGD after Finite Epochs. In Proceedings of the International Conference on Machine Learning, volume 97, pages 2624–2633, 2019.
- Near-Optimal Herding. In Proceedings of The 27th Conference on Learning Theory, volume 35, pages 1165–1182, 2014.
- Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Distributed Random Reshuffling over Networks. arXiv preprint arXiv:2112.15287, 2021.
- Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
- Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 583–598, USA, 2014. USENIX Association. ISBN 9781931971164.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- A General Analysis of Example-Selection for Stochastic Gradient Descent. In International Conference on Learning Representations, 2021a.
- Variance Reduced Training with Stratified Sampling for Forecasting Models. In Proceedings of the International Conference on Machine Learning, pages 7145–7155. PMLR, 2021b.
- GraB: Finding Provably Better Data Permutations than Random Reshuffling. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=nDemfqKHTpK.
- The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2019.04.014. URL https://www.sciencedirect.com/science/article/pii/S0169207019301128.
- Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization. arXiv preprint arXiv:2201.11066, 2022.
- Teacher–student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 31(9):3732–3740, 2019.
- Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
- Regularizing and Optimizing LSTM Language Models. In International Conference on Learning Representations, 2018.
- Random Reshuffling: Simple Analysis with Vast Improvements. In Advances in Neural Information Processing Systems, 2020.
- Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods. arXiv preprint arXiv:2202.01838, 2022.
- Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
- NVIDIA. NVIDIA Collective Communication Library, 2023. URL https://developer.nvidia.com/nccl.
- PyTorch Contributors. DataLoader API, 2023. URL https://pytorch.org/docs/stable/data.html.
- Language models are unsupervised multitask learners. OpenAI Blog, 2019.
- Permutation-Based SGD: Is Random Optimal? In International Conference on Learning Representations, 2022.
- Toward a Noncommutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences. In Conference on Learning Theory, volume 23, pages 11.1–11.24, 2012.
- Federated Optimization Algorithms with Random Reshuffling and Gradient Compression. arXiv preprint arXiv:2206.07021, 2022.
- Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
- Don’t Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations, 2018.
- Curriculum learning: A survey. International Journal of Computer Vision, pages 1–40, 2022.
- Are forecasting competitions data representative of the reality? International Journal of Forecasting, 36(1):37–53, 2020.
- Pointer Sentinel Mixture Models. In International Conference on Learning Representations, 2017.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.
- Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128, 2009.
- On the performance of random reshuffling in stochastic learning. In 2017 Information Theory and Applications Workshop (ITA), pages 1–5. IEEE, 2017.
- Decentralized training of foundation models in heterogeneous environments. Advances in Neural Information Processing Systems, 35:25464–25477, 2022.
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond. In International Conference on Learning Representations, 2021a.
- Open Problem: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD? In Conference on Learning Theory, 2021b.