Ordering for Non-Replacement SGD (2306.15848v1)
Abstract: One approach to reducing the run time and improving the efficiency of machine learning is to improve the convergence rate of the optimization algorithm used. Shuffling is an algorithmic technique widely used in machine learning, but it has only begun to receive theoretical attention in recent years. Given the different convergence rates established for random reshuffling and incremental gradient descent, we seek an ordering that improves the convergence rate of the non-replacement form of the algorithm. Starting from existing bounds on the distance between the current iterate and the optimum, we derive an upper bound that depends on the gradients at the beginning of the epoch. By analyzing this bound, we develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions. We further test and verify our results through experiments on synthetic and real data sets. In addition, we combine the ordering with mini-batching and apply it to more complex neural networks, with promising results.
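To make the setting concrete, the sketch below shows one epoch of non-replacement SGD in which the data ordering is recomputed from the per-example gradients at the start of the epoch, as described in the abstract. The specific sort criterion used here (decreasing gradient norm) and the helper names (`ordered_sgd_epoch`, `grad_fn`) are illustrative assumptions, not the paper's derived rule.

```python
import numpy as np

def ordered_sgd_epoch(w, X, y, grad_fn, lr):
    """One epoch of non-replacement (incremental) SGD with an ordering
    computed from per-example gradients at the epoch's initial iterate.

    grad_fn(w, x_i, y_i) returns the gradient of the i-th component loss.
    The sort criterion below is an illustrative placeholder; the paper
    derives the optimal orderings for convex / strongly convex objectives.
    """
    n = len(y)
    # Per-example gradients at the beginning of the epoch.
    grads = [grad_fn(w, X[i], y[i]) for i in range(n)]
    # Order examples by decreasing gradient norm (assumed criterion).
    order = np.argsort([-np.linalg.norm(g) for g in grads])
    # Non-replacement pass: each example is visited exactly once per epoch.
    for i in order:
        w = w - lr * grad_fn(w, X[i], y[i])
    return w
```

A mini-batch variant of the same idea would compute the ordering over batch gradients at the start of the epoch and then sweep the batches once, without replacement.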