A Retrospective Approximation Approach for Smooth Stochastic Optimization (2103.04392v3)
Abstract: Stochastic Gradient (SG) is the de facto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective $f$ and a stochastic first-order oracle. SG's attractiveness is due in part to its simplicity of executing a single step along the negative subsampled gradient direction to update the incumbent iterate. In this paper, we question SG's choice of executing a single step as opposed to multiple steps between subsample updates. Our investigation leads naturally to generalizing SG into Retrospective Approximation (RA) where, during each iteration, a "deterministic solver" executes possibly multiple steps on a subsampled deterministic problem and stops when further solving is deemed unnecessary from the standpoint of statistical efficiency. RA thus rigorizes what is appealing for implementation -- during each iteration, "plug in" a solver, e.g., L-BFGS line search or Newton-CG, as is, and solve only to the extent necessary. We develop a complete theory using relative error of the observed gradients as the principal object, demonstrating that almost sure and $L_1$ consistency of RA are preserved under especially weak conditions when sample sizes are increased at appropriate rates. We also characterize the iteration and oracle complexity (for linear and sub-linear solvers) of RA, and identify a practical termination criterion leading to optimal complexity rates. To subsume non-convex $f$, we present a certain "random central limit theorem" that incorporates the effect of curvature across all first-order critical points, demonstrating that the asymptotic behavior is described by a certain mixture of normals. The message from our numerical experiments is that the ability of RA to incorporate existing second-order deterministic solvers in a strategic manner might be important from the standpoint of dispensing with hyper-parameter tuning.
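The RA scheme sketched in the abstract — fix a subsample, hand the resulting deterministic problem to a plug-in solver, stop the inner solve early, then grow the sample — can be illustrated in a few lines. The following is a minimal sketch on a toy one-dimensional problem; the geometric sample-size and tolerance schedules, the plain gradient-descent inner solver, and the gradient-norm stopping rule are illustrative assumptions, not the paper's exact prescription.

```python
import numpy as np

# Retrospective Approximation (RA), minimal sketch. Outer iteration k:
#   1. draw a fresh subsample of size n_k (geometrically increasing),
#   2. solve the *deterministic* sample-average problem with a plug-in
#      solver, warm-started at the previous iterate,
#   3. stop the inner solve once the subsampled gradient norm falls
#      below a tolerance eps_k (geometrically decreasing).
# Toy problem: minimize E[(x - xi)^2 / 2] with xi ~ N(3, 1); minimizer x* = 3.

rng = np.random.default_rng(0)

def sample_grad(x, xi):
    # Gradient of the sample-average objective (1/n) * sum (x - xi_i)^2 / 2.
    return np.mean(x - xi)

def inner_solver(x, xi, eps, step=0.5, max_steps=1000):
    # Plug-in deterministic solver on the fixed subsample (gradient descent
    # here; L-BFGS or Newton-CG would slot in the same way).
    for _ in range(max_steps):
        g = sample_grad(x, xi)
        if abs(g) <= eps:          # inexact inner stopping rule
            break
        x = x - step * g
    return x

def retrospective_approximation(x0, outer_iters=8, n0=16, eps0=0.5):
    x = x0
    for k in range(outer_iters):
        n_k = n0 * 2**k            # assumed geometric sample-size schedule
        eps_k = eps0 * 0.5**k      # assumed geometric tolerance schedule
        xi = 3.0 + rng.standard_normal(n_k)   # fresh subsample
        x = inner_solver(x, xi, eps_k)        # warm-start from previous iterate
    return x

x_star = retrospective_approximation(x0=0.0)
print(x_star)  # close to the true minimizer 3.0
```

The key contrast with SG is visible in `inner_solver`: between subsample updates the iterate takes possibly many steps on the same fixed subsample, solving only to accuracy `eps_k` before the next, larger subsample is drawn.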
- Alexanderian A (2015) A brief note on the Karhunen–Loève expansion.
- Bertsekas DP (2019) Reinforcement Learning and Optimal Control (Athena Scientific).
- Billingsley P (1995) Probability and Measure (New York, NY: Wiley).
- Cartis C, Scheinberg K (2018) Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming 169(2):337–375.
- Chen H, Schmeiser BW (2001) Stochastic root finding via retrospective approximation. IIE Transactions 33:259–275.
- Delyon B (2009) Exponential inequalities for sums of weakly dependent variables. Electronic Journal of Probability 14:752–779.
- Deng G, Ferris MC (2009) Variable-number sample-path optimization. Mathematical Programming 117:81–109.
- Dereich S, Kassing S (2021) Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes. arXiv preprint arXiv:2102.09385.
- Homem-de-Mello T (2003) Variable-sample methods for stochastic optimization. ACM Transactions on Modeling and Computer Simulation 13:108–133.
- Huber PJ (1992) Robust estimation of a location parameter. Breakthroughs in statistics, 492–518 (Springer).
- Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems 26.
- Khaled A, Richtárik P (2020) Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329.
- Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lan G (2011) An optimal method for stochastic composite optimization. Mathematical Programming 133(1-2):365–397.
- Lydia A, Francis S (2019) AdaGrad: An optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci 6(5):566–568.
- Mokhtari A, Ribeiro A (2015) Global convergence of online limited memory BFGS. Journal of Machine Learning Research 16(1):3151–3181.
- Nelson BL (2013) Foundations and Methods of Stochastic Simulation: A First Course (New York, NY: Springer).
- Pasupathy R (2010) On choosing parameters in retrospective-approximation algorithms for stochastic root finding and simulation optimization. Operations Research 58:889–901.
- Pasupathy R, Song Y (2021) An adaptive sequential sample average approximation framework for solving two-stage stochastic programs. SIAM Journal on Optimization 31(1):1017–1048.
- Polak E, Royset J (2008) Efficient sample sizes in stochastic nonlinear programming. Journal of Computational and Applied Mathematics 217(2):301–310.
- Polyak BT (1963) Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics 3(4):864–878.
- Polyak BT, Juditsky AB (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4):838–855.
- Robbins H, Monro S (1951) A stochastic approximation method. Annals of Mathematical Statistics 22:400–407.
- Rubinstein RY, Shapiro A (1990) Optimization of static simulation models by the score function method. Mathematics and Computers in Simulation 32:373–392.
- Rustagi JS (2014) Optimization Techniques in Statistics (Elsevier).
- Shapiro A (1991) Asymptotic analysis of stochastic programs. Annals of Operations Research 30:169–186.
- Shapiro A (1993) Asymptotic behavior of optimal solutions in stochastic programming. Mathematics of Operations Research 18(4):829–845.
- Shapiro A, Kleywegt A (2001) Minimax analysis of stochastic problems. Optimization Methods and Software 17:523–542.
- Stefanski LA, Boos DD (2002) The calculus of m-estimation. The American Statistician 56(1):29–38.
- Talagrand M (1996) New concentration inequalities in product spaces. Inventiones mathematicae 126(3):505–563.
- Trosset M (2009) An Introduction to Statistical Inference and Its Applications with R (New Jersey: CRC/Chapman & Hall).
- Vapnik VN (1995) The Nature of Statistical Learning Theory (New York: Springer).
- Vershynin R (2018) High-Dimensional Probability: An Introduction with Applications in Data Science (Cambridge Series in Statistical and Probabilistic Mathematics).