
A Retrospective Approximation Approach for Smooth Stochastic Optimization (2103.04392v3)

Published 7 Mar 2021 in math.OC and stat.ML

Abstract: Stochastic Gradient (SG) is the de facto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective $f$ and a stochastic first-order oracle. SG's attractiveness is due in part to its simplicity of executing a single step along the negative subsampled gradient direction to update the incumbent iterate. In this paper, we question SG's choice of executing a single step as opposed to multiple steps between subsample updates. Our investigation leads naturally to generalizing SG into Retrospective Approximation (RA) where, during each iteration, a "deterministic solver" executes possibly multiple steps on a subsampled deterministic problem and stops when further solving is deemed unnecessary from the standpoint of statistical efficiency. RA thus rigorizes what is appealing for implementation -- during each iteration, "plug in" a solver, e.g., L-BFGS line search or Newton-CG, as is, and solve only to the extent necessary. We develop a complete theory using relative error of the observed gradients as the principal object, demonstrating that almost sure and $L_1$ consistency of RA are preserved under especially weak conditions when sample sizes are increased at appropriate rates. We also characterize the iteration and oracle complexity (for linear and sub-linear solvers) of RA, and identify a practical termination criterion leading to optimal complexity rates. To subsume non-convex $f$, we present a certain "random central limit theorem" that incorporates the effect of curvature across all first-order critical points, demonstrating that the asymptotic behavior is described by a certain mixture of normals. The message from our numerical experiments is that the ability of RA to incorporate existing second-order deterministic solvers in a strategic manner might be important from the standpoint of dispensing with hyper-parameter tuning.
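As a rough illustration of the RA template described above, the sketch below is a minimal Python implementation under assumptions not taken from the paper: a toy noisy quadratic objective, a geometric sample-size schedule, and SciPy's L-BFGS-B as the "plugged-in" deterministic solver. The solver tolerance at each outer iteration is tied to the sampling error (roughly $1/\sqrt{m_k}$), in the spirit of solving each subsampled problem only "to the extent necessary"; none of the specific constants or the termination rule here should be read as the paper's prescription.

# Minimal sketch of Retrospective Approximation (RA): at each outer iteration,
# draw a fresh subsample, solve the resulting deterministic (sample-average)
# problem with an off-the-shelf solver up to a tolerance matched to the
# sampling error, warm-start the next iteration, and grow the sample size.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d = 5                        # problem dimension (toy example)
x_star = np.ones(d)          # minimizer of the noiseless objective

def saa_value_and_grad(x, samples):
    # Sample-average objective f_m(x) = (1/m) * sum_i 0.5 * ||x - x_star + eps_i||^2
    diffs = x - x_star + samples                 # shape (m, d)
    value = 0.5 * np.mean(np.sum(diffs**2, axis=1))
    grad = np.mean(diffs, axis=0)
    return value, grad

def retrospective_approximation(x0, n_outer=8, m0=16, growth=2.0):
    x = np.asarray(x0, dtype=float)
    m = m0
    for _ in range(n_outer):
        samples = rng.normal(scale=0.5, size=(int(m), d))  # fresh subsample
        tol = 1.0 / np.sqrt(m)   # solve only down to the statistical error level
        res = minimize(
            lambda z: saa_value_and_grad(z, samples),
            x, jac=True, method="L-BFGS-B",
            options={"gtol": tol, "maxiter": 50},
        )
        x = res.x                # warm-start the next subsampled problem
        m *= growth              # increase sample size geometrically
    return x

x_hat = retrospective_approximation(np.zeros(d))
print("distance to true minimizer:", np.linalg.norm(x_hat - x_star))

The key design choice mirrored here is the coupling between inner-solver accuracy and sample size: solving a cheap, small-sample problem to high precision wastes oracle calls, while the growing schedule lets a second-order solver such as L-BFGS do the heavy lifting without per-problem step-size tuning.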
