Derivatives of Stochastic Gradient Descent in parametric optimization (2405.15894v2)
Abstract: We consider stochastic optimization problems where the objective depends on a parameter, as is common in hyperparameter optimization, for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(\log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks.
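To make the setting concrete, here is a minimal JAX sketch (not code from the paper; the objective, the names `sample_loss`, `sgd_last_iterate`, `solution_map`, and all constants are illustrative assumptions). It differentiates the last SGD iterate of a parametric ridge-type objective with respect to the regularization parameter by backpropagating through the unrolled SGD loop, and compares the result with the derivative of the closed-form solution mapping.

```python
# Sketch: derivative of SGD iterates w.r.t. a parameter, compared with the
# derivative of the solution mapping. All names and constants are assumptions.
import jax
import jax.numpy as jnp

n, d = 200, 5
A = jax.random.normal(jax.random.PRNGKey(0), (n, d))
y = jax.random.normal(jax.random.PRNGKey(1), (n,))

def sample_loss(x, theta, i):
    # Per-sample loss of a ridge-type objective; theta is the parameter.
    resid = A[i] @ x - y[i]
    return 0.5 * resid**2 + 0.5 * theta * jnp.sum(x**2)

def sgd_last_iterate(theta, n_steps=2000, step=0.01, seed=0):
    # Plain constant-step SGD; the whole loop is differentiable, so JAX
    # propagates derivatives w.r.t. theta through every iterate.
    idx = jax.random.randint(jax.random.PRNGKey(seed), (n_steps,), 0, n)
    x0 = jnp.zeros(d)

    def body(x, i):
        g = jax.grad(sample_loss)(x, theta, i)   # stochastic gradient in x
        return x - step * g, None

    x_last, _ = jax.lax.scan(body, x0, idx)
    return x_last

def solution_map(theta):
    # Closed-form minimizer of the averaged objective (ridge regression).
    H = A.T @ A / n + theta * jnp.eye(d)
    return jnp.linalg.solve(H, A.T @ y / n)

theta0 = 0.1
d_sgd = jax.jacrev(sgd_last_iterate)(theta0)   # derivative of the SGD iterate
d_star = jax.jacrev(solution_map)(theta0)      # derivative of the solution mapping
print(jnp.linalg.norm(d_sgd - d_star))         # small once SGD has (nearly) converged
```

Under these assumptions, with a constant step-size the printed gap should remain of the order of a noise ball around the solution derivative rather than vanish, in line with the constant step-size result stated in the abstract.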