Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function (2310.11866v1)
Abstract: Trust-region (TR) and adaptive regularization using cubics (ARC) have proven to possess very appealing theoretical properties for non-convex optimization, concurrently computing the function value, gradient, and Hessian matrix to obtain the next search direction and to adjust the algorithmic parameters. Although stochastic approximations greatly reduce the computational cost, it is challenging to theoretically guarantee the convergence rate. In this paper, we explore a family of stochastic TR and ARC methods that simultaneously allow inexact computations of the Hessian matrix, gradient, and function values. Our algorithms require much less propagation overhead per iteration than TR and ARC. We prove that the iteration complexity to achieve $\epsilon$-approximate second-order optimality is of the same order as that of the exact computations demonstrated in previous studies. Additionally, the mild conditions on inexactness can be met by leveraging random sampling in finite-sum minimization problems. Numerical experiments on a non-convex problem support these findings and demonstrate that, with the same or a similar number of iterations, our algorithms require less computational overhead per iteration than current second-order methods.
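The abstract notes that the inexactness conditions can be met by random sampling in finite-sum minimization. The sketch below (not the authors' code; the component oracles, `sample_size`, and helper names are illustrative assumptions) shows how sub-sampled estimates of the function value, gradient, and Hessian of $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ could be formed from a uniformly drawn mini-batch of components.

```python
import numpy as np


def subsampled_oracles(component_value, component_grad, component_hess,
                       x, n, sample_size, rng=None):
    """Inexact oracles for f(x) = (1/n) * sum_i f_i(x) via uniform sub-sampling.

    component_value(i, x) -> float, component_grad(i, x) -> (d,) array,
    component_hess(i, x) -> (d, d) array are assumed per-component oracles.
    Returns sub-sampled estimates of the function value, gradient, and Hessian.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(n, size=sample_size, replace=False)  # uniform sample of component indices
    f_est = float(np.mean([component_value(i, x) for i in idx]))
    g_est = np.mean([component_grad(i, x) for i in idx], axis=0)
    H_est = np.mean([component_hess(i, x) for i in idx], axis=0)
    return f_est, g_est, H_est
```

Larger sample sizes tighten the (probabilistic) accuracy of these estimates at higher per-iteration cost, which is the trade-off the inexactness conditions in such analyses are meant to control.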