FOSI: Hybrid First and Second Order Optimization
Abstract: Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
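To make the subspace split in the abstract concrete, the following is a minimal sketch on a small quadratic, not FOSI's actual implementation. It assumes a dense Hessian and an exact eigendecomposition via `numpy.linalg.eigh`, and it uses plain gradient descent as a stand-in for the base optimizer. The names `hybrid_step` and `quad_loss`, the choice `k=2`, and the learning rates are all hypothetical and chosen only for illustration.

```python
import numpy as np


def quad_loss(x, hess):
    """Quadratic objective f(x) = 0.5 * x^T H x."""
    return 0.5 * x @ hess @ x


def hybrid_step(x, grad, hess, k, lr):
    """One illustrative hybrid update (hypothetical helper, not the paper's code).

    Newton-like step on the span of the k eigenvectors with the largest
    absolute eigenvalues, plain gradient descent on the orthogonal complement.
    """
    eigvals, eigvecs = np.linalg.eigh(hess)          # eigenvalues in ascending order
    top = np.argsort(np.abs(eigvals))[::-1][:k]      # indices of the k extreme eigenvalues
    lam, V = eigvals[top], eigvecs[:, top]

    coeffs = V.T @ grad                              # gradient coordinates in span(V)
    g_sub = V @ coeffs                               # gradient component inside span(V)
    g_rest = grad - g_sub                            # component in the orthogonal complement

    # Assumption: scale by |lambda| so the step stays a descent direction
    # even if an eigenvalue is negative.
    newton_part = V @ (coeffs / np.abs(lam))
    return x - newton_part - lr * g_rest


# Ill-conditioned quadratic with eigenvalues 100, 10, 1, 0.5 in a rotated basis.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Q @ np.diag([100.0, 10.0, 1.0, 0.5]) @ Q.T
H = 0.5 * (H + H.T)                                  # remove round-off asymmetry

x0 = np.ones(4)
x_one = hybrid_step(x0, H @ x0, H, k=2, lr=0.5)      # a single hybrid step

x_h, x_gd = x0.copy(), x0.copy()
for _ in range(20):
    x_h = hybrid_step(x_h, H @ x_h, H, k=2, lr=0.5)  # lr limited only by the remaining eigenvalue 1
    x_gd = x_gd - 0.01 * (H @ x_gd)                  # lr limited by the largest eigenvalue 100

print("hybrid loss:", quad_loss(x_h, H))
print("gradient descent loss:", quad_loss(x_gd, H))
```

In this toy setting the second-order step removes the error along the two high-curvature directions in one update, so the first-order part only sees the better-conditioned remainder and can use a much larger learning rate. A practical method would avoid forming or decomposing the full Hessian; the dense decomposition here is purely for brevity.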