SOFIM: Stochastic Optimization Using Regularized Fisher Information Matrix (2403.02833v2)
Abstract: This paper introduces SOFIM, a new stochastic optimization method based on the regularized Fisher information matrix (FIM), which efficiently uses the FIM to approximate the Hessian when computing Newton-type gradient updates in large-scale stochastic optimization of machine learning models. SOFIM can be viewed as a variant of natural gradient descent in which the challenge of storing and inverting the full FIM is addressed by using a regularized FIM and obtaining the update direction directly via the Sherman-Morrison matrix inversion formula. In addition, like the popular Adam method, SOFIM uses the first moment of the gradient to handle non-stationary objectives across mini-batches caused by heterogeneous data. The combination of the regularized FIM and Sherman-Morrison inversion yields an improved convergence rate while keeping the same space and time complexity as stochastic gradient descent (SGD) with momentum. Extensive experiments on training deep learning models on several benchmark image classification datasets show that SOFIM outperforms SGD with momentum and several state-of-the-art Newton-type optimization methods in terms of the convergence speed needed to reach pre-specified training and test losses as well as test accuracy.
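To make the update rule described in the abstract concrete, below is a minimal NumPy sketch of a SOFIM-style step. It assumes the regularized FIM is approximated by the rank-one outer product of the mini-batch gradient plus a damping term rho*I, so that the Sherman-Morrison formula yields the preconditioned direction in O(d) time and memory, and that the momentum update mirrors the first-moment estimate mentioned in the abstract. The function name `sofim_update` and the hyperparameter names (`lr`, `beta`, `rho`) are illustrative and not taken from the paper.

```python
import numpy as np

def sofim_update(params, grad, m, lr=0.1, beta=0.9, rho=1.0):
    """One SOFIM-style parameter update (illustrative sketch, not the paper's pseudocode).

    Assumptions:
      - the regularized FIM is approximated as F ~= g g^T + rho * I,
        built from the current mini-batch gradient g,
      - the Newton-like direction is F^{-1} m, where m is the exponential
        moving average (first moment) of the gradient,
      - Sherman-Morrison gives F^{-1} m using only vector operations.
    """
    # First moment of the gradient (as in Adam / SGD with momentum).
    m = beta * m + (1.0 - beta) * grad

    # Sherman-Morrison: (rho*I + g g^T)^{-1} m
    #   = m / rho - g * (g^T m) / (rho * (rho + g^T g))
    gTg = grad @ grad
    gTm = grad @ m
    direction = m / rho - grad * (gTm / (rho * (rho + gTg)))

    # Step along the preconditioned (Newton-like) direction.
    params = params - lr * direction
    return params, m


if __name__ == "__main__":
    # Toy usage on a least-squares objective 0.5 * ||A w - b||^2 / n.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    b = rng.normal(size=50)
    w = np.zeros(10)
    m = np.zeros(10)
    for _ in range(200):
        grad = A.T @ (A @ w - b) / len(b)
        w, m = sofim_update(w, grad, m, lr=0.5, beta=0.9, rho=1.0)
    print("final loss:", 0.5 * np.mean((A @ w - b) ** 2))
```

Because the approximate FIM is a rank-one matrix plus a scaled identity, the sketch never materializes a d-by-d matrix, which is consistent with the abstract's claim of SGD-with-momentum space and time complexity.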