Inverse-Free Fast Natural Gradient Descent Method for Deep Learning (2403.03473v2)
Abstract: Second-order optimization techniques have the potential to converge faster than first-order methods by incorporating second-order derivatives or statistics, but their use in deep learning is limited by their computational cost. Various approaches have been proposed to address this issue, primarily by minimizing the size of the matrix to be inverted; nevertheless, the inverse operation still has to be performed at every iteration. In this work, we present a fast natural gradient descent (FNGD) method that requires inversion only during the first epoch. Specifically, we show that natural gradient descent (NGD) is essentially a weighted sum of per-sample gradients. We further propose to share these weighting coefficients across epochs without affecting empirical performance. Consequently, FNGD resembles the uniformly averaged sum of per-sample gradients used in first-order methods, and its computational complexity is comparable to that of first-order methods. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD achieves a 2.07$\times$ speedup over KFAC. For training a Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU points while requiring almost the same training time.
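The weighted-sum view of NGD that underlies FNGD can be illustrated with a small sketch. Assuming a damped empirical-Fisher preconditioner $F = \lambda I + \frac{1}{N} J J^\top$, where the columns of $J$ are the $N$ per-sample gradients and $\bar{g} = \frac{1}{N} J \mathbf{1}$ is the mini-batch gradient, the Woodbury identity gives $F^{-1}\bar{g} = J w$ with $w = (N\lambda I + J^\top J)^{-1}\mathbf{1}$, i.e., a weighted sum of per-sample gradients whose weights come from a small $N \times N$ solve. The sketch below computes such weights once and then reuses them, mirroring the coefficient-sharing idea; the helper names (`per_sample_grads`, `compute_weights`, `fngd_step`) and the layer-free toy setup are illustrative assumptions, not the paper's per-layer implementation.

```python
# Minimal sketch (not the authors' code) of the weighted-sum view behind FNGD:
#   (lambda*I + (1/N) J J^T)^{-1} * mean(g_i) = J @ w,  w = (N*lambda*I + J^T J)^{-1} 1,
# where the columns of J are per-sample gradients.
import torch

def per_sample_grads(params, loss_fn, xs, ys):
    """Return an (N, d) matrix whose rows are per-sample gradients."""
    rows = []
    for x, y in zip(xs, ys):
        loss = loss_fn(params, x, y)
        (g,) = torch.autograd.grad(loss, params)
        rows.append(g.detach().flatten())
    return torch.stack(rows)                      # J^T, shape (N, d)

def compute_weights(J_t, damping):
    """Solve (N*damping*I + J^T J) w = 1 for the per-sample weights."""
    N = J_t.shape[0]
    gram = J_t @ J_t.T                            # N x N Gram matrix of per-sample gradients
    rhs = torch.ones(N, 1)
    w = torch.linalg.solve(N * damping * torch.eye(N) + gram, rhs)
    return w.squeeze(1)                           # shape (N,)

def fngd_step(params, J_t, w, lr):
    """Update with a weighted sum of per-sample gradients: same cost as an SGD step."""
    direction = (w[:, None] * J_t).sum(dim=0)     # J @ w, flattened
    with torch.no_grad():
        params -= lr * direction.view_as(params)

# Toy usage: linear regression on random data.
torch.manual_seed(0)
N, d = 32, 10
xs, ys = torch.randn(N, d), torch.randn(N)
theta = torch.zeros(d, requires_grad=True)
loss_fn = lambda p, x, y: 0.5 * (x @ p - y) ** 2

J_t = per_sample_grads(theta, loss_fn, xs, ys)
w = compute_weights(J_t, damping=1e-1)            # "first epoch": one small N x N solve
for _ in range(5):                                # later steps: reuse w, no inversion
    J_t = per_sample_grads(theta, loss_fn, xs, ys)
    fngd_step(theta, J_t, w, lr=0.1)
```

After the initial solve, each step only forms the weighted combination $J w$, which is the sense in which FNGD's per-iteration cost matches that of first-order methods.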
- Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
- A mini-block Fisher method for deep neural networks. arXiv preprint arXiv:2202.04124, 2022.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7):2121–2159, 2011.
- Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. Advances in Neural Information Processing Systems, 31, 2018.
- Deep Learning. MIT Press, 2016.
- A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582. PMLR, 2016.
- Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
- William W Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. Lecture slides, 2012.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- A tutorial on Fisher information. Journal of Mathematical Psychology, 80:40–55, 2017.
- Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417. PMLR, 2015.
- Jorge J Moré. The Levenberg-Marquardt algorithm: Implementation and theory. In Numerical Analysis: Proceedings of the Biennial Conference held at Dundee, June 28–July 1, 1977, pages 105–116. Springer, 2006.
- HyLo: A hybrid low-rank natural gradient descent method. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2022.
- Numerical Optimization. Springer, 1999.
- Convolutional neural network training with distributed K-FAC. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2020.
- Efficient subsampled Gauss-Newton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353, 2019.
- A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- SKFAC: Training neural networks with faster Kronecker-factored approximate curvature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13479–13487, 2021.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- AdaHessian: An adaptive second order optimizer for machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10665–10673, 2021.
- Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
- Eva: A general vectorized approximation framework for second-order optimization. arXiv preprint arXiv:2308.02123, 2023.