- The paper introduces EKFAC, which refines the KFAC approximation to natural gradient descent by applying a diagonal rescaling in a Kronecker-factored eigenbasis.
- It demonstrates improved approximation of the Fisher Information Matrix, reducing error in terms of the Frobenius norm compared to KFAC.
- EKFAC achieves faster convergence and enhanced computational efficiency across neural architectures like deep autoencoders and convolutional networks.
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis
The paper "Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis" presents a novel approach to optimizing neural network training by enhancing existing methods that utilize second-order information. The authors propose the Eigenvalue-corrected Kronecker Factorization (EKFAC), which builds on Kronecker-Factored Approximate Curvature (KFAC) by correcting the eigenvalues in the Kronecker-factored eigenbasis, offering a finer approximation of the Fisher Information Matrix (FIM).
Background
Natural Gradient Descent uses the Fisher Information Matrix to account for the local curvature of the model in parameter space, which can substantially improve optimization. However, the FIM is a square matrix whose side equals the number of parameters, so storing and inverting it directly is intractable for large networks. Various approximations, including KFAC, make the computation feasible: KFAC approximates each layer's block of the FIM as the Kronecker product of two much smaller matrices (built from the second moments of the layer's input activations and of the backpropagated gradients), which can be inverted cheaply. These simplifications, however, may not fully capture the second-order curvature of the problem.
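As a rough, self-contained illustration (not the authors' implementation), the sketch below builds a KFAC-style preconditioner for a single fully connected layer from the two small factors. The random statistics, the shapes, and the simple Tikhonov damping are assumptions made for the example.

```python
import numpy as np

# Single fully connected layer with weight matrix W of shape (d_out, d_in).
rng = np.random.default_rng(0)
n, d_in, d_out = 512, 64, 32
a = rng.normal(size=(n, d_in))    # layer inputs (activations), one row per example
g = rng.normal(size=(n, d_out))   # backpropagated gradients at the layer's output

A = a.T @ a / n                   # (d_in, d_in) Kronecker factor from activations
B = g.T @ g / n                   # (d_out, d_out) Kronecker factor from gradients
grad_W = g.T @ a / n              # (d_out, d_in) minibatch gradient w.r.t. W

# KFAC treats this layer's Fisher block as A ⊗ B. Using the identity
# (A ⊗ B)^{-1} vec(G) = vec(B^{-1} G A^{-1}) for symmetric A and B, the full
# block is never formed explicitly; only the two small factors are inverted.
damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
B_inv = np.linalg.inv(B + damping * np.eye(d_out))
precond_grad = B_inv @ grad_W @ A_inv   # approximate natural-gradient direction
```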
Contributions
EKFAC works in the eigenbasis of the Kronecker-factored approximation rather than directly in the parameter coordinates. This shift allows the variance of the gradient to be tracked with a simple diagonal estimate in a basis that is potentially better suited to optimization. The authors argue that the resulting rescaling of gradients along this eigenbasis yields a better preconditioner than KFAC.
Key contributions of EKFAC include:
- Improved Approximation: EKFAC applies a diagonal rescaling in the Kronecker-factored eigenbasis, yielding a better approximation of the FIM than KFAC in terms of Frobenius norm.
- Computational Efficiency: Re-estimating the diagonal scaling is cheap, so it can be refreshed frequently (even at every update), while the expensive recomputation of the Kronecker-factored eigenbasis is amortized over many steps (a minimal sketch follows this list).
- Proven Advantage: The paper shows that, within the Kronecker-factored eigenbasis, EKFAC's diagonal scaling minimizes the Frobenius-norm approximation error, so its error is never larger than KFAC's.
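Continuing the assumptions of the earlier KFAC sketch (synthetic statistics, one fully connected layer, simple damping), the NumPy sketch below illustrates the eigenvalue correction; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

# EKFAC keeps the eigenvectors of the Kronecker factors but re-fits the diagonal
# scaling in that eigenbasis from per-example gradients.
rng = np.random.default_rng(1)
n, d_in, d_out = 512, 64, 32
a = rng.normal(size=(n, d_in))    # layer inputs, one row per example
g = rng.normal(size=(n, d_out))   # backpropagated gradients at the layer's output

A = a.T @ a / n
B = g.T @ g / n

# Eigenbases of the two small factors; this is the expensive step and would be
# recomputed only periodically (amortized), not at every iteration.
_, U_A = np.linalg.eigh(A)
_, U_B = np.linalg.eigh(B)

# Per-example gradients of W are outer products g_n a_n^T; project them into
# the Kronecker-factored eigenbasis.
per_example_grads = g[:, :, None] * a[:, None, :]                   # (n, d_out, d_in)
projected = np.einsum('oi,nij,jk->nok', U_B.T, per_example_grads, U_A)

# Cheap "eigenvalue correction": second moment of each coordinate in the
# eigenbasis (in practice a running average across minibatches).
s = (projected ** 2).mean(axis=0)                                    # (d_out, d_in)

# Precondition the minibatch gradient in the eigenbasis, then map back.
grad_W = per_example_grads.mean(axis=0)
damping = 1e-3
grad_eig = U_B.T @ grad_W @ U_A
precond_grad = U_B @ (grad_eig / (s + damping)) @ U_A.T
```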
Findings and Future Outlook
Experimental evaluations show that EKFAC optimizes faster than KFAC across several neural architectures, including deep autoencoders and convolutional networks, with consistent gains in per-epoch progress and favorable computational cost thanks to the amortized eigendecompositions.
These findings suggest a future trajectory where gradient methods in machine learning increasingly incorporate more refined approximations of second-order information. As the scale and complexity of models grow, techniques like EKFAC that strike a balance between precision and computational tractability are expected to become central in training state-of-the-art models. Potential future work includes adaptation of other component-wise adaptive algorithms to the eigenbasis used by EKFAC and exploration of alternative strategies for obtaining the eigenbasis. Moreover, refining hyperparameter tuning, especially for damping, could further enhance the robustness and applicability of EKFAC in broader contexts.
Therefore, while EKFAC significantly advances current methodologies, ongoing research and development will be crucial in harnessing its full potential across diverse and increasingly complex machine learning applications.