- The paper challenges the validity of the empirical Fisher matrix as a proxy for second-order information in natural gradient descent.
- The analysis demonstrates that the EF approximation can distort gradient fields and lead to inefficient optimization, even on simple linear problems.
- The authors advocate reinterpreting EF-based preconditioning as a form of variance adaptation (a gradient signal-to-noise rescaling) rather than a curvature correction, and exploring adaptive preconditioning strategies built on that view.
Analysis of the Empirical Fisher Approximation's Limitations
The paper "Limitations of the Empirical Fisher Approximation" conducts an in-depth investigation into the nuanced distinctions between the empirical Fisher (EF) and the true Fisher information matrix, challenging a widely held notion in the machine learning community. The document, authored by Kunstner, Balles, and Hennig, critiques the empirical Fisher's validity for capturing second-order information and explores the theoretical underpinnings and implications of using such approximations as preconditioners in optimization algorithms.
Key Arguments and Critical Analysis
The authors present a compelling critique of the empirical Fisher's suitability as an approximation to the true Fisher matrix. They argue that although prior work has drawn parallels between the EF and approximate second-order methods, and between the diagonal EF and adaptive heuristics such as Adam, the EF lacks the theoretical grounding needed to capture second-order information. The paper systematically examines and rebuts the two primary justifications offered for using the EF:
- The Empirical Fisher as a Generalized Gauss-Newton Matrix: While the EF can technically be cast as a generalized Gauss-Newton (GGN) matrix, the authors show that this particular formulation fails to capture meaningful second-order information: the EF's construction discards the loss-curvature term that a genuine Gauss-Newton split retains (see the decomposition sketched after this list).
- Convergence to the True Fisher at Optima: The authors also challenge the assumption that the EF converges to the Fisher when the model fits the data well. They point out that this claim rests on two stringent conditions: a well-specified model and a dataset that is large relative to the number of parameters, conditions rarely met in modern applications, particularly in non-convex deep learning with over-parameterized models. Moreover, even where the argument holds, it applies only near a minimum and says little about the quality of the approximation earlier in optimization, which is when a preconditioner matters most.
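For context on the first point, a sketch of the standard Hessian/Gauss-Newton decomposition may help (notation chosen for this summary, not taken from the paper). Writing the per-example loss as a composition of the loss with the model output, with model Jacobian J_n:

```latex
% Standard decomposition of the Hessian of a composed loss ell(f_theta(x_n), y_n),
% with J_n = \partial f_\theta(x_n) / \partial\theta; the GGN keeps only the first term.
\nabla_\theta^2 \mathcal{L}(\theta)
  = \underbrace{\sum_n J_n^\top \left. \nabla_b^2 \, \ell(b, y_n) \right|_{b = f_\theta(x_n)} J_n}_{\text{generalized Gauss-Newton } G(\theta)}
  + \sum_n \sum_m \left[ \nabla_b \, \ell(b, y_n) \right]_m \nabla_\theta^2 \, [f_\theta(x_n)]_m

% For exponential-family likelihoods (e.g. softmax cross-entropy on the natural parameters),
% the GGN with this split coincides with the true Fisher. The EF, in contrast, is built purely
% from outer products of per-example gradients and contains no Hessian of the loss with
% respect to the model output:
\widetilde{F}(\theta) = \sum_n \nabla_\theta \ell_n(\theta) \, \nabla_\theta \ell_n(\theta)^\top
```

Roughly, casting the EF as a GGN requires a degenerate choice of split whose outer Hessian carries none of this loss curvature, which is the crux of the authors' objection.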
Numerical and Experimental Insights
The authors support these arguments with concrete numerical examples and experimental results that underscore the EF's inadequacies. In particular, linear regression and classification examples show that the EF can markedly distort the gradient field, producing skewed and inefficient optimization paths when it is employed as a preconditioner.
The reported experiments show considerable disparities in step sizes and optimization trajectories between EF-preconditioned updates and true natural gradient descent (NGD) updates. Moreover, the authors show that when the relevant conditions (a well-specified model, sufficient data) are not met, the EF not only fails to approximate the Fisher but can also produce badly scaled, strongly suboptimal preconditioning; a toy sketch of this effect follows.
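To make the scaling issue concrete, here is a minimal, self-contained toy sketch in the spirit of the paper's linear-regression illustrations; the data, dimensions, and damping constant below are invented for illustration and are not taken from the paper. For a Gaussian linear model with unit noise, the true Fisher is simply X^T X, while the EF weights each example by its squared residual, so far from the optimum the two preconditioned directions can differ substantially in both angle and length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (invented for illustration): y = X @ theta_true + Gaussian noise with sigma = 1.
N, D = 50, 2
X = rng.normal(size=(N, D))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + rng.normal(size=N)

def per_example_grads(theta):
    """Per-example gradients of the negative log-likelihood (noise variance fixed to 1)."""
    residuals = X @ theta - y          # r_n = x_n^T theta - y_n
    return residuals[:, None] * X      # g_n = r_n * x_n

theta = np.zeros(D)                    # evaluate far from the optimum
G = per_example_grads(theta)           # shape (N, D)
grad = G.sum(axis=0)                   # full gradient of the loss

fisher = X.T @ X                       # true Fisher for this model (independent of theta)
emp_fisher = G.T @ G                   # empirical Fisher: sum of g_n g_n^T

damping = 1e-8                         # tiny damping so both linear solves are well-posed
ngd_dir = np.linalg.solve(fisher + damping * np.eye(D), grad)
ef_dir = np.linalg.solve(emp_fisher + damping * np.eye(D), grad)

def angle_deg(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print("angle between NGD and EF directions (deg):", angle_deg(ngd_dir, ef_dir))
print("step-length ratio |NGD| / |EF|:", np.linalg.norm(ngd_dir) / np.linalg.norm(ef_dir))
```

Running a sketch like this at a point far from the optimum typically shows a visibly rotated EF direction and a much shorter EF step, because large residuals inflate the EF; this mirrors the distortion and scaling problems the paper reports.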
Implications and Future Directions
The critique carries significant implications for practitioners who use the EF approximation across machine learning frameworks. While the practical successes of EF-based methods cannot be dismissed, the paper calls for a reassessment of their theoretical justification and cautions against adopting the EF as a curvature estimate by default.
Given these findings, the paper encourages exploring alternative readings, such as variance adaptation, which treat the EF as capturing a gradient signal-to-noise ratio in stochastic settings rather than curvature. Future research could build adaptive preconditioning strategies on this perspective, particularly in settings dominated by gradient noise rather than by ill-conditioned curvature; the sketch below illustrates the idea.
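One way to read that interpretation, sketched here in standard notation (chosen for this summary): the diagonal EF estimates the second moment of the stochastic gradient, which splits into squared mean plus variance, so dividing by its square root, as Adam- or RMSprop-style methods effectively do, rescales each coordinate by something like a signal-to-noise ratio rather than by curvature.

```latex
% Second-moment decomposition of a stochastic gradient coordinate g_i (standard identity).
\mathbb{E}[g_i^2] = \mathbb{E}[g_i]^2 + \mathrm{Var}[g_i]

% Idealized adaptive update: with v_i estimating E[g_i^2] (a diagonal-EF-like quantity),
% the effective per-coordinate step scales with a signal-to-noise ratio, not with curvature.
\theta_i \leftarrow \theta_i - \eta \,
  \frac{\mathbb{E}[g_i]}{\sqrt{\mathbb{E}[g_i]^2 + \mathrm{Var}[g_i]} + \epsilon}
```

On this reading, EF-style preconditioning damps coordinates with noisy, inconsistent gradients and takes confident steps along coordinates with consistent ones, a behaviour that can be useful without saying anything about second-order structure.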
Conclusion
The paper by Kunstner, Balles, and Hennig addresses crucial misconceptions surrounding the empirical Fisher matrix, exposing limitations that, if overlooked, could mislead both theoretical research and practical applications. By challenging these preconceptions, the authors pave the way for a more rigorous understanding of second-order approximations in machine learning and point future research towards more theoretically sound methodology.