Empirical Fisher Matrix Approximation
- Empirical Fisher Matrix Approximation is a data-driven technique that estimates the Fisher Information Matrix using observed gradients to enable scalable optimization in high-dimensional settings.
- It leverages methods such as Monte Carlo estimators, Kronecker factorization, and adaptive approaches like iEF and Squisher to address bias and computational challenges.
- These approximations are crucial for improving natural gradient descent, preconditioning, and diagnostics in large-scale deep neural network training.
An empirical Fisher matrix approximation refers to a collection of data-driven, computational procedures for estimating the Fisher Information Matrix (FIM) in statistical inference and machine learning. These approximations are central to optimizing high-dimensional models—including deep neural networks—where evaluating or inverting the exact FIM is often computationally infeasible. Modern empirical Fisher approximations range from classical moment-based estimators to advanced structured approximations (e.g., Kronecker factorization) and adaptive optimizations. This article details key methods, their mathematical formulations, known limitations, and advances in scalable, accurate empirical Fisher matrix approximations, with an emphasis on technically rigorous approaches.
1. Mathematical Framework and Standard Definitions
Let $\{(x_n, y_n)\}_{n=1}^{N}$ denote data sampled from a probability model $p(y \mid x, \theta)$, with parameter $\theta \in \mathbb{R}^d$. The Fisher Information Matrix at $\theta$ is defined by the expected outer product of score vectors:

$$F(\theta) = \mathbb{E}_{x}\,\mathbb{E}_{y \sim p(\cdot \mid x, \theta)}\!\left[\nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^{\top}\right].$$

Alternatively, under regularity conditions, it coincides with the negative expected Hessian:

$$F(\theta) = -\,\mathbb{E}_{x}\,\mathbb{E}_{y \sim p(\cdot \mid x, \theta)}\!\left[\nabla^2_\theta \log p(y \mid x, \theta)\right].$$

The empirical Fisher (EF) estimator replaces the population expectations with empirical averages and often uses observed (not model-sampled) labels:

$$\tilde{F}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log p(y_n \mid x_n, \theta)\, \nabla_\theta \log p(y_n \mid x_n, \theta)^{\top}.$$

This formulation underpins the bulk of empirical Fisher approximation schemes used for preconditioning, natural gradient descent, and information-geometry-based optimization (Coulton et al., 2023, Koroko et al., 2022, Wu et al., 2024, Kunstner et al., 2019).
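The EF estimator just defined can be sketched in a few lines of numpy. The following uses a toy unconditional Gaussian location model (an illustrative assumption, chosen because its exact Fisher is known in closed form) rather than a supervised model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x ~ N(theta, sigma^2) per coordinate, so the per-sample score
# is (x - theta) / sigma^2 and the true Fisher is (1 / sigma^2) * I.
theta = np.array([1.0, -2.0])
sigma = 0.5
N = 200_000
x = rng.normal(theta, sigma, size=(N, 2))

scores = (x - theta) / sigma**2      # per-sample score vectors
F_emp = scores.T @ scores / N        # empirical Fisher: average outer product
F_true = np.eye(2) / sigma**2        # here diag(4, 4)
```

With this many samples, `F_emp` lands close to `F_true`; with observed labels in a supervised model the same average of score outer products gives the EF matrix from the formula above.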
2. Monte Carlo and Nonparametric Estimators
In settings where analytic computation of $F(\theta)$ is intractable, Monte Carlo estimators form the basis of empirical Fisher approximation. For a batch of $N$ simulated or observed data points with estimated score vectors $\hat{s}_1, \ldots, \hat{s}_N$, the standard Monte Carlo estimator is:

$$\hat{F}_{+} = \frac{1}{N} \sum_{i=1}^{N} \hat{s}_i \hat{s}_i^{\top}.$$

However, $\hat{F}_{+}$ is systematically biased upward due to Monte Carlo noise in the estimated score vectors, with a bias that scales inversely in the number of simulations, and leads to overestimated information and underconservative confidence intervals (Coulton et al., 2023). An alternative, mean-centered covariance estimator,

$$\hat{F}_{-} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{s}_i - \bar{s}\right)\left(\hat{s}_i - \bar{s}\right)^{\top}, \qquad \bar{s} = \frac{1}{N} \sum_{i=1}^{N} \hat{s}_i,$$

is negatively biased to the same order. Combining these two via the geometric mean cancels the leading-order bias:

$$\hat{F}_{\mathrm{gm}} = \hat{F}_{+} \,\#\, \hat{F}_{-}, \qquad A \,\#\, B = A^{1/2}\!\left(A^{-1/2} B A^{-1/2}\right)^{1/2} A^{1/2},$$

where $\#$ denotes the positive definite geometric mean. Empirically, $\hat{F}_{\mathrm{gm}}$ converges much faster than either constituent estimator alone for the same sample size (Coulton et al., 2023).
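The bias-cancelling effect of the positive definite geometric mean can be sketched with numpy. The two diagonal "estimates" below, with equal and opposite relative bias `eps`, are a schematic stand-in for the actual Monte Carlo estimators (an illustrative assumption, not the estimators from the cited paper):

```python
import numpy as np

def spd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def geometric_mean(A, B):
    """Positive definite geometric mean A # B = A^1/2 (A^-1/2 B A^-1/2)^1/2 A^1/2."""
    As = spd_sqrt(A)
    Ais = np.linalg.inv(As)
    return As @ spd_sqrt(Ais @ B @ Ais) @ As

# Schematic stand-ins: two estimates with equal and opposite
# leading-order relative bias eps (illustrative assumption).
F_true = np.diag([4.0, 2.0, 1.0])
eps = 0.1
F_plus = F_true * (1 + eps)    # upward-biased estimate
F_minus = F_true * (1 - eps)   # downward-biased estimate

F_gm = geometric_mean(F_plus, F_minus)
# The O(eps) biases cancel; here F_gm = sqrt(1 - eps**2) * F_true,
# leaving only an O(eps**2) residual.
```

For commuting matrices the combination reduces to the scalar geometric mean, which is why the first-order biases cancel exactly in this toy case.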
For nonparametric estimation, the curvature of an $f$-divergence between neighboring parameterizations can be exploited. This approach estimates the FIM via finite-differenced estimates of the local divergence, using empirical estimators of the divergence (e.g., kNN or kernel density-ratio estimators) and forming the Hessian with respect to parameter perturbations (Berisha et al., 2014).
3. Pathologies and Theoretical Limitations
The empirical Fisher is not, in general, an unbiased or even consistent estimator of the true Fisher matrix outside certain regime limits (realizability, infinite data, or special model classes):
- The empirical Fisher replaces the model expectation over labels with a single observed value, introducing irreducible bias except at the optimum in the infinite data limit under perfect model specification (Kunstner et al., 2019).
- For least squares, the empirical Fisher can collapse to zero as residuals vanish, causing the preconditioner to diverge and destroying second-order structure.
- The EF does not coincide with the Hessian except under exponential-family and "small residual" conditions, where the generalized Gauss–Newton matrix matches the Fisher (Kunstner et al., 2019).
- In high-dimensional or overparameterized models, EF-based updates may be nearly orthogonal to natural gradients, leading to optimization pathologies (Wu et al., 2024).
- The “inversely-scaled projection” issue: Empirical Fisher preconditioners enforce identical per-sample loss reductions regardless of the sample’s gradient norm, leading to overshooting for well-learned examples and underweighting challenging samples (Wu et al., 2024).
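The least-squares collapse noted in the list above can be reproduced in a few lines. The sketch below uses a single scalar parameter and illustrative residual values:

```python
# One-parameter least squares: loss = 0.5 * (theta - y)^2, so the per-sample
# gradient g equals the residual and the empirical Fisher is g^2.
steps = []
for residual in [1.0, 1e-2, 1e-4]:
    g = residual
    ef = g ** 2           # empirical Fisher collapses quadratically with the residual
    steps.append(g / ef)  # EF-preconditioned step grows like 1 / residual
```

As the residual shrinks, the EF shrinks faster than the gradient, so the EF-preconditioned step diverges; this is the breakdown of second-order structure described above, and the reason heavy damping is needed in practice.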
4. Structured Approximations: Kronecker and Low-Rank Factorization
Deep networks exhibit parameter structure (e.g., weights as matrices), making direct manipulation of the full Fisher matrix infeasible. Structured approximations exploit this via block-diagonal and Kronecker factorized forms:
- KFAC and related methods approximate per-layer Fisher blocks as $F_\ell \approx A_\ell \otimes G_\ell$, where $A_\ell$ and $G_\ell$ are the covariances of layer inputs and output gradients, respectively. This enables efficient inversion using properties of Kronecker products (Koroko et al., 2022).
- Kronecker product SVD (KPSVD) and rank-2 deflation further improve the fit to the empirical Fisher by directly minimizing the Frobenius norm difference to the best (sum-of-)Kronecker approximation via SVD or Lanczos decomposition. This improves optimizer convergence beyond KFAC and standard EF-based approaches (Koroko et al., 2022).
- The DyKAF algorithm introduces a dynamic, projector-splitting scheme to maintain best rank-1 Kronecker approximations in a strictly online manner, provably improving approximation error rates and optimizer stability. Each projector-splitting update integrates new gradient information without constructing full high-dimensional Fisher blocks. DyKAF demonstrates both theoretical and empirical superiority to previous heuristics in large models (Yudin et al., 9 Nov 2025).
| Method | Structure Used | Inversion Complexity |
|---|---|---|
| KFAC | Per-layer Kronecker | $O(m^3 + n^3)$ per layer (factors $m \times m$, $n \times n$) |
| KPSVD, Deflation | Kronecker SVD (rank ≥ 1) | $O(m^3 + n^3)$ plus power-iteration SVD |
| DyKAF | Dynamic proj. Kronecker | Low-rank projector-splitting update per layer |
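The reason Kronecker structure makes inversion cheap is the identity $(A \otimes G)^{-1}\,\mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-1})$ for symmetric factors (with column-major vec). A minimal numpy sketch, with illustrative factor sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def spd(n):
    """Random symmetric positive definite matrix (illustrative factor)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

# Hypothetical per-layer factors: A (input covariance, m x m) and
# G (output-gradient covariance, n x n); the layer Fisher block is A ⊗ G.
m, n = 4, 3
A, G = spd(m), spd(n)
V = rng.standard_normal((n, m))          # a gradient reshaped to the weight shape

vec = lambda X: X.reshape(-1, order="F")  # column-major vec

# Naive route: build and solve the full (m*n x m*n) block, O(m^3 n^3).
full = np.linalg.solve(np.kron(A, G), vec(V))

# Kronecker route: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}), O(m^3 + n^3).
fast = vec(np.linalg.solve(G, V) @ np.linalg.inv(A))
```

The two routes agree, but the factored one only ever inverts the small factors, which is what makes KFAC-style preconditioning feasible for wide layers.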
5. Recent Advances: Improved Empirical Fisher and Modern Variants
Empirical Fisher pathologies have motivated improved estimators:
- The Improved Empirical Fisher (iEF) method addresses the inversely-scaled projection flaw in EF. iEF introduces sample-wise scaling with logit-gradient norms, aligning per-sample updates with the true natural-gradient spectrum, and demonstrates superior convergence and robustness to damping schedules across tasks (Wu et al., 2024).
- The Squisher approach reuses Adam’s bias-corrected squared gradient accumulator as a scalable, "zero-cost" estimator for the empirical Fisher diagonal, yielding performance essentially indistinguishable from standard empirical Fisher diagnostics in model merging, parameter pruning, continual learning, and task embedding applications (Li et al., 24 Jul 2025).
- In variational inference, iterative rank-one updates with the Sherman–Morrison formula enable direct online inversion of the empirical Fisher without ever forming or storing the matrix, with provable convergence rates and asymptotic normality of the averaged parameter iterates (Godichon-Baggioni et al., 2023).
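The rank-one inverse update at the core of such online schemes is the Sherman–Morrison formula. A sketch of the recursion for a damped sum of score outer products (the damping `lam`, the unnormalized sum, and the Gaussian stand-in scores are illustrative choices, not the cited paper's exact recursion):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, lam = 5, 200, 1e-2     # dimension, number of samples, damping (illustrative)

# Track the inverse of F_t = lam * I + sum_i s_i s_i^T online.
F_inv = np.eye(d) / lam
F = lam * np.eye(d)          # explicit matrix kept only to verify the recursion

for _ in range(T):
    s = rng.standard_normal(d)   # stand-in per-sample score vector
    F += np.outer(s, s)
    # Sherman-Morrison: (F + s s^T)^-1 = F^-1 - (F^-1 s)(F^-1 s)^T / (1 + s^T F^-1 s)
    u = F_inv @ s
    F_inv -= np.outer(u, u) / (1.0 + s @ u)
```

Each update costs $O(d^2)$ rather than the $O(d^3)$ of a fresh inversion, and the matrix being inverted never has to be materialized separately from its inverse.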
6. Practical Implications and Applications
Empirical Fisher approximations play roles in:
- Natural gradient optimization, adaptive preconditioning, and second-order method acceleration.
- Experimental design and uncertainty quantification, leveraging the empirical Fisher for Cramér–Rao bounds and parameter error estimation.
- Diagnostics such as continual learning penalties (EWC), task embedding (Task2Vec), parameter pruning, and federated model merging, where diagonals or low-rank blocks are especially effective (Li et al., 24 Jul 2025, Koroko et al., 2022).
- Implementation in large-scale models: Structured approximations, dynamic Kronecker methods, and adaptive diagonal tracking (Squisher) enable practical deployment by reducing the $O(d^2)$ storage and $O(d^3)$ inversion cost of the full Fisher to feasible levels for modern parameter counts $d$.
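As one concrete application from the list above, an EWC-style continual-learning penalty consumes only the Fisher diagonal. A minimal sketch, where the per-sample gradients, parameter values, and penalty weight are all hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-sample gradients collected on an old task (rows = samples);
# the diagonal empirical Fisher is the mean of their squared entries.
grads_old_task = rng.standard_normal((256, 10))
fisher_diag = (grads_old_task ** 2).mean(axis=0)

theta_star = rng.standard_normal(10)                 # params frozen after the old task
theta = theta_star + 0.1 * rng.standard_normal(10)   # params while training a new task

# EWC-style quadratic penalty: important (high-Fisher) parameters are
# anchored more strongly to their old-task values.
lam = 1.0
ewc_penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)
```

The same diagonal statistic is what Squisher reuses from Adam's second-moment accumulator, which is why it slots into these diagnostics at essentially zero extra cost.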
Despite these advances, it is critical to recognize limitations of empirical Fisher approaches in capturing second-order curvature, particularly in misspecified or highly non-Gaussian regimes. Modern schemes—iEF, projector-splitting Kronecker approximations, Squisher, and iterative Sherman–Morrison inversions—mitigate many traditional pathologies and extend usability to ever-larger and more complex models (Yudin et al., 9 Nov 2025, Wu et al., 2024, Li et al., 24 Jul 2025, Godichon-Baggioni et al., 2023).
7. Limitations, Efficiency, and Theoretical Guarantees
Asymptotic analyses show that empirical and Hessian-based Fisher estimators are both consistent under regularity conditions (e.g., differentiability, bounded moments, realizability). However, in both scalar and multivariate cases, Hessian-based estimators often exhibit strictly smaller variance than standard empirical Fisher moment estimators (Guo, 2014). Bias–variance tradeoffs, the curse of dimensionality, and sensitivity to sampling noise must be considered in empirical Fisher deployments. For nonparametric schemes (Berisha et al., 2014), the curse of dimensionality limits precision in large-$d$ settings; for Squisher (Li et al., 24 Jul 2025), moving-average hyperparameter choices affect estimator bias, though empirically the effect is minor.
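The variance gap between moment-based and Hessian-based estimation is easiest to see in a toy Gaussian location model, where the per-sample Hessian is a constant and therefore has zero sampling variance (the model and constants below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, N = 0.0, 2.0, 10_000
x = rng.normal(theta, sigma, size=N)

# Moment (empirical Fisher) estimator: average squared score, noisy around 1/sigma^2.
scores = (x - theta) / sigma**2
fisher_moment = np.mean(scores**2)

# Hessian estimator: -d^2/dtheta^2 log p(x | theta) = 1/sigma^2 for every sample,
# so its empirical average has zero sampling variance in this model.
fisher_hessian = np.mean(np.full(N, 1.0 / sigma**2))
```

Both estimators target the same value $1/\sigma^2 = 0.25$, but only the moment estimator fluctuates with the sample, illustrating the variance ordering cited above in its most extreme form.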
A plausible implication is that, for practical large-scale deep learning, empirical Fisher methods are most effective when paired with structural (Kronecker, block diagonal) and principled correction strategies (iEF, bias-combined MC, Squisher) to balance fidelity and efficiency. The dynamic projector-splitting and SVD-inspired approaches set the current state-of-the-art in accuracy and scalability for empirical Fisher matrix approximation and its variants.