
EK-FAC: Efficient Curvature Approximation

Updated 11 March 2026
  • EK-FAC approximations refine K-FAC by correcting eigenvalue mismatches to improve spectral fidelity and enable accurate inverse-Hessian-vector products in deep networks.
  • By leveraging a block-Kronecker structure, EK-FAC efficiently inverts large curvature matrices, significantly reducing computational costs compared to full Hessian methods.
  • Empirical findings demonstrate that EK-FAC achieves near-Hessian accuracy in influence function analyses, outperforming traditional K-FAC in attribution fidelity and runtime efficiency.

The Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation is a structured curvature estimation technique designed to enable computationally tractable, high-fidelity inverse-Hessian-vector products (IHVPs) in deep learning models. EK-FAC targets the dominant bottleneck in influence function computations, the inversion of large and ill-conditioned Hessians or Fisher information matrices, by exploiting a block-Kronecker structure while directly correcting for the eigenvalue mismatches that confound classical K-FAC. It is now the preferred scalable approach for influence function-based data attribution and sensitivity analyses in LLMs and deep neural architectures, striking a favorable balance between cost and spectral accuracy (Grosse et al., 2023, Bao et al., 8 May 2025, Hong et al., 27 Sep 2025).

1. Context: Curvature Approximations in Deep Models

In deep learning, estimation of second-order curvature (the Hessian or Fisher information matrix) is critical for optimization, preconditioning, and influence function analysis. For a model with empirical risk $J(\theta) = \frac{1}{N} \sum_n L(z_n, \theta)$, the empirical Hessian $H = \nabla^2 J(\theta)$ is typically intractable to store or invert even for moderately sized models. Standard approximations proceed in stages:

  • Generalized Gauss-Newton (GGN): Linearizes the output layer, yielding $G = \mathbb{E}[J_{\theta \to y}^\top H_y J_{\theta \to y}]$.
  • Block-Diagonalization: Approximates curvature as block-diagonal in layer-wise parameter partitions.
  • Kronecker-Factored (K-FAC): Approximates each block as a Kronecker product of input-activation and pre-activation-gradient covariances (i.e., $A \otimes S$).
  • Eigenvalue-Corrected K-FAC (EK-FAC): Retains the Kronecker eigenbasis but fits the true diagonal in that basis using empirical second moments, thus improving spectral approximation (Grosse et al., 2023, Hong et al., 27 Sep 2025, Bao et al., 8 May 2025).

Each stage reduces computational complexity but incurs spectral or structural error which propagates to downstream IHVP computations required for influence functions.
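To make the storage argument concrete, here is a minimal back-of-envelope sketch comparing entry counts for a full curvature block versus the two Kronecker factors; the 4096-unit layer width is an illustrative assumption, not a figure from the cited papers:

```python
# Back-of-envelope storage comparison for one fully connected layer:
# the full curvature block vs. the two K-FAC Kronecker factors.

def hessian_entries(m: int, p: int) -> int:
    """Entries in the full curvature block for a P x M weight matrix."""
    d = m * p                  # number of parameters in the layer
    return d * d               # dense D x D block

def kfac_entries(m: int, p: int) -> int:
    """Entries in the factors A (M x M) and S (P x P)."""
    return m * m + p * p

m, p = 4096, 4096              # illustrative layer width
print(hessian_entries(m, p))   # 281474976710656 (~2.8e14): intractable
print(kfac_entries(m, p))      # 33554432 (~3.4e7): easily stored
```

The quadratic blow-up in the dense block, versus the roughly $M^2 + P^2$ footprint of the factors, is what makes the Kronecker route viable at scale.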

2. The K-FAC and EK-FAC Constructions

Let $W_\ell \in \mathbb{R}^{P \times (M+1)}$ be the (bias-augmented) weight matrix of fully connected layer $\ell$, with homogeneous input activations $\overline{a}_{\ell-1} \in \mathbb{R}^{M+1}$ and backpropagated gradients $\delta_\ell \in \mathbb{R}^{P}$. Writing the per-example gradient as $g_\ell = \mathrm{vec}(\delta_\ell \overline{a}_{\ell-1}^\top) = \overline{a}_{\ell-1} \otimes \delta_\ell$, the empirical curvature block for this layer is

$$G_\ell = \mathbb{E}[g_\ell g_\ell^\top] = \mathbb{E}\big[(\overline{a}_{\ell-1} \otimes \delta_\ell)(\overline{a}_{\ell-1} \otimes \delta_\ell)^\top\big].$$

K-FAC assumes independence between activations and gradients, yielding the factorization

$$G_\ell \approx \mathbb{E}[\overline{a}_{\ell-1} \overline{a}_{\ell-1}^\top] \otimes \mathbb{E}[\delta_\ell \delta_\ell^\top] = A_\ell \otimes S_\ell.$$

The inverse of this block is computed efficiently as $(A_\ell \otimes S_\ell)^{-1} = A_\ell^{-1} \otimes S_\ell^{-1}$. Applying it to a vectorized parameter update reduces to two small matrix solves, scaling as $O(M^2 P + M P^2)$ per layer.
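A minimal NumPy sketch of this Kronecker-inverse identity, using small random symmetric positive-definite stand-ins for $A_\ell$ and $S_\ell$ (all sizes and matrices here are illustrative assumptions):

```python
import numpy as np

# Sketch of a K-FAC inverse-vector product via the identity
# (A ⊗ S)^{-1} vec(V) = vec(S^{-1} V A^{-1}) (column-major vec, symmetric A),
# which avoids ever materializing the MP x MP curvature block.

rng = np.random.default_rng(0)
M, P = 8, 6

# Symmetric positive-definite stand-ins for the covariance factors.
A = rng.standard_normal((M, M)); A = A @ A.T + M * np.eye(M)
S = rng.standard_normal((P, P)); S = S @ S.T + P * np.eye(P)

V = rng.standard_normal((P, M))                    # gradient reshaped P x M
ihvp = np.linalg.solve(S, V) @ np.linalg.inv(A)    # two small solves

# Cross-check against the explicit Kronecker inverse (toy sizes only).
G = np.kron(A, S)
ref = np.linalg.solve(G, V.reshape(-1, order="F")).reshape((P, M), order="F")
assert np.allclose(ihvp, ref)
```

The two solves touch only $M \times M$ and $P \times P$ matrices, so the $MP \times MP$ block is never formed; the explicit Kronecker cross-check is feasible only at toy sizes.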

EK-FAC refines this by addressing the poorly estimated mixed-eigenvalue spectrum of $A_\ell \otimes S_\ell$. The steps are as follows:

  1. Compute eigendecompositions of $A_\ell$ and $S_\ell$: $A_\ell = Q_A \Lambda_A Q_A^\top$ and $S_\ell = Q_S \Lambda_S Q_S^\top$.
  2. Form the joint Kronecker eigenbasis $Q_\ell = Q_A \otimes Q_S$.
  3. Project per-example layer gradients into $Q_\ell$ and estimate empirical second moments along each eigendirection:

$$D^{\mathrm{EK}}_{\ell,ii} = \mathbb{E}\big[(Q_\ell^\top g_\ell)_i^2\big],$$

where $g_\ell$ denotes the vectorized per-example gradient for layer $\ell$.

  4. The resulting approximation is:

$$G_\ell \approx Q_\ell \, \mathrm{diag}(D^{\mathrm{EK}}_\ell) \, Q_\ell^\top,$$

and for regularized inversion:

$$(G_\ell + \lambda I)^{-1} v_\ell = Q_\ell \big[ \mathrm{diag}(D^{\mathrm{EK}}_\ell) + \lambda I \big]^{-1} Q_\ell^\top v_\ell.$$

(Grosse et al., 2023, Bao et al., 8 May 2025)

This preserves the efficiency of K-FAC while correcting the curvature spectrum to align with the empirical mode-by-mode variances observed in the data.
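The four steps above can be sketched end to end on a toy layer; the factor matrices and per-example "gradients" below are synthetic stand-ins, not quantities from any particular model:

```python
import numpy as np

# Toy end-to-end sketch of the EK-FAC correction and regularized IHVP.
rng = np.random.default_rng(1)
M, P, N = 5, 4, 2000

# Synthetic SPD Kronecker factors A and S.
A_f = rng.standard_normal((M, M)); A_f = A_f @ A_f.T + np.eye(M)
S_f = rng.standard_normal((P, P)); S_f = S_f @ S_f.T + np.eye(P)

# Step 1: eigendecompositions of the factors.
_, QA = np.linalg.eigh(A_f)
_, QS = np.linalg.eigh(S_f)

# Step 2: joint Kronecker eigenbasis Q = QA ⊗ QS (orthogonal).
Q = np.kron(QA, QS)

# Step 3: fit the diagonal from per-example gradients projected into Q.
g = rng.standard_normal((N, M * P))      # synthetic per-example gradients
coords = g @ Q                           # row n holds (Q^T g_n)^T
D_ek = (coords ** 2).mean(axis=0)        # empirical second moments

# Step 4: regularized IHVP (G + λI)^{-1} v = Q (D_ek + λ)^{-1} Q^T v.
lam = 0.1 * D_ek.mean()
v = rng.standard_normal(M * P)
ihvp = Q @ ((Q.T @ v) / (D_ek + lam))

# Sanity check against the dense eigenvalue-corrected curvature.
G_ek = Q @ np.diag(D_ek) @ Q.T
assert np.allclose(ihvp, np.linalg.solve(G_ek + lam * np.eye(M * P), v))
```

Because $Q_\ell$ is orthogonal, the regularized inverse reduces to an elementwise division in the eigenbasis; in practice the projection $Q_\ell^\top v_\ell$ is itself performed factor-by-factor rather than with the explicit Kronecker product shown here.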

3. Algorithmic and Computational Details

The EK-FAC workflow consists of:

  1. Maintaining exponential moving averages (EMAs) of $A_\ell$ and $S_\ell$ on each mini-batch, with decay rate $\beta \approx 0.95$ (Grosse et al., 2023).
  2. Periodically (e.g., every $K$ steps) recomputing the eigendecompositions of $A_\ell$ and $S_\ell$, at cost $O(M^3 + P^3)$.
  3. Projecting per-example gradients into the Kronecker eigenbasis and accumulating empirical second moments (cost $O(MP)$ per batch and block).
  4. At influence-query time, applying the precomputed factors to form the IHVP at cost $O(M^2 P + M P^2)$ per block.

Memory overhead per layer is $O(M^2 + P^2 + MP)$ due to storage of the eigenvectors and the empirical diagonal; block-diagonal partitioning is recommended for very large layers.
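Step 1 of this workflow can be sketched as follows, with random mini-batch activations and gradients standing in for real ones (batch size, widths, and step count are illustrative assumptions):

```python
import numpy as np

# Sketch of the EMA factor statistics: on each mini-batch, update running
# estimates of A = E[a a^T] and S = E[δ δ^T] with decay β ≈ 0.95.
rng = np.random.default_rng(2)
M, P, B = 5, 4, 32
beta = 0.95                            # EMA decay rate, as recommended above

A_ema = np.zeros((M, M))
S_ema = np.zeros((P, P))

for step in range(10):
    a = rng.standard_normal((B, M))    # layer inputs for this mini-batch
    d = rng.standard_normal((B, P))    # backpropagated gradients
    A_batch = a.T @ a / B              # batch activation covariance
    S_batch = d.T @ d / B              # batch gradient covariance
    A_ema = beta * A_ema + (1.0 - beta) * A_batch
    S_ema = beta * S_ema + (1.0 - beta) * S_batch

# The running factors stay symmetric, ready for periodic eigendecomposition.
assert np.allclose(A_ema, A_ema.T) and np.allclose(S_ema, S_ema.T)
```

The eigendecompositions (step 2) are then amortized by recomputing them only every $K$ steps rather than per batch.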

A quantitative comparison with alternative curvature approximations appears in the summary table of Section 7.

4. Spectral Accuracy, Empirical Results, and Error Decomposition

EK-FAC's central advantage is its spectral fidelity: it aligns the block-wise curvature eigenvalue spectrum with the actual empirical second moment spectrum, correcting bias introduced by the Kronecker-product assumption. Empirical studies demonstrate:

  • Attribution fidelity: In small-scale (MLP/UCI) and moderate-scale (transformer) settings, EK-FAC outperforms K-FAC and approaches GGN/Hessian benchmarks in data attribution tasks, as quantified by Linear Data-modelling Score (LDS) and direct Hessian-inverse-verification (Grosse et al., 2023, Hong et al., 27 Sep 2025).
  • Spectral overlap: EK-FAC achieves markedly higher EvalOverlap (≈0.9 at 100 epochs) than K-FAC (≈0.75), with the remaining gap explained by block-diagonal and Kronecker-factorization errors (Hong et al., 27 Sep 2025).
  • Runtime gains: At LLM scales, EK-FAC reduces IHVP wall time by at least an order of magnitude over LiSSA and Arnoldi (e.g., 3.57s per 500-candidate influence scan vs. 913s for CG; factor fitting time 1.17h for EK-FAC vs. 18.6h for TRAK) (Bao et al., 8 May 2025).
  • Error budget: The Kronecker-factorization step dominates the total approximation error (accounting for 40–60% of the LDS error); EK-FAC recovers about half of that gap. Block-diagonality error grows with network depth, while the GGN-for-Hessian substitution matters only far from convergence (Hong et al., 27 Sep 2025).

5. Practical Implementation, Scalability, and Recommendations

  • Block Partitioning: To address GPU memory constraints in very large LLMs, EK-FAC applies block-diagonalization within layers, at some cost in spectral quality (Grosse et al., 2023).
  • Layer selection: State-of-the-art LLM pipelines include only linear transforms (MLP input/output, MHA projections) for EK-FAC; embeddings and normalization layers are typically omitted (Bao et al., 8 May 2025).
  • Update frequency and damping: The eigenbasis is refit every 1–2K steps, and a damping parameter $\lambda \approx 0.1 \cdot \mathrm{mean}(\text{eigenvalues})$ is used for stability (Grosse et al., 2023, Bao et al., 8 May 2025).
  • Empirical tuning: Damping, estimation batch size, and block size require tuning for system stability and runtime trade-offs (Bao et al., 8 May 2025).
  • IHVP computation: Once factors are cached, EK-FAC IHVPs are GPU-friendly and benefit from fused batch GEMM kernels for projection operations.
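The damping heuristic from the bullets above can be sketched as a small helper; the function name and the example spectrum are illustrative assumptions:

```python
import numpy as np

# Sketch of the damping heuristic: λ = 0.1 * mean(eigenvalues), applied
# inside the eigenbasis solve to stabilize near-zero curvature directions.

def damped_inverse_diag(d_ek: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Return 1 / (d + λ) with λ = scale * mean(d)."""
    lam = scale * d_ek.mean()
    return 1.0 / (d_ek + lam)

d = np.array([1e-6, 1e-2, 1.0, 10.0])   # an ill-conditioned toy spectrum
inv = damped_inverse_diag(d)
assert np.all(np.isfinite(inv))
assert inv[0] < 1.0 / d[0]               # damping caps the blow-up at 1e-6
```

Without damping, the smallest eigendirection would be amplified by a factor of $10^6$; the additive $\lambda$ bounds the inverse spectrum at roughly $1/\lambda$.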

EK-FAC's empirical strengths are maximized when influence score fidelity is critical (e.g., forensic, debugging, curation tasks) in shallow or moderately-deep models, or when wall-clock constraints preclude iterative solvers. For very deep models or those with strong cross-layer coupling, hybridizations or block-GGN may become preferable (Hong et al., 27 Sep 2025).

6. Limitations and Open Challenges

  • Block-diagonal assumption: EK-FAC ignores cross-layer curvature, which can limit fidelity in very deep architectures (Grosse et al., 2023, Hong et al., 27 Sep 2025).
  • Applicability: Application is restricted to MLP and certain MHA sub-blocks; embeddings, unembedding, and normalization parameters are untreated for computational reasons (Grosse et al., 2023, Bao et al., 8 May 2025).
  • Nonlinear phenomena: All linearized curvature methods—EK-FAC included—fail to capture phenomena such as “grokking” or circuit formation (Grosse et al., 2023).
  • Hyperparameter sensitivity: Accuracy and stability depend on fitting frequency, batch-size, and choice of blocks (Bao et al., 8 May 2025).
  • Software/hardware constraints: Large-scale eigen-decompositions require significant CPU or mixed-precision compute; extension to arbitrary architectures or modalities (e.g., diffusion models or encoder–decoder LMs) remains an open direction (Bao et al., 8 May 2025).
  • Alternative methods: For absolute accuracy and if compute is unconstrained, matrix-free solvers (LiSSA, CG) are structurally unbiased but far slower (Hong et al., 27 Sep 2025).

7. Summary Table: Comparison of Curvature Approximations for IHVP

Method    | Structure                     | Spectral Fidelity | IHVP Cost
----------|-------------------------------|-------------------|--------------------------
Hessian   | Full, unapproximated          | 1.00              | $O(D^3)$
GGN       | Linearized output, full       | ≈1.00             | $O(D^3)$
Block-GGN | Block-diagonal                | <1.00             | $O(\sum_\ell d_\ell^3)$
EK-FAC    | Block-Kronecker, corrected    | 0.9–1.0           | $O(M^2 P + M P^2)$
K-FAC     | Block-Kronecker, uncorrected  | 0.6–0.8           | $O(M^2 P + M P^2)$

Above: spectral fidelity is measured by EvalOverlap; costs are per IHVP per block; $D = \sum_\ell d_\ell$.

EK-FAC is established as the method of choice for scalable, high-accuracy IHVPs and influence function analyses in billion-parameter models where the cost-fidelity trade-off is dominant and eigenvalue mismatch is otherwise a limiting factor (Grosse et al., 2023, Bao et al., 8 May 2025, Hong et al., 27 Sep 2025).
