
EK-FAC: Efficient Curvature Approximation

Updated 11 March 2026
  • EK-FAC approximations refine K-FAC by correcting eigenvalue mismatches to improve spectral fidelity and enable accurate inverse-Hessian-vector products in deep networks.
  • By leveraging a block-Kronecker structure, EK-FAC efficiently inverts large curvature matrices, significantly reducing computational costs compared to full Hessian methods.
  • Empirical findings demonstrate that EK-FAC achieves near-Hessian accuracy in influence function analyses, outperforming traditional K-FAC in attribution fidelity and runtime efficiency.

The Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation is a structured curvature estimation technique designed to enable computationally tractable, high-fidelity inverse-Hessian-vector products (IHVPs) in deep learning models. EK-FAC targets the dominant bottleneck in influence function computations, the inversion of large and ill-conditioned Hessians or Fisher information matrices, by exploiting a block-Kronecker structure while directly correcting for the eigenvalue mismatches that confound classical K-FAC. It is now the preferred scalable approach for influence function-based data attribution and sensitivity analyses in LLMs and deep neural architectures, striking a favorable balance between cost and spectral accuracy (Grosse et al., 2023, Bao et al., 8 May 2025, Hong et al., 27 Sep 2025).

1. Context: Curvature Approximations in Deep Models

In deep learning, estimation of second-order curvature (the Hessian or Fisher information matrix) is critical for optimization, preconditioning, and influence function analysis. For a model with empirical risk $J(\theta) = \frac{1}{N} \sum_n L(z_n, \theta)$, the empirical Hessian $H = \nabla^2 J(\theta)$ is typically intractable to store or invert even for moderately sized models. Standard approximations proceed in stages:

  • Generalized Gauss-Newton (GGN): Linearizes the output layer, yielding $G = \mathbb{E}[J_{\theta \to y}^\top H_y J_{\theta \to y}]$.
  • Block-Diagonalization: Approximates curvature as block-diagonal in layer-wise parameter partitions.
  • Kronecker-Factored (K-FAC): Approximates each block as a Kronecker product of input-activation and pre-activation-gradient covariances (i.e., $A \otimes S$).
  • Eigenvalue-Corrected K-FAC (EK-FAC): Retains the Kronecker eigenbasis but fits the true diagonal in that basis using empirical second moments, thus improving spectral approximation (Grosse et al., 2023, Hong et al., 27 Sep 2025, Bao et al., 8 May 2025).

Each stage reduces computational complexity but incurs spectral or structural error which propagates to downstream IHVP computations required for influence functions.
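To make the storage argument concrete, here is a minimal back-of-envelope sketch comparing entry counts for a full curvature block versus the two Kronecker factors; the 4096-unit layer width is an illustrative assumption, not a figure from the cited papers:

```python
# Back-of-envelope storage comparison for one fully connected layer:
# the full curvature block vs. the two K-FAC Kronecker factors.

def hessian_entries(m: int, p: int) -> int:
    """Entries in the full curvature block for a P x M weight matrix."""
    d = m * p                  # number of parameters in the layer
    return d * d               # dense D x D block

def kfac_entries(m: int, p: int) -> int:
    """Entries in the factors A (M x M) and S (P x P)."""
    return m * m + p * p

m, p = 4096, 4096              # illustrative layer width
print(hessian_entries(m, p))   # 281474976710656 (~2.8e14): intractable
print(kfac_entries(m, p))      # 33554432 (~3.4e7): easily stored
```

The quadratic blow-up in the dense block, versus the roughly $M^2 + P^2$ footprint of the factors, is what makes the Kronecker route viable at scale.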

2. The K-FAC and EK-FAC Constructions

Let $W_\ell \in \mathbb{R}^{P \times (M+1)}$ be the (bias-augmented) weight matrix of fully connected layer $\ell$, with homogeneous input activations $\overline{a}_{\ell-1} \in \mathbb{R}^{M+1}$ and backpropagated gradients $\delta_\ell \in \mathbb{R}^{P}$. Writing the per-example gradient as $g_\ell = \mathrm{vec}(\delta_\ell \overline{a}_{\ell-1}^\top) = \overline{a}_{\ell-1} \otimes \delta_\ell$, the empirical curvature block for this layer is

$$G_\ell = \mathbb{E}[g_\ell g_\ell^\top] = \mathbb{E}\big[(\overline{a}_{\ell-1} \otimes \delta_\ell)(\overline{a}_{\ell-1} \otimes \delta_\ell)^\top\big].$$

K-FAC assumes independence between activations and gradients, yielding the factorization

$$G_\ell \approx \mathbb{E}[\overline{a}_{\ell-1} \overline{a}_{\ell-1}^\top] \otimes \mathbb{E}[\delta_\ell \delta_\ell^\top] = A_\ell \otimes S_\ell.$$

The inverse of this block is computed efficiently as $(A_\ell \otimes S_\ell)^{-1} = A_\ell^{-1} \otimes S_\ell^{-1}$. Applying it to a vectorized parameter update reduces to two small matrix solves, scaling as $O(M^2 P + M P^2)$ per layer.
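A minimal NumPy sketch of this Kronecker-inverse identity, using small random symmetric positive-definite stand-ins for $A_\ell$ and $S_\ell$ (all sizes and matrices here are illustrative assumptions):

```python
import numpy as np

# Sketch of a K-FAC inverse-vector product via the identity
# (A ⊗ S)^{-1} vec(V) = vec(S^{-1} V A^{-1}) (column-major vec, symmetric A),
# which avoids ever materializing the MP x MP curvature block.

rng = np.random.default_rng(0)
M, P = 8, 6

# Symmetric positive-definite stand-ins for the covariance factors.
A = rng.standard_normal((M, M)); A = A @ A.T + M * np.eye(M)
S = rng.standard_normal((P, P)); S = S @ S.T + P * np.eye(P)

V = rng.standard_normal((P, M))                    # gradient reshaped P x M
ihvp = np.linalg.solve(S, V) @ np.linalg.inv(A)    # two small solves

# Cross-check against the explicit Kronecker inverse (toy sizes only).
G = np.kron(A, S)
ref = np.linalg.solve(G, V.reshape(-1, order="F")).reshape((P, M), order="F")
assert np.allclose(ihvp, ref)
```

The two solves touch only $M \times M$ and $P \times P$ matrices, so the $MP \times MP$ block is never formed; the explicit Kronecker cross-check is feasible only at toy sizes.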

EK-FAC refines this by addressing the poorly estimated mixed-eigenvalue spectrum of $A_\ell \otimes S_\ell$. The steps are as follows:

  1. Compute eigendecompositions of $A_\ell$ and $S_\ell$: $A_\ell = Q_A \Lambda_A Q_A^\top$ and $S_\ell = Q_S \Lambda_S Q_S^\top$.
  2. Form the joint Kronecker eigenbasis $Q_\ell = Q_A \otimes Q_S$.
  3. Project per-example layer gradients into $Q_\ell$ and estimate empirical second moments along each eigendirection:

$$D^{\mathrm{EK}}_{\ell,ii} = \mathbb{E}\big[(Q_\ell^\top g_\ell)_i^2\big],$$

where $g_\ell$ denotes the vectorized per-example gradient for layer $\ell$.

  4. The resulting approximation is:

$$G_\ell \approx Q_\ell \, \mathrm{diag}(D^{\mathrm{EK}}_\ell) \, Q_\ell^\top,$$

and for regularized inversion:

$$(G_\ell + \lambda I)^{-1} v_\ell = Q_\ell \big[ \mathrm{diag}(D^{\mathrm{EK}}_\ell) + \lambda I \big]^{-1} Q_\ell^\top v_\ell.$$

(Grosse et al., 2023, Bao et al., 8 May 2025)

This preserves the efficiency of K-FAC while correcting the curvature spectrum to align with the empirical mode-by-mode variances observed in the data.
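The four steps above can be sketched end to end on a toy layer; the factor matrices and per-example "gradients" below are synthetic stand-ins, not quantities from any particular model:

```python
import numpy as np

# Toy end-to-end sketch of the EK-FAC correction and regularized IHVP.
rng = np.random.default_rng(1)
M, P, N = 5, 4, 2000

# Synthetic SPD Kronecker factors A and S.
A_f = rng.standard_normal((M, M)); A_f = A_f @ A_f.T + np.eye(M)
S_f = rng.standard_normal((P, P)); S_f = S_f @ S_f.T + np.eye(P)

# Step 1: eigendecompositions of the factors.
_, QA = np.linalg.eigh(A_f)
_, QS = np.linalg.eigh(S_f)

# Step 2: joint Kronecker eigenbasis Q = QA ⊗ QS (orthogonal).
Q = np.kron(QA, QS)

# Step 3: fit the diagonal from per-example gradients projected into Q.
g = rng.standard_normal((N, M * P))      # synthetic per-example gradients
coords = g @ Q                           # row n holds (Q^T g_n)^T
D_ek = (coords ** 2).mean(axis=0)        # empirical second moments

# Step 4: regularized IHVP (G + λI)^{-1} v = Q (D_ek + λ)^{-1} Q^T v.
lam = 0.1 * D_ek.mean()
v = rng.standard_normal(M * P)
ihvp = Q @ ((Q.T @ v) / (D_ek + lam))

# Sanity check against the dense eigenvalue-corrected curvature.
G_ek = Q @ np.diag(D_ek) @ Q.T
assert np.allclose(ihvp, np.linalg.solve(G_ek + lam * np.eye(M * P), v))
```

Because $Q_\ell$ is orthogonal, the regularized inverse reduces to an elementwise division in the eigenbasis; in practice the projection $Q_\ell^\top v_\ell$ is itself performed factor-by-factor rather than with the explicit Kronecker product shown here.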

3. Algorithmic and Computational Details

The EK-FAC workflow consists of:

  1. Maintaining exponential moving averages (EMAs) of $A_\ell$ and $S_\ell$ on each mini-batch, with decay rate $\beta \approx 0.95$ (Grosse et al., 2023).
  2. Periodically (e.g., every $K$ steps) recomputing the eigendecompositions of $A_\ell$ and $S_\ell$, at cost $O(M^3 + P^3)$.
  3. Projecting per-example gradients into the Kronecker eigenbasis and accumulating empirical second moments (cost $O(MP)$ per batch and block).
  4. At influence-query time, applying the precomputed factors to form the IHVP at cost $O(M^2 P + M P^2)$ per block.

Memory overhead per layer is $O(M^2 + P^2 + MP)$ due to storage of the eigenvectors and the empirical diagonal; block-diagonal partitioning is recommended for very large layers.
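Step 1 of this workflow can be sketched as follows, with random mini-batch activations and gradients standing in for real ones (batch size, widths, and step count are illustrative assumptions):

```python
import numpy as np

# Sketch of the EMA factor statistics: on each mini-batch, update running
# estimates of A = E[a a^T] and S = E[δ δ^T] with decay β ≈ 0.95.
rng = np.random.default_rng(2)
M, P, B = 5, 4, 32
beta = 0.95                            # EMA decay rate, as recommended above

A_ema = np.zeros((M, M))
S_ema = np.zeros((P, P))

for step in range(10):
    a = rng.standard_normal((B, M))    # layer inputs for this mini-batch
    d = rng.standard_normal((B, P))    # backpropagated gradients
    A_batch = a.T @ a / B              # batch activation covariance
    S_batch = d.T @ d / B              # batch gradient covariance
    A_ema = beta * A_ema + (1.0 - beta) * A_batch
    S_ema = beta * S_ema + (1.0 - beta) * S_batch

# The running factors stay symmetric, ready for periodic eigendecomposition.
assert np.allclose(A_ema, A_ema.T) and np.allclose(S_ema, S_ema.T)
```

The eigendecompositions (step 2) are then amortized by recomputing them only every $K$ steps rather than per batch.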

A quantitative comparison with alternative curvature approximations appears in the summary table of Section 7.

4. Spectral Accuracy, Empirical Results, and Error Decomposition

EK-FAC's central advantage is its spectral fidelity: it aligns the block-wise curvature eigenvalue spectrum with the actual empirical second moment spectrum, correcting bias introduced by the Kronecker-product assumption. Empirical studies demonstrate:

  • Attribution fidelity: In small-scale (MLP/UCI) and moderate-scale (transformer) settings, EK-FAC outperforms K-FAC and approaches GGN/Hessian benchmarks in data attribution tasks, as quantified by Linear Data-modelling Score (LDS) and direct Hessian-inverse-verification (Grosse et al., 2023, Hong et al., 27 Sep 2025).
  • Spectral overlap: EK-FAC achieves markedly higher EvalOverlap (≈0.9 at 100 epochs) than K-FAC (≈0.75), with the remaining gap explained by block-diagonal and Kronecker-factorization errors (Hong et al., 27 Sep 2025).
  • Runtime gains: At LLM scales, EK-FAC reduces IHVP wall time by at least an order of magnitude over LiSSA and Arnoldi (e.g., 3.57s per 500-candidate influence scan vs. 913s for CG; factor fitting time 1.17h for EK-FAC vs. 18.6h for TRAK) (Bao et al., 8 May 2025).
  • Error budget: The Kronecker-factorization step dominates the total approximation error (accounting for 40–60% of the LDS error); EK-FAC recovers about half of that gap. Block-diagonality error grows with network depth, while the GGN-for-Hessian substitution matters only far from convergence (Hong et al., 27 Sep 2025).

5. Practical Implementation, Scalability, and Recommendations

  • Block Partitioning: To address GPU memory constraints in very large LLMs, EK-FAC applies block-diagonalization within layers, at some cost in spectral quality (Grosse et al., 2023).
  • Layer selection: State-of-the-art LLM pipelines include only linear transforms (MLP input/output, MHA projections) for EK-FAC; embeddings and normalization layers are typically omitted (Bao et al., 8 May 2025).
  • Update frequency and damping: The eigenbasis is refit every 1–2K steps, and a damping parameter $\lambda \approx 0.1 \cdot \mathrm{mean}(\text{eigenvalues})$ is used for stability (Grosse et al., 2023, Bao et al., 8 May 2025).
  • Empirical tuning: Damping, estimation batch size, and block size require tuning for system stability and runtime trade-offs (Bao et al., 8 May 2025).
  • IHVP computation: Once factors are cached, EK-FAC IHVPs are GPU-friendly and benefit from fused batch GEMM kernels for projection operations.
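The damping heuristic from the bullets above can be sketched as a small helper; the function name and the example spectrum are illustrative assumptions:

```python
import numpy as np

# Sketch of the damping heuristic: λ = 0.1 * mean(eigenvalues), applied
# inside the eigenbasis solve to stabilize near-zero curvature directions.

def damped_inverse_diag(d_ek: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Return 1 / (d + λ) with λ = scale * mean(d)."""
    lam = scale * d_ek.mean()
    return 1.0 / (d_ek + lam)

d = np.array([1e-6, 1e-2, 1.0, 10.0])   # an ill-conditioned toy spectrum
inv = damped_inverse_diag(d)
assert np.all(np.isfinite(inv))
assert inv[0] < 1.0 / d[0]               # damping caps the blow-up at 1e-6
```

Without damping, the smallest eigendirection would be amplified by a factor of $10^6$; the additive $\lambda$ bounds the inverse spectrum at roughly $1/\lambda$.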

EK-FAC's empirical strengths are maximized when influence score fidelity is critical (e.g., forensic, debugging, curation tasks) in shallow or moderately-deep models, or when wall-clock constraints preclude iterative solvers. For very deep models or those with strong cross-layer coupling, hybridizations or block-GGN may become preferable (Hong et al., 27 Sep 2025).

6. Limitations and Open Challenges

  • Block-diagonal assumption: EK-FAC ignores cross-layer curvature, which can limit fidelity in very deep architectures (Grosse et al., 2023, Hong et al., 27 Sep 2025).
  • Applicability: Application is restricted to MLP and certain MHA sub-blocks; embeddings, unembedding, and normalization parameters are untreated for computational reasons (Grosse et al., 2023, Bao et al., 8 May 2025).
  • Nonlinear phenomena: All linearized curvature methods—EK-FAC included—fail to capture phenomena such as “grokking” or circuit formation (Grosse et al., 2023).
  • Hyperparameter sensitivity: Accuracy and stability depend on fitting frequency, batch-size, and choice of blocks (Bao et al., 8 May 2025).
  • Software/hardware constraints: Large-scale eigen-decompositions require significant CPU or mixed-precision compute; extension to arbitrary architectures or modalities (e.g., diffusion models or encoder–decoder LMs) remains an open direction (Bao et al., 8 May 2025).
  • Alternative methods: For absolute accuracy and if compute is unconstrained, matrix-free solvers (LiSSA, CG) are structurally unbiased but far slower (Hong et al., 27 Sep 2025).

7. Summary Table: Comparison of Curvature Approximations for IHVP

Method    | Structure                     | Spectral Fidelity | IHVP Cost
----------|-------------------------------|-------------------|--------------------------
Hessian   | Full, unapproximated          | 1.00              | $O(D^3)$
GGN       | Linearized output, full       | ≈1.00             | $O(D^3)$
Block-GGN | Block-diagonal                | <1.00             | $O(\sum_\ell d_\ell^3)$
EK-FAC    | Block-Kronecker, corrected    | 0.9–1.0           | $O(M^2 P + M P^2)$
K-FAC     | Block-Kronecker, uncorrected  | 0.6–0.8           | $O(M^2 P + M P^2)$

Above: spectral fidelity is measured by EvalOverlap; costs are per IHVP per block; $D = \sum_\ell d_\ell$.

EK-FAC is established as the method of choice for scalable, high-accuracy IHVPs and influence function analyses in billion-parameter models where the cost-fidelity trade-off is dominant and eigenvalue mismatch is otherwise a limiting factor (Grosse et al., 2023, Bao et al., 8 May 2025, Hong et al., 27 Sep 2025).
