Kronecker-Factored Approximate Curvature (K-FAC)

Updated 5 December 2025
  • K-FAC is a second-order optimization method that approximates the Fisher Information Matrix with a Kronecker factorization on a per-layer basis.
  • It efficiently preconditions gradients to speed up convergence in deep and distributed neural network training.
  • K-FAC’s modular approach is applied across various architectures, including CNNs, transformers, and reinforcement learning models.

Kronecker-Factored Approximate Curvature (K-FAC) is a structured second-order optimization algorithm designed to scale natural gradient methods to deep neural networks. It achieves tractable preconditioning by efficiently approximating the Fisher Information Matrix (FIM) or Gauss-Newton curvature with Kronecker factorization on a per-layer basis. Originating in the context of supervised learning but now extended to various architectures and domains, K-FAC has become a foundational technique for accelerating convergence and improving statistical efficiency in large-scale deep learning, reinforcement learning, continual learning, and physics-informed neural networks (Martens et al., 2015, Grosse et al., 2016, Osawa et al., 2018, Eschenhagen et al., 2023, Dangel et al., 24 May 2024).

1. Theoretical Foundations: Fisher Factorization and Layerwise Approximation

K-FAC begins with the observation that the ideal natural gradient step is

$\Delta\theta = -\eta\, F^{-1} \nabla_\theta L$

where $F$ is the Fisher Information Matrix,

$F = E\big[\nabla_\theta \log p(y|x;\theta)\, \nabla_\theta \log p(y|x;\theta)^T\big].$

Direct computation and inversion of $F$ are infeasible in deep networks due to its size. K-FAC applies two critical approximations:

  1. Block-diagonalization: $F \approx \mathrm{blockdiag}(F_1, \dots, F_L)$, ignoring inter-layer parameter dependencies.
  2. Kronecker factorization: For layer $\ell$ (fully connected, with weights $W_\ell$), $F_\ell \approx A_{\ell-1} \otimes G_\ell$, where $A_{\ell-1} = E[a_{\ell-1} a_{\ell-1}^T]$ and $G_\ell = E[g_\ell g_\ell^T]$, with $a_{\ell-1}$ denoting the layer's input activations and $g_\ell$ the backpropagated gradients with respect to its pre-activations (Martens et al., 2015, Grosse et al., 2016, Osawa et al., 2018).

For convolutional and other weight-sharing layers, the Kronecker factorization generalizes via spatial or position indices (Grosse et al., 2016, Eschenhagen et al., 2023, Dangel et al., 24 May 2024). For example, in convolutional settings, K-FAC uses statistical assumptions (spatial homogeneity, independence) to factor the block as the Kronecker product of input-patch and pre-activation gradient covariances.

A Tikhonov-style damping parameter is added to each factor for numerical stability, so that preconditioning is performed with the damped inverses $(A_{\ell-1}+\gamma I)^{-1}$ and $(G_\ell+\gamma I)^{-1}$ (Martens et al., 2015, Grosse et al., 2016).
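
To make the factorization concrete, the following minimal NumPy sketch estimates the per-layer factors and their damped inverses for a fully connected layer; the layer sizes, batch size, damping value, and synthetic activations/gradients are illustrative assumptions rather than values from any of the cited implementations.

```python
# Minimal sketch (assumed sizes and synthetic data): per-layer Kronecker
# factors A, G and their damped inverses for a fully connected layer.
import numpy as np

rng = np.random.default_rng(0)
m, d_in, d_out = 128, 64, 32          # batch size, layer input/output width (assumed)
gamma = 1e-3                          # Tikhonov damping (assumed)

# a: layer input activations, g: backpropagated gradients w.r.t. the layer's
# pre-activations, as they would be produced by a forward/backward pass.
a = rng.standard_normal((m, d_in))
g = rng.standard_normal((m, d_out))

# Kronecker factors A = E[a a^T], G = E[g g^T], estimated on the mini-batch.
A = a.T @ a / m                       # (d_in, d_in)
G = g.T @ g / m                       # (d_out, d_out)

# Damped inverses used for preconditioning: (A + gamma I)^-1, (G + gamma I)^-1.
A_inv = np.linalg.inv(A + gamma * np.eye(d_in))
G_inv = np.linalg.inv(G + gamma * np.eye(d_out))
```

For convolutional layers the same pattern applies to unfolded input patches and per-position gradients, as noted above.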

2. Practical Implementation: Factor Estimation, Inversion, and Distributed Algorithms

K-FAC implementations estimate $A$ and $G$ on each mini-batch, using exponential moving averages to track curvature statistics (Martens et al., 2015, Grosse et al., 2016, Osawa et al., 2018, Pauloski et al., 2021). The layerwise natural-gradient update is

$\Delta W_\ell = -\eta\, (G_\ell + \gamma I)^{-1}\, [\nabla_{W_\ell} L]\, (A_{\ell-1} + \gamma I)^{-1}$

with the gradient $\nabla_{W_\ell} L$ computed by backpropagation.

Inverting the Kronecker factors requires only $O(d^3)$ operations per layer, where $d$ is the layer width, instead of inverting the full block of dimension $d^2 \times d^2$, yielding significant computational savings.
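
A minimal sketch of this layerwise update (again with assumed sizes, hyperparameters $\eta$, $\beta$, $\gamma$, and synthetic data) shows how the Kronecker identity $(A \otimes G)^{-1}\mathrm{vec}(\nabla_W L) = \mathrm{vec}(G^{-1}\, \nabla_W L\, A^{-1})$ keeps the per-layer cost at two small inversions and two matrix multiplications:

```python
# Sketch of the layerwise K-FAC step; all values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, d_in, d_out = 128, 64, 32                 # batch size, layer input/output width
eta, gamma, beta = 0.01, 1e-3, 0.95          # step size, damping, EMA decay (assumed)

W = rng.standard_normal((d_out, d_in))       # layer weights
a = rng.standard_normal((m, d_in))           # layer inputs from the forward pass
g = rng.standard_normal((m, d_out))          # backpropagated pre-activation gradients
grad_W = g.T @ a / m                         # mini-batch gradient of the loss w.r.t. W

# Exponential moving averages of the Kronecker factors; initialized from the
# current batch here, and on later batches updated as
#   A_ema = beta * A_ema + (1 - beta) * (a.T @ a / m), likewise for G_ema.
A_ema = a.T @ a / m
G_ema = g.T @ g / m

# Kronecker identity: (A (x) G)^{-1} vec(grad) = vec(G^{-1} grad A^{-1}),
# so preconditioning needs two small solves/inverses plus two matmuls.
nat_grad = np.linalg.solve(G_ema + gamma * np.eye(d_out), grad_W)
nat_grad = nat_grad @ np.linalg.inv(A_ema + gamma * np.eye(d_in))
W -= eta * nat_grad                          # apply the preconditioned step
```

In practice both damped inverses are cached and refreshed only periodically, as discussed in Section 7.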

Distributed K-FAC algorithms decompose the workload across GPUs. In data-parallel schemes, Kronecker factors are constructed locally and aggregated by communication primitives (all-reduce, scatter-gather) for global averaging and inversion (Osawa et al., 2018, Pauloski et al., 2020, Pauloski et al., 2021, Zhang et al., 2022, Shi et al., 2021). Model-parallel variants assign the inversion work for individual layers or subsets to different devices to optimize memory and computation (Zhang et al., 2022, Shi et al., 2021). Strategies such as aggressive staleness (updating Kronecker factors/inverses less frequently) and symmetry-aware communication further optimize bandwidth and throughput for extremely large batches (up to 131k) (Osawa et al., 2018).
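
The data-parallel factor aggregation described above can be illustrated with a single-process simulation; the "workers" below are just loop iterations and the averaging stands in for an all-reduce, so the worker count, shard size, and synthetic statistics are assumptions rather than a real multi-GPU implementation:

```python
# Single-process simulation of data-parallel Kronecker-factor aggregation.
import numpy as np

rng = np.random.default_rng(1)
n_workers, m_local, d_in, d_out = 4, 64, 64, 32   # assumed worker count and sizes
gamma = 1e-3

local_A, local_G = [], []
for _ in range(n_workers):                    # one iteration per simulated worker
    a = rng.standard_normal((m_local, d_in))  # local activations on this worker's shard
    g = rng.standard_normal((m_local, d_out)) # local pre-activation gradients
    local_A.append(a.T @ a / m_local)
    local_G.append(g.T @ g / m_local)

# "All-reduce": average the per-worker factors to obtain global statistics.
A = sum(local_A) / n_workers
G = sum(local_G) / n_workers

# Inversion is then done once from the aggregated factors (or assigned
# layer-by-layer to different devices in model-parallel variants) and the
# results broadcast back to all workers.
A_inv = np.linalg.inv(A + gamma * np.eye(d_in))
G_inv = np.linalg.inv(G + gamma * np.eye(d_out))
```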

KAISA introduces a continuum between pure memory and pure communication strategies in distributed K-FAC, providing a heuristic to allocate layerwise preconditioning responsibility among nodes for optimal speed given hardware and memory constraints (Pauloski et al., 2021, Zhang et al., 2022).

3. Extensions, Variants, and Acceleration Techniques

Several K-FAC variants target major efficiency bottlenecks:

  • CG-FAC solves the preconditioning step using conjugate gradients and does not require explicit formation of Kronecker factors or full block matrices, reducing per-step time and memory complexity to linear in parameter count (Chen, 2021).
  • Randomized K-FAC (RS-KFAC/SRE-KFAC) exploits the spectral decay of Kronecker factors (due to EMA construction) to use randomized SVD/EVD for low-rank inversion, reducing the cubic per-layer cost to quadratic, achieving up to 3× speed-ups over classic K-FAC at the same test accuracy (Puiu, 2022, Puiu, 2022).
  • Online SVD updates with Brand’s algorithm further reduce per-layer cost to $O(dr^2)$, enabling linear scaling in “layer size” under certain architectural and batch-size constraints (Puiu, 2022).
  • EKFAC replaces the Kronecker-product eigenvalue rescaling with a diagonal approximation in the Kronecker-factor eigenbasis, yielding better alignment with the true Fisher and consistently improved training loss and time-to-target on deep networks (George et al., 2018); a minimal sketch of this rescaling appears after this list.
  • Weight-sharing and generic architectures: K-FAC-expand and K-FAC-reduce address weight-shared layers such as self-attention and GNN layers, each variant being exact in the corresponding “expand” or “reduce” setting, and provide principled efficiency-versus-accuracy trade-offs for modern architectures (Eschenhagen et al., 2023, Dangel et al., 24 May 2024).
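
As referenced in the EKFAC item above, the eigenbasis rescaling can be sketched as follows; shapes, damping, and the synthetic per-example statistics are illustrative assumptions, not the reference implementation:

```python
# Hedged sketch of EKFAC-style rescaling: keep the Kronecker-factor
# eigenbases, but rescale gradient coordinates by their second moments
# estimated in that eigenbasis instead of by products of factor eigenvalues.
import numpy as np

rng = np.random.default_rng(2)
m, d_in, d_out = 256, 64, 32
gamma = 1e-3

a = rng.standard_normal((m, d_in))            # layer input activations
g = rng.standard_normal((m, d_out))           # pre-activation gradients
grad_W = g.T @ a / m                          # mini-batch gradient w.r.t. W

# Eigenbases of the Kronecker factors A = E[a a^T], G = E[g g^T].
_, U_A = np.linalg.eigh(a.T @ a / m)
_, U_G = np.linalg.eigh(g.T @ g / m)

# Per-example gradients rotated into the Kronecker eigenbasis; their
# coordinate-wise second moments replace the eigenvalue-product rescaling.
g_rot = g @ U_G                               # (m, d_out)
a_rot = a @ U_A                               # (m, d_in)
s = np.einsum('ni,nj->ij', g_rot**2, a_rot**2) / m   # (d_out, d_in) scalings

# Precondition: rotate into the eigenbasis, rescale elementwise, rotate back.
nat_grad = U_G @ ((U_G.T @ grad_W @ U_A) / (s + gamma)) @ U_A.T
```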

Additional enhancements such as mixed-precision storage, partial block-diagonalization for batch-norm layers, and tailored curvature modeling for continual learning and PINNs further increase tractability in specialized domains (Enkhbayar, 22 Nov 2024, Lee et al., 2020, Dangel et al., 24 May 2024).

4. Empirical Performance, Scalability, and Limitations

Empirical results demonstrate that K-FAC achieves a 2–4× reduction in iteration count compared to state-of-the-art SGD-based optimizers (with warmup, LARS, etc.), while preserving or exceeding their generalization at extreme batch sizes (up to 131,072); for example, on ResNet-50/ImageNet, 75% top-1 accuracy is reached in 35–100 epochs depending on batch size (Osawa et al., 2018, Pauloski et al., 2020).

Distributed K-FAC—when combined with stale factor updates, symmetry-aware reduction, and communication optimization—attains 15–25% wall-clock speedups on ImageNet-1k benchmarks versus SGD baselines at cluster scale (Pauloski et al., 2020, Osawa et al., 2018). On large models and transformer workloads, the KAISA system yields 18.1–36.3% faster convergence with identical final accuracy, and under memory constraints achieves up to 41.6% gains on BERT-Large (Pauloski et al., 2021).

Nevertheless, limitations are apparent. K-FAC’s wall-clock speedups plateau or degrade at critical batch sizes, often falling below those of state-of-the-art SGD for very large models and clusters; this is attributed to cubic per-layer costs, straggler imbalance, and communication overhead (Ma et al., 2019). Hyperparameter sensitivity (damping, update frequency) is higher than for SGD, complicating tuning (Ma et al., 2019). The original “second-order” semantics of K-FAC are not always realized: ablations show that K-FAC’s effectiveness stems as much from per-layer adaptive preconditioning as from true curvature modeling, and, under high damping, K-FAC behaves like a specialized first-order optimizer (“FOOF”) (Benzing, 2022).

5. Applications Across Architectures and Domains

K-FAC has been successfully applied to:

  • Convolutional neural networks: Kronecker Factors for Convolution (KFC) adapts K-FAC to conv layers using patch and output channel statistics; achieves 10–20× reduction in updates over SGD and enables training with ultra-large batches (Grosse et al., 2016, Pauloski et al., 2020).
  • Reinforcement learning: ACKTR applies K-FAC with trust-region policy losses in actor-critic settings, yielding 2–3× gains in sample efficiency on Atari and MuJoCo compared to PPO/TRPO (Wu et al., 2017).
  • Modern architectures: Weight-sharing extensions handle transformers, GNNs, and PINNs via “expand”/“reduce” settings, with wall-clock improvements of 30–50% over SGD for Wide-ResNet, ViT, and GNNs (Eschenhagen et al., 2023, Dangel et al., 24 May 2024).
  • Continual learning and batch-norm: Extended K-FAC incorporates inter-example dependencies due to batch-norm, mitigating catastrophic forgetting and improving accuracy in sequential task settings (Lee et al., 2020).
  • Finance (Deep Hedging): K-FAC with LSTM/sequence data drastically reduces transaction costs (by 78.3%) and P&L variance (by 34.4%) over Adam under realistic Heston models (Enkhbayar, 22 Nov 2024).

6. Algorithmic Invariance, Optimality, and Theoretical Guarantees

K-FAC can be interpreted as a natural gradient with respect to an independence metric on the parameter manifold, explicitly constructing a Riemannian metric whose geodesics correspond to the preconditioned step (Luk et al., 2018). This framework establishes K-FAC’s invariance to affine reparameterizations of each layer’s activations—including centering, whitening, or replacing activation nonlinearities—ensuring performance is robust to architectural transformations (Martens et al., 2015, Luk et al., 2018, Grosse et al., 2016).

Error in the Kronecker factorization is controlled by higher cumulants in the joint distribution of activations and derivatives; the assumption becomes better justified with large batches and near-Gaussian layer statistics (Martens et al., 2015, Grosse et al., 2016). Convergence guarantees for approximate natural gradient under K-FAC, including in the distributed preconditioning regime, carry over under standard smoothness and strong convexity assumptions. Theoretical works show that, at least locally, K-FAC converges linearly when these conditions hold (Zhang et al., 2022).
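
Written out in the notation of Section 1 (with the standard column-stacking vec convention), the approximation whose error these cumulants control is

$F_\ell = E\big[(a_{\ell-1} a_{\ell-1}^T) \otimes (g_\ell g_\ell^T)\big] \approx E\big[a_{\ell-1} a_{\ell-1}^T\big] \otimes E\big[g_\ell g_\ell^T\big] = A_{\ell-1} \otimes G_\ell,$

which holds with equality when the activations and backpropagated gradients are statistically independent.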

7. Schematic Algorithm and Complexity Overview

The core K-FAC update, in a canonical mini-batch setting, comprises the following steps:

| Step | Complexity per layer (fully connected) | Description |
|------|----------------------------------------|-------------|
| Forward/backward pass | $O(dm)$ | Collect activations and backpropagated gradients |
| Covariance update | $O(d^2 m)$ | Compute $A$, $G$ as batch or running averages |
| Inversion (every $K$ iterations) | $O(d^3)$ | Eigendecomposition or Cholesky for each factor |
| Precondition gradient | $O(d^3)$, or $O(d^2)$ (vec trick) | Transform gradients with the Kronecker factors |
| Update parameters | $O(d^2)$ | Apply the preconditioned gradient |

Here, $d$ is the typical layer width and $m$ the mini-batch size. For convolutional or weight-sharing architectures, analogous scaling in the corresponding effective dimensions applies (Martens et al., 2015, Grosse et al., 2016, Eschenhagen et al., 2023).
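
The schedule in the table can be sketched, for a single fully connected layer, as the following loop; the inversion frequency, hyperparameters, and synthetic data are assumptions chosen only to illustrate how the $O(d^3)$ inversion work is amortized across steps:

```python
# Hedged sketch of the per-step K-FAC schedule for one fully connected layer:
# Kronecker factors are refreshed every step via EMA, while the cubic-cost
# inversions are recomputed only every `inv_freq` steps.
import numpy as np

rng = np.random.default_rng(3)
m, d_in, d_out = 128, 64, 32
eta, gamma, beta, inv_freq = 0.01, 1e-3, 0.95, 20   # assumed hyperparameters

W = rng.standard_normal((d_out, d_in))
A_ema, G_ema = np.eye(d_in), np.eye(d_out)
A_inv, G_inv = np.eye(d_in), np.eye(d_out)

for step in range(100):
    a = rng.standard_normal((m, d_in))        # stand-in for forward-pass activations
    g = rng.standard_normal((m, d_out))       # stand-in for backpropagated gradients
    grad_W = g.T @ a / m                      # gradient of the loss w.r.t. W

    # Kronecker-factor (covariance) updates via exponential moving averages.
    A_ema = beta * A_ema + (1 - beta) * (a.T @ a / m)
    G_ema = beta * G_ema + (1 - beta) * (g.T @ g / m)

    # Cubic-cost inversions, amortized: recomputed only every inv_freq steps.
    if step % inv_freq == 0:
        A_inv = np.linalg.inv(A_ema + gamma * np.eye(d_in))
        G_inv = np.linalg.inv(G_ema + gamma * np.eye(d_out))

    # Precondition the gradient and apply the parameter update.
    W -= eta * (G_inv @ grad_W @ A_inv)
```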

In summary, Kronecker-Factored Approximate Curvature leverages principled second-order approximations—using the Kronecker product structure of per-layer Fisher blocks—to enable scalable, practical natural gradient optimization in deep learning. Its efficacy depends on architectural specifics, batch size, distributed system design, and the regime of training, but it remains a state-of-the-art method for rapid, robust convergence in a wide variety of large-scale learning tasks (Martens et al., 2015, Osawa et al., 2018, Eschenhagen et al., 2023, Pauloski et al., 2021, Pauloski et al., 2020, Enkhbayar, 22 Nov 2024).
