K-FAC: Scalable Second-Order Optimization
- K-FAC is a scalable second-order optimization algorithm that approximates the Fisher information matrix using block-diagonal Kronecker factorization to reduce computational costs.
- It efficiently preconditions gradients across various architectures such as feedforward, convolutional, recurrent, and weight-sharing networks through layerwise operations.
- Extensions including low-rank updates and distributed variants accelerate convergence, making K-FAC robust for large-scale deep learning tasks.
Kronecker-Factored Approximate Curvature (K-FAC) is a scalable second-order optimization algorithm designed for deep learning. It enables approximate natural-gradient descent by exploiting a blockwise Kronecker-factorization of the Fisher information or generalized Gauss–Newton matrix. By reducing matrix inversion and storage costs to layerwise operations on small factors, K-FAC enables efficient preconditioning of gradients for modern feedforward, convolutional, recurrent, and weight-sharing architectures. This framework has been widely extended (e.g., to distributed training, low-rank updates, continual learning, and domain-specific workloads such as PINNs and deep hedging), and forms the backbone of several state-of-the-art scalable optimizers.
1. Block-Diagonal Kronecker-Factored Fisher Approximation
In classical natural-gradient learning, the parameter update $\theta_{t+1} = \theta_t - \eta\, F^{-1} \nabla_\theta \mathcal{L}(\theta_t)$ requires the inverse of the Fisher information matrix $F = \mathbb{E}\big[\nabla_\theta \log p_\theta(y\mid x)\, \nabla_\theta \log p_\theta(y\mid x)^\top\big]$, whose size scales with the square of the number of parameters.
K-FAC introduces a twofold factorization for computational tractability (Martens et al., 2015):
- Block-diagonalization: The Fisher is approximated as block-diagonal across network layers, $F \approx \operatorname{diag}(F_1, \dots, F_L)$, neglecting cross-layer blocks.
- Kronecker factorization: Each block is further approximated by a Kronecker product, $F_\ell \approx A_{\ell-1} \otimes G_\ell$, with $A_{\ell-1} = \mathbb{E}\big[a_{\ell-1} a_{\ell-1}^\top\big]$ and $G_\ell = \mathbb{E}\big[g_\ell g_\ell^\top\big]$, where $a_{\ell-1}$ are the input activations to layer $\ell$ and $g_\ell$ are the backpropagated gradients at the layer outputs.
This structure enables computationally efficient inversion via $(A_{\ell-1} \otimes G_\ell)^{-1} = A_{\ell-1}^{-1} \otimes G_\ell^{-1}$: for a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, per-layer storage drops from $O(n_{\text{in}}^2 n_{\text{out}}^2)$ to $O(n_{\text{in}}^2 + n_{\text{out}}^2)$ and inversion from $O(n_{\text{in}}^3 n_{\text{out}}^3)$ to $O(n_{\text{in}}^3 + n_{\text{out}}^3)$, making large-scale curvature-aware optimization practical.
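As a concrete illustration, the following NumPy sketch estimates the two factors for a toy fully connected layer from randomly generated stand-in activations and output gradients, applies the Kronecker-factored solve, and checks it against the explicit (damped) block inverse. The sizes, damping value, and data are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected layer: n_in inputs, n_out outputs, minibatch of m examples.
n_in, n_out, m = 4, 3, 256

# Stand-ins for real network signals: layer-input activations a and
# backpropagated gradients g at the layer outputs.
a = rng.normal(size=(m, n_in))
g = rng.normal(size=(m, n_out))

# Kronecker factors: empirical second moments over the minibatch.
A = a.T @ a / m            # (n_in, n_in)
G = g.T @ g / m            # (n_out, n_out)

# Layer gradient dL/dW = E[g a^T], with W of shape (n_out, n_in).
grad_W = g.T @ a / m

# Tikhonov damping keeps the small factors invertible.
lam = 1e-3
A_damp = A + lam * np.eye(n_in)
G_damp = G + lam * np.eye(n_out)

# K-FAC preconditioning uses (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}).
precond_W = np.linalg.solve(G_damp, grad_W) @ np.linalg.inv(A_damp)

# Sanity check against the explicit Kronecker-structured block
# (only feasible at toy sizes); vec(.) is column-stacking.
vec = lambda M: M.reshape(-1, order="F")
naive = np.linalg.solve(np.kron(A_damp, G_damp), vec(grad_W))
assert np.allclose(naive, vec(precond_W))
print("Kronecker-factored solve matches the explicit block inverse.")
```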
2. Damped Natural-Gradient Update and Efficient Implementation
K-FAC employs Tikhonov damping for numerical stability, regularizing each factor before inversion. The approximate natural-gradient step for the weights $W_\ell$ is $\Delta W_\ell = -\eta\,\big(G_\ell + \pi_\ell^{-1}\sqrt{\lambda}\, I\big)^{-1}\, \nabla_{W_\ell}\mathcal{L}\; \big(A_{\ell-1} + \pi_\ell \sqrt{\lambda}\, I\big)^{-1}$, where the split $\pi_\ell$ of the damping parameter $\lambda$ between the two factors is determined heuristically (e.g., from the ratio of the factors' average traces). Factor estimates are formed as exponentially weighted moving averages over minibatches. The cost of each update is dominated by the eigendecomposition and inversion of the Kronecker factors, which are small matrices compared to the full Hessian or Fisher block.
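A minimal sketch of this per-layer bookkeeping is given below, assuming a plain fully connected layer; the class name `KFACLayerState`, the EMA decay, and the damping constant are illustrative choices rather than the hyperparameters of any published implementation.

```python
import numpy as np

class KFACLayerState:
    """Minimal per-layer K-FAC bookkeeping for a fully connected layer (sketch)."""

    def __init__(self, n_in, n_out, ema_decay=0.95, damping=1e-2):
        # Factors start at zero; call update_factors() at least once
        # before precondition().
        self.A = np.zeros((n_in, n_in))    # EMA of input second moments
        self.G = np.zeros((n_out, n_out))  # EMA of output-gradient second moments
        self.decay = ema_decay
        self.damping = damping

    def update_factors(self, a, g):
        """a: (m, n_in) input activations, g: (m, n_out) output gradients."""
        m = a.shape[0]
        self.A = self.decay * self.A + (1 - self.decay) * (a.T @ a / m)
        self.G = self.decay * self.G + (1 - self.decay) * (g.T @ g / m)

    def precondition(self, grad_W):
        """Return the damped Kronecker-factored direction
        (G + pi^-1 sqrt(lam) I)^{-1} grad_W (A + pi sqrt(lam) I)^{-1}."""
        n_in, n_out = self.A.shape[0], self.G.shape[0]
        # Heuristic split of the damping between the two factors
        # (trace-ratio rule; bias/homogeneous coordinate ignored here).
        pi = np.sqrt((np.trace(self.A) / n_in) / (np.trace(self.G) / n_out))
        sqrt_lam = np.sqrt(self.damping)
        A_damp = self.A + pi * sqrt_lam * np.eye(n_in)
        G_damp = self.G + (sqrt_lam / pi) * np.eye(n_out)
        return np.linalg.solve(G_damp, grad_W) @ np.linalg.inv(A_damp)
```

In use, `update_factors` would be called with fresh minibatch statistics (possibly only every few iterations, amortizing the cost) and the returned direction applied with a learning rate.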
3. Extensions: Recurrent, Convolutional, Weight-Sharing, and BatchNorm
K-FAC generalizes to a range of architectures:
- Convolutional layers: The KFC extension (Kronecker Factors for Convolution) builds the activation factor over input-channel/filter-patch statistics and the gradient factor over output channels, respecting spatial structure (Grosse et al., 2016); see the sketch after this list.
- Recurrent layers: A blockwise Kronecker factorization is formed across time-unfolded Jacobians, maintaining scalability (Luk et al., 2018, Enkhbayar, 22 Nov 2024).
- Weight-sharing and modern architectures: K-FAC's "expand" and "reduce" variants address core layers in transformers and GNNs, handling shared weights in attention, graph convolution, and other modules (Eschenhagen et al., 2023, Dangel et al., 24 May 2024).
- BatchNorm and continual learning: Extended K-FAC (XK-FAC) accounts for inter-example curvature induced by mini-batch statistics and supports affine parameter merging, essential for learning in BatchNorm-equipped and non-stationary scenarios (Lee et al., 2020).
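To make the convolutional case concrete, here is a minimal sketch of KFC-style factor construction via explicit patch extraction (im2col), assuming stride 1, 'valid' padding, and no bias/homogeneous coordinate; the helper names and normalization are illustrative, and real implementations differ in their scaling conventions.

```python
import numpy as np

def im2col(x, kh, kw):
    """Extract (kh x kw) patches from x of shape (m, C, H, W); stride 1,
    'valid' padding. Returns (m * H_out * W_out, C * kh * kw)."""
    m, C, H, W = x.shape
    H_out, W_out = H - kh + 1, W - kw + 1
    cols = np.empty((m, H_out, W_out, C * kh * kw))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, :, i:i + kh, j:j + kw]          # (m, C, kh, kw)
            cols[:, i, j, :] = patch.reshape(m, -1)
    return cols.reshape(-1, C * kh * kw)

def kfc_factors(x, dy, kh, kw):
    """KFC-style Kronecker factors for a conv layer (sketch).

    x:  (m, C_in, H, W) layer inputs
    dy: (m, C_out, H_out, W_out) gradients w.r.t. layer outputs
    The activation factor is formed over im2col patches (spatial locations
    treated as extra 'examples'); the gradient factor is over output channels.
    Normalization conventions vary across implementations.
    """
    patches = im2col(x, kh, kw)                           # (m*T, C_in*kh*kw)
    A = patches.T @ patches / patches.shape[0]
    m, C_out = dy.shape[0], dy.shape[1]
    dy_flat = dy.transpose(0, 2, 3, 1).reshape(-1, C_out) # (m*T, C_out)
    G = dy_flat.T @ dy_flat / m
    return A, G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 3, 16, 16))
    dy = rng.normal(size=(8, 5, 14, 14))   # matches a 3x3 'valid' conv output
    A, G = kfc_factors(x, dy, kh=3, kw=3)
    print(A.shape, G.shape)                # (27, 27) (5, 5)
```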
4. Distributed, Memory- and Communication-Optimized K-FAC
The need for multi-GPU scalability has driven innovations in distributed K-FAC:
- Traditional D-KFAC: Each worker computes local Kronecker factors and inverts them, with allreduce stages for aggregation, leading to high computation and communication costs (Pauloski et al., 2020, Osawa et al., 2018).
- Smart parallelism: SPD-KFAC pipelines computation and factor communication, fuses small tensors, and load-balances matrix inversion to minimize stragglers and network latency (Shi et al., 2021).
- Adaptive frameworks: KAISA exposes a grad_worker_frac knob to interpolate between memory- and communication-optimal regimes, balancing eigenbasis cache and broadcasts (Pauloski et al., 2021).
- Distributed preconditioning: DP-KFAC assigns each Kronecker factor to a unique worker, eliminating factor communication while preserving theoretical convergence properties, and reduces computation, memory, and communication by factors of $1.5\times$ or more over previous D-KFAC methods (Zhang et al., 2022); a load-balancing sketch follows this list.
- Empirically, these frameworks achieve wall-clock reductions of $10\%$ or more relative to state-of-the-art baselines, without loss in convergence.
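The core idea behind distributed preconditioning can be sketched without a communication library: assign each Kronecker factor to exactly one worker while balancing the estimated inversion cost. The function name and the cubic cost model below are assumptions for illustration, not the actual DP-KFAC scheduling logic.

```python
import numpy as np

def assign_factors_to_workers(factor_dims, n_workers):
    """Greedy load-balanced assignment of Kronecker factors to workers (sketch).

    factor_dims: list of factor sizes n (one entry per Kronecker factor);
    inversion cost is modelled as n**3. Each factor is inverted on exactly one
    worker, in the spirit of distributed preconditioning: factors themselves
    are never communicated, only the much smaller preconditioned gradients.
    Returns owner[i], the worker index responsible for factor i.
    """
    costs = sorted(((n ** 3, i) for i, n in enumerate(factor_dims)), reverse=True)
    load = [0.0] * n_workers
    owner = [None] * len(factor_dims)
    for cost, i in costs:               # largest factors first
        w = int(np.argmin(load))        # least-loaded worker so far
        owner[i] = w
        load[w] += cost
    return owner

# Example: factor sizes from a small conv net, spread over 4 workers.
dims = [64, 64, 128, 128, 256, 256, 512, 512, 1000]
print(assign_factors_to_workers(dims, n_workers=4))
```

In an actual multi-GPU run, each worker would then invert and cache only the factors it owns and contribute its preconditioned gradients to the usual gradient reduction.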
5. Accelerated and Low-Rank K-FAC Variants
Several approaches further accelerate K-FAC's core routines:
- Iterative K-FAC (CG-FAC): Uses conjugate gradient with Kronecker-matrix-free matrix-vector products, eliminating explicit storage and inversion of the factors. For large parameter dimensions, this yields much lower time and memory complexity than direct inversion (Chen, 2021).
- Randomized SVD/EVD (RS-KFAC): Exploits the rapid spectral decay of the EMA Kronecker factors, using randomized low-rank decompositions to cut per-layer inversion from cubic cost to roughly $O(n^2 r)$ for retained rank $r$. Empirical results show speed-ups both per epoch and in time to target accuracy compared to standard K-FAC (Puiu, 2022); see the sketch after this list.
- Online/Brand updates: Incremental low-rank updates further bring factor inversion and application down to nearly linear time in layer size on fully connected layers, offering additional trade-offs between accuracy and speed (Puiu, 2022).
- Kronecker-factored eigenbasis (EKFAC): Tracks full diagonal variances in the Kronecker eigenbasis rather than enforcing the Kronecker product form, providing a provably better Frobenius-norm approximation to the Fisher at a small extra cost (George et al., 2018). This improves convergence especially when curvature factors drift between infrequent decompositions.
- First-order variants (FOOF): K-FAC’s empirical success is linked to its alignment with a first-order optimizer on neuron pre-activations, with FOOF discarding the G-factor inverse and achieving comparable or faster convergence (Benzing, 2022).
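As an illustration of the randomized low-rank route mentioned above, the sketch below builds a rank-$r$ eigendecomposition of a symmetric factor with a randomized range finder and uses it to apply a damped inverse; the function names, oversampling amount, and toy spectrum are assumptions, not the published RS-KFAC recipe.

```python
import numpy as np

def randomized_evd(S, rank, n_oversample=8, rng=None):
    """Randomized low-rank eigendecomposition of a symmetric PSD matrix S
    (randomized range finder + small dense eigendecomposition)."""
    rng = np.random.default_rng(rng)
    n = S.shape[0]
    k = min(n, rank + n_oversample)
    Omega = rng.normal(size=(n, k))
    Q, _ = np.linalg.qr(S @ Omega)            # orthonormal range approximation
    B = Q.T @ S @ Q                           # small (k x k) projected matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:rank]      # keep the top `rank` directions
    return evals[idx], Q @ evecs[:, idx]

def low_rank_damped_inverse_apply(S, V_mat, rank, damping):
    """Apply (S + damping*I)^{-1} to the columns of V_mat using a rank-`rank`
    approximation S ≈ U diag(s) U^T:
        (S + λI)^{-1} ≈ U diag(1/(s+λ)) U^T + (1/λ)(I - U U^T).
    """
    s, U = randomized_evd(S, rank)
    coeff = U.T @ V_mat
    inv_top = U @ (coeff / (s[:, None] + damping))
    inv_rest = (V_mat - U @ coeff) / damping
    return inv_top + inv_rest

# Toy check on a factor with fast spectral decay (as EMA K-FAC factors have).
rng = np.random.default_rng(0)
n, r = 200, 20
X = rng.normal(size=(n, n)) * (0.9 ** np.arange(n))   # decaying column scales
A = X @ X.T / n
g = rng.normal(size=(n, 1))
approx = low_rank_damped_inverse_apply(A, g, rank=r, damping=1e-2)
exact = np.linalg.solve(A + 1e-2 * np.eye(n), g)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Because EMA factors typically have rapidly decaying spectra, modest ranks already capture most of the curvature, which is what brings the per-layer cost down from cubic to roughly $O(n^2 r)$.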
6. Empirical Performance, Applications, and Limitations
K-FAC and its distributed variants consistently accelerate convergence on large-scale supervised, reinforcement learning, and physics-informed tasks:
- Supervised learning: On ResNet-50/ImageNet-1k, K-FAC and its distributed variants routinely match MLPerf reference accuracy while cutting wall-clock training time by $18\%$ or more relative to SGD as batch size and GPU count grow (Pauloski et al., 2020, Osawa et al., 2018, Pauloski et al., 2021).
- Reinforcement learning: ACKTR leverages K-FAC in a trust-region natural policy gradient, achieving roughly $2$--$3\times$ better sample efficiency and superior stability in both discrete and continuous control domains (Wu et al., 2017).
- Physics-informed learning: PINN-tailored K-FAC variants using Taylor-mode automatic differentiation converge faster than first-order and quasi-Newton methods and scale to network sizes well beyond the reach of traditional second-order PINN optimizers (Dangel et al., 24 May 2024).
- Financial modeling: In deep hedging for sequential risk management, K-FAC with LSTM architectures reduces transaction costs and P&L variance relative to Adam, with a substantial improvement in Sharpe ratio (Enkhbayar, 22 Nov 2024).
- Modern architectures: Properly tuned K-FAC ("expand"/"reduce") delivers reductions of $50\%$ or more in iteration count and wall-clock time compared to first-order optimizers across vision transformers and graph networks (Eschenhagen et al., 2023).
However, large-batch scalability is fundamentally constrained: diminishing returns set in at critical batch sizes nearly identical to those of SGD, and K-FAC requires more sensitive hyperparameter tuning (learning rate, damping, and update frequency) at scale (Ma et al., 2019). For ultra-large batches, first-order methods (with additional tricks, e.g., LARS or LAMB) or analytic preconditioners may be preferable.
7. Theoretical Properties, Invariances, and Open Problems
K-FAC is constructed as a Riemannian natural gradient with respect to an independence-metric, and thus is provably invariant to affine transformations of hidden activations, both in fully connected and convolutional/recurrent networks (Luk et al., 2018, Grosse et al., 2016). This invariance guarantees that whitening, centering, or scaling activations does not alter the optimizer trajectory—an important robustness property.
The Kronecker factorization is exact in deep linear/weight-sharing toy models but introduces eigenbasis and spectral mismatch in nonlinear regimes. Recent analyses (e.g., on influence functions) demonstrate that the largest source of error in K-FAC is not block-diagonalization, but the spectral discrepancy induced by enforcing separable Kronecker structure (Hong et al., 27 Sep 2025). Hybrid and low-rank corrections (e.g., EKFAC, ASTRA) offer avenues to alleviate this, and future directions include adaptive block granularity, dynamic switching between block-diagonal and full GGN, and tighter integration with scalable Laplace-based Bayesian methods.
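To make the source of this discrepancy concrete, the exact layer block and the K-FAC surrogate differ only in where the expectation is taken (standard K-FAC algebra; $\operatorname{vec}$ denotes column-stacking):

$$
F_\ell \;=\; \mathbb{E}\!\left[\operatorname{vec}\!\big(g_\ell a_{\ell-1}^\top\big)\operatorname{vec}\!\big(g_\ell a_{\ell-1}^\top\big)^{\!\top}\right]
\;=\; \mathbb{E}\!\left[\big(a_{\ell-1} a_{\ell-1}^\top\big)\otimes\big(g_\ell g_\ell^\top\big)\right]
\;\approx\; \mathbb{E}\!\left[a_{\ell-1} a_{\ell-1}^\top\right]\otimes\mathbb{E}\!\left[g_\ell g_\ell^\top\right]
\;=\; A_{\ell-1}\otimes G_\ell .
$$

The final step is exact only under the assumption that activations and backpropagated gradients are statistically independent; the eigenbasis and spectral mismatch it introduces in nonlinear networks is precisely what EKFAC-style rescaling and hybrid corrections aim to repair.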
Summary Table: Core K-FAC Operations and Complexity
| Step | Standard K-FAC | CG-FAC / RS-KFAC / Brand Updates |
|---|---|---|
| Factor estimation | $O(n^2 m)$ per layer (EMA over minibatches) | same, or incremental low-rank accumulation (Brand) |
| Inversion (per factor) | $O(n^3)$ | matrix-free CG solves (CG-FAC); $O(n^2 r)$ randomized low-rank (RS-KFAC) |
| Memory per factor | $O(n^2)$ | $O(n^2)$, or $O(n r)$ with low-rank storage |
| Communication (distributed) | $O(n^2)$ factor allreduce per layer | eliminated for factors via per-factor worker assignment (Zhang et al., 2022) |
Key: $n$ is the layer width, $m$ the minibatch size, and $r$ the retained rank; the worker count enters only through the distributed communication row.
K-FAC remains a cornerstone of scalable second-order optimization in deep learning, with a spectrum of algorithmic and theoretical extensions enabling robust preconditioning across architectures, data modalities, and distributed systems. Ongoing research is focused on relaxing the Kronecker-product restrictions, improving eigen-spectral fidelity, and further reducing the routine's computational bottlenecks.