Fisher-Orthogonal Projection (FOP)

Updated 4 July 2026

FOP is a family of projection methods that use the Fisher information metric to isolate meaningful gradient directions from redundant or task-preserving ones.
It is applied in large-batch training, continual learning, and sim-to-real neural operator tuning to maintain essential model structure while adapting.
By enforcing Fisher-orthogonality, FOP improves update stability and performance compared to standard Euclidean gradient methods.

Fisher-Orthogonal Projection (FOP) denotes a family of information-geometric projection procedures in which gradients or parameter updates are constrained to be orthogonal, under a Fisher-induced metric, to directions deemed redundant, protected, or task-preserving. In the literature surveyed here, the term appears in three closely related roles: as the core mechanism of the large-batch natural-gradient method introduced in "Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches" (Lu et al., 19 Aug 2025), as the central projection principle of the continual-learning optimizer in "Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning" (Garg et al., 19 Jan 2026), and as the preservation mechanism used by PhysGuard for sim-to-real neural PDE surrogates (Zhou et al., 15 Jun 2026). Across these settings, FOP is used to preserve useful structure while still permitting adaptation or descent.

1. Conceptual scope and unifying idea

The common principle behind FOP is that not all gradient components should be treated equally. In standard first-order updates, directions are separated only by their Euclidean coordinates. FOP instead uses the Fisher information matrix to define which directions are meaningful, redundant, or dangerous to move along. In the large-batch setting, the projected object is a gradient-difference term extracted from two sub-batches; in continual learning, the projected object is the update for the current task relative to gradients associated with previous tasks; in PhysGuard, the protected directions are top Fisher eigendirections estimated from simulation data and removed from fine-tuning updates (Lu et al., 19 Aug 2025, Garg et al., 19 Jan 2026, Zhou et al., 15 Jun 2026).

Setting	Projected quantity	Reference or protected directions
Large-batch natural gradient	$g_{\rm diff}$	$g_{\rm avg}$ under the Fisher metric
Continual learning	Update $v$ or projected gradient	Previous-task gradients $G$ under $F_{\rm old}$
PhysGuard	Fine-tuning gradient $g$	Top Fisher eigendirections $U$

This shared geometry is important because each paper frames Euclidean orthogonality as insufficient for the relevant invariances. The continual-learning formulation explicitly contrasts Fisher-orthogonal constraints with methods that operate in Euclidean parameter space. The large-batch formulation uses Fisher-orthogonality to isolate intra-batch variability that is not already represented by the average gradient. PhysGuard uses Fisher-derived sensitive directions to preserve low-frequency physical structure during sim-to-real adaptation.

2. Information-geometric formulation

In the large-batch formulation, model parameters are $\theta \in \mathbb{R}^P$, the Fisher information matrix is

$F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$

and the Fisher metric is

$\langle u,v\rangle_F \;=\; u^\top F\,v.$
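The difference between Euclidean and Fisher orthogonality can be illustrated numerically. The following is a minimal sketch, assuming synthetic per-sample score vectors in place of real model scores; the helper names `empirical_fisher` and `fisher_inner` are hypothetical and do not come from any of the cited papers.

```python
import numpy as np

def empirical_fisher(scores: np.ndarray) -> np.ndarray:
    """Average outer product of per-sample score vectors stored as rows."""
    return scores.T @ scores / scores.shape[0]

def fisher_inner(u: np.ndarray, v: np.ndarray, F: np.ndarray) -> float:
    """Fisher inner product <u, v>_F = u^T F v."""
    return float(u @ F @ v)

# Synthetic stand-in for per-sample scores (an assumption for illustration):
# coordinate 0 is far more sensitive than coordinates 1 and 2.
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])
F = empirical_fisher(scores)

u = np.array([1.0, 1.0, 0.0])
v = np.array([1.0, -1.0, 0.0])

euclidean = float(u @ v)        # exactly zero: Euclidean-orthogonal
fisher = fisher_inner(u, v, F)  # equals F[0, 0] - F[1, 1], nonzero here
print(euclidean, fisher)
```

Two directions that look unrelated in parameter coordinates can still overlap strongly once the metric weights sensitive coordinates more heavily, which is the situation the Fisher-based projections are meant to handle.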

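A large mini-batch $B$ is split into two disjoint sub-batches $B_1$ and $B_2$, yielding

$g_1 \;=\; \nabla_\theta L_{B_1}(\theta), \qquad g_2 \;=\; \nabla_\theta L_{B_2}(\theta),$

with

$g_{\rm avg} \;=\; \tfrac12\,(g_1 + g_2), \qquad g_{\rm diff} \;=\; g_1 - g_2.$

The Fisher-orthogonal projection removes from $g_{\rm diff}$ any component already present in $g_{\rm avg}$: $g_{\rm diff}^{\perp} = g_{\rm diff} - \frac{\langle g_{\rm diff},\, g_{\rm avg}\rangle_F}{\langle g_{\rm avg},\, g_{\rm avg}\rangle_F}\, g_{\rm avg}$, so that $\langle g_{\rm diff}^{\perp},\, g_{\rm avg}\rangle_F = 0$. The corrected gradient is then

$g_{\rm comb} \;=\; g_{\rm avg} + \alpha\, g_{\rm diff}^{\perp},$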

with the adaptive weight

$g_{\rm avg}$ 9

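Using damping $\lambda$, the final update is

$\theta \;\leftarrow\; \theta - \eta\,(F + \lambda I)^{-1}\, g_{\rm comb},$

where $\eta$ is the learning rate and $g_{\rm comb}$ is the corrected gradient.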

This construction is presented as a variance-aware update direction that leverages gradients from two sub-batches and enhances the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric (Lu et al., 19 Aug 2025).
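The two-sub-batch construction can be sketched on a toy least-squares problem. This is an illustration under stated assumptions, not the authors' implementation: the gradient difference is taken as `g1 - g2`, the weight `alpha` is a fixed hypothetical constant because the adaptive-weight formula is not reproduced here, and the exact Fisher of a unit-variance Gaussian linear model (`X.T @ X / n`) stands in for a KFAC approximation. The function names are hypothetical.

```python
import numpy as np

def lsq_grad(X, y, theta):
    """Gradient of 0.5 * mean((X @ theta - y)**2) with respect to theta."""
    return X.T @ (X @ theta - y) / X.shape[0]

def fop_combined_gradient(g1, g2, F, alpha, eps=1e-12):
    """Average gradient plus alpha times the part of the gradient
    difference that is Fisher-orthogonal to the average."""
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2
    coeff = (g_diff @ F @ g_avg) / (g_avg @ F @ g_avg + eps)
    g_diff_perp = g_diff - coeff * g_avg
    return g_avg + alpha * g_diff_perp, g_avg, g_diff_perp

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=64)
theta = np.zeros(4)

# Split the batch into two disjoint halves and compute one gradient per half.
half = X.shape[0] // 2
g1 = lsq_grad(X[:half], y[:half], theta)
g2 = lsq_grad(X[half:], y[half:], theta)

# Fisher of a unit-variance Gaussian linear model (toy stand-in for KFAC).
F = X.T @ X / X.shape[0]

g_comb, g_avg, g_perp = fop_combined_gradient(g1, g2, F, alpha=0.1)

# Damped natural-gradient step on the combined gradient.
lr, damping = 1.0, 1e-2
theta_new = theta - lr * np.linalg.solve(F + damping * np.eye(4), g_comb)
print(g_avg @ F @ g_perp)  # close to zero: Fisher-orthogonal to the average
```

Setting `alpha=0.0` returns the plain average gradient, so in this sketch the correction is a strict add-on to an existing natural-gradient step.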

In continual learning, the formulation is explicitly task-sequential. Let $v$ 2, let $v$ 3 denote the loss of task $v$ 4, let $v$ 5, and let

$v$ 6

After the preceding tasks, collect their final gradients as columns of $G$ and denote by $F_{\rm old}$ the Fisher matrix estimated on previous-task data. The Fisher-orthogonal complement is the set of all parameter increments $v$ satisfying

$G^\top F_{\rm old}\, v \;=\; 0.$

To project an arbitrary vector $G$ 3, define

$G$ 4

compute

$G$ 5

and obtain a projected vector satisfying $G$ 6. The resulting projected natural-gradient update solves a constrained problem that maximizes progress on the new loss, stays within a trust-region in the new-task Fisher metric, and is Fisher-orthogonal to old tasks: $G$ 7 If the orthogonality constraint is dropped, $G$ 8 and the method recovers ordinary natural gradient; if $G$ 9 and $F_{\rm old}$ 0, it recovers Euclidean orthogonal-gradient descent (Garg et al., 19 Jan 2026).
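A generic projector onto the Fisher-orthogonal complement can be sketched as follows. It is the textbook oblique projection in the metric of a symmetric positive-definite matrix, offered as an assumption about the general shape of the computation rather than as the paper's exact formulas; `fisher_orthogonal_project`, the `ridge` parameter, and the synthetic `F_old` are hypothetical.

```python
import numpy as np

def fisher_orthogonal_project(u, G, F_old, ridge=1e-10):
    """Remove from u its F_old-projection onto the columns of G,
    so that G^T F_old (result) is approximately zero."""
    k = G.shape[1]
    A = G.T @ F_old @ G + ridge * np.eye(k)   # k x k Gram matrix in the F_old metric
    c = np.linalg.solve(A, G.T @ F_old @ u)   # coefficients of u along G
    return u - G @ c

rng = np.random.default_rng(1)
P, k = 10, 3
M = rng.normal(size=(20, P))
F_old = M.T @ M / 20             # synthetic symmetric positive-definite matrix
G = rng.normal(size=(P, k))      # stand-in for stored previous-task gradients
u = rng.normal(size=P)

u_proj = fisher_orthogonal_project(u, G, F_old)
residual = G.T @ F_old @ u_proj  # approximately the zero vector

# With F_old = I the same routine is the ordinary Euclidean projection
# onto the orthogonal complement of span(G).
u_euclid = fisher_orthogonal_project(u, G, np.eye(P))
print(np.abs(residual).max(), np.abs(G.T @ u_euclid).max())
```

The Euclidean special case corresponds to the reduction mentioned above, in which identity metrics recover orthogonal-gradient descent.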

3. Geometric properties and interpretation

The continual-learning formulation states three geometric properties. First, the update $F_{\rm old}$ 1 is reparameterization-invariant because all norms and projections are defined via the Fisher information. Second, it guarantees descent in the Fisher metric, with $F_{\rm old}$ 2. Third, by enforcing $F_{\rm old}$ 3, the second-order change in KL divergence on each previous task is zero. The exposition further states that this ensures that moving by such a $F_{\rm old}$ 4 causes, to second order, no change in the model’s output distributions on prior tasks (Garg et al., 19 Jan 2026).

The large-batch formulation emphasizes a different, though related, geometric role. It is motivated by the claim that very large batches suppress gradient noise and that high damping can wash out the curvature information in KFAC. FOP is introduced specifically to restore useful curvature at large batch sizes by injecting a variance-aware correction into the natural-gradient step, while projecting out redundant directions under the Fisher metric so that only genuinely new curvature information is added (Lu et al., 19 Aug 2025).

A recurrent misconception is to equate FOP with ordinary Euclidean orthogonalization. The primary formulations above do not do that: they define orthogonality with respect to the Fisher metric. PhysGuard presents a useful contrast. There, the protected subspace is first identified from the empirical Fisher eigenspectrum, but the online projector is written in Euclidean form,

$g \;\mapsto\; \big(I - U U^\top\big)\, g.$

The paper explicitly notes that one may view this under the Fisher-induced inner product similarly, but in practice $F_{\rm old}$ 6 under $F_{\rm old}$ 7 is equivalent once the columns of $F_{\rm old}$ 8 are Fisher eigenvectors scaled to unit $F_{\rm old}$ 9 norm (Zhou et al., 15 Jun 2026). This indicates that the decisive ingredient is often not the final algebraic appearance of the projector, but the Fisher-derived choice of the protected subspace.
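Why a Euclidean-looking projector can still respect the Fisher geometry is easy to check numerically. The sketch below assumes a synthetic symmetric positive semi-definite matrix in place of a real Fisher estimate and uses a hypothetical helper `project_out`. Because the columns of `U` are eigenvectors of `F`, a gradient that is Euclidean-orthogonal to `U` is automatically Fisher-orthogonal to it as well.

```python
import numpy as np

def project_out(g, U):
    """Euclidean projection of g onto the complement of span(U);
    U is assumed to have orthonormal columns."""
    return g - U @ (U.T @ g)

rng = np.random.default_rng(2)
M = rng.normal(size=(30, 6))
F = M.T @ M / 30                       # synthetic symmetric PSD "Fisher"

eigvals, eigvecs = np.linalg.eigh(F)   # eigenvalues in ascending order
U = eigvecs[:, -2:]                    # two leading eigendirections

g = rng.normal(size=6)
g_proj = project_out(g, U)

euclid_resid = U.T @ g_proj            # zero by construction
fisher_resid = U.T @ F @ g_proj        # also zero, since U^T F = diag(lambda) U^T
print(np.abs(euclid_resid).max(), np.abs(fisher_resid).max())
```

This equivalence relies on `U` spanning an eigenspace of the same matrix that defines the metric; for an arbitrary protected subspace the two notions of orthogonality would differ.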

4. Algorithms and computational profile

The continual-learning algorithm is presented with diagonal-Fisher approximations for both $g$ 0 and $g$ 1. The procedure is: initialize $g$ 2; train on task $g$ 3 normally; estimate diagonal Fisher $g$ 4; store last-task gradients $g$ 5; for each subsequent task, compute a diagonal $g$ 6 on a small random batch; for each minibatch compute $g$ 7, build

$g$ 8

solve $g$ 9 via $U$ 0 inversion, project

$U$ 1

set

$U$ 2

and update $U$ 3. After each task, update

$U$ 4

and append new gradients to $U$ 5 while keeping at most $U$ 6 columns. Here $U$ 7 is a small ridge for numerical stability, $U$ 8 averages the old-task Fisher, and $U$ 9 is folded into the learning rate $\theta \in \mathbb{R}^P$ 0. The cost of diagonal Fisher storage and multiplication is $\theta \in \mathbb{R}^P$ 1, storing $\theta \in \mathbb{R}^P$ 2 old gradients costs $\theta \in \mathbb{R}^P$ 3, forming $\theta \in \mathbb{R}^P$ 4 costs $\theta \in \mathbb{R}^P$ 5, inverting $\theta \in \mathbb{R}^P$ 6 costs $\theta \in \mathbb{R}^P$ 7, and the total per-batch extra cost is $\theta \in \mathbb{R}^P$ 8. The exposition states that in practice $\theta \in \mathbb{R}^P$ 9– $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 0 so $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 1 is negligible, and that a PreFisher variant stores $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 2 once per task to eliminate $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 3 in the inner loop (Garg et al., 19 Jan 2026).
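The cost profile described above follows from the fact that, with diagonal Fisher estimates, only a small square system whose size is the number of stored gradients has to be solved per minibatch. The sketch below is one way to solve the constrained problem from Section 2 through its stationarity conditions. It is not claimed to match the paper's exact update: the names `projected_diag_natural_step` and `append_gradients`, the `ridge` default, and the keep-most-recent truncation rule are all assumptions.

```python
import numpy as np

def projected_diag_natural_step(g, G, f_old, f_new, lr, ridge=1e-8):
    """Step v = -lr * F_new^{-1} (g - F_old G mu), with mu chosen so that
    G^T F_old v is approximately zero. f_old and f_new are diagonal Fisher
    estimates stored as positive vectors of length P."""
    W = f_old[:, None] * G                        # F_old G, shape (P, k)
    W_pre = W / f_new[:, None]                    # F_new^{-1} F_old G
    A = W.T @ W_pre + ridge * np.eye(G.shape[1])  # small k x k system
    mu = np.linalg.solve(A, W_pre.T @ g)
    return -lr * (g - W @ mu) / f_new

def append_gradients(G, new_cols, max_cols):
    """Append stored gradients, keeping only the most recent max_cols columns
    (the truncation rule is an assumption)."""
    return np.concatenate([G, new_cols], axis=1)[:, -max_cols:]

rng = np.random.default_rng(3)
P, k = 12, 3
f_old = rng.uniform(0.5, 2.0, size=P)   # synthetic diagonal Fisher, old tasks
f_new = rng.uniform(0.5, 2.0, size=P)   # synthetic diagonal Fisher, new task
G = rng.normal(size=(P, k))             # stand-in for stored old-task gradients
g = rng.normal(size=P)                  # current minibatch gradient

v = projected_diag_natural_step(g, G, f_old, f_new, lr=0.1)
constraint = G.T @ (f_old * v)          # approximately zero
descent = float(g @ v)                  # non-positive: a descent direction

G = append_gradients(G, rng.normal(size=(P, 2)), max_cols=4)
print(np.abs(constraint).max(), descent, G.shape)
```

All operations besides the small solve are elementwise or thin matrix products, which is consistent with the observation that the extra per-batch cost stays small when few gradients are stored.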

The large-batch algorithm is correspondingly direct if a KFAC pipeline already exists. Each full batch is split into two equal halves, two gradients are computed, the Fisher-metric projection is applied, and the combined gradient is passed through the existing KFAC preconditioner instead of the mean gradient. The pseudocode states that the method requires two forward-backward passes per update, so approximately $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 4 the cost of a single gradient step. KFAC factors cost $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 5 per layer, inversions cost $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 6 but are typically done every few steps, memory stores two gradient vectors plus KFAC factors and inverses, and distributed training uses two global gradients in parallel over disjoint GPU groups with dual all-reduces. The paper further states that on modern multi-GPU nodes such as $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 7MI300X, the method can sustain batches of $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 8 with no extra memory beyond standard KFAC (Lu et al., 19 Aug 2025).

These algorithmic descriptions show that FOP is not a monolithic implementation. It is a projection principle that can be instantiated with diagonal Fisher approximations, Kronecker-factored preconditioners, layer-wise subspace estimates, or offline Gram-matrix decompositions, depending on the regime.

5. Reported empirical behavior

The continual-learning and large-batch papers report distinct evaluation protocols, but both present FOP as competitive precisely where unprojected updates lose either memory or curvature information (Garg et al., 19 Jan 2026, Lu et al., 19 Aug 2025).

Setting	Benchmark	Reported outcome
Continual learning	Split-MNIST & Rotated-MNIST	FOPNG achieved $F \;=\; \mathbb E_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$ 9 vs OGD $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 0 and EWC $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 1
Continual learning	Split-CIFAR10	FOPNG $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 2 vs OGD $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 3, EWC $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 4
Continual learning	Split-CIFAR100	FOPNG $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 5 vs OGD $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 6, EWC $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 7
Continual learning	Permuted-MNIST	FOPNG performed slightly worse than EWC
Large-batch training	CIFAR-10, BS= $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 8	SGD needed $\langle u,v\rangle_F \;=\; u^\top F\,v.$ 9 s, KFAC $g_{\rm avg}$ 00 s, FOP $g_{\rm avg}$ 01 s
Large-batch training	CIFAR-10, BS= $g_{\rm avg}$ 02	only FOP reached $g_{\rm avg}$ 03 in $g_{\rm avg}$ 04 s
Large-batch training	CIFAR-10, BS= $g_{\rm avg}$ 05	FOP in $g_{\rm avg}$ 06 s
Large-batch training	ImageNet-100, BS= $g_{\rm avg}$ 07	KFAC $g_{\rm avg}$ 08 s, FOP $g_{\rm avg}$ 09 s
Large-batch training	ImageNet-1K, BS= $g_{\rm avg}$ 10	only FOP hit $g_{\rm avg}$ 11 in $g_{\rm avg}$ 12 min

For continual learning, the paper also states that performance is robust to the extra hyperparameters $g_{\rm avg}$ 13 and $g_{\rm avg}$ 14, with a good default of $g_{\rm avg}$ 15 and $g_{\rm avg}$ 16, while wall-clock overhead is $g_{\rm avg}$ 17– $g_{\rm avg}$ 18 above EWC/OGD and is dominated by extra Fisher-vector multiplies (Garg et al., 19 Jan 2026). The same section notes an important caveat: on Permuted-MNIST, the method is slightly worse than EWC, likely because the random permutations create highly OOD tasks; on more realistic tasks with gradual shifts it consistently outperforms.

For large-batch training, the evaluation spans CIFAR-10 with ResNet-18, ImageNet-100 with T2T-ViT, ImageNet-1K with ResNet-50, and long-tailed CIFAR with ResNet-32. The paper further reports that on ImageNet-100 at BS= $g_{\rm avg}$ 19, AdamW required $g_{\rm avg}$ 20 s, KFAC $g_{\rm avg}$ 21 s, and FOP $g_{\rm avg}$ 22 s, whereas at BS= $g_{\rm avg}$ 23 KFAC required $g_{\rm avg}$ 24 s and FOP $g_{\rm avg}$ 25 s. On ImageNet-1K at BS= $g_{\rm avg}$ 26, SGD required $g_{\rm avg}$ 27 min, KFAC $g_{\rm avg}$ 28 min, and FOP $g_{\rm avg}$ 29 min. On long-tailed CIFAR, FOP reduces Top-1 error by $g_{\rm avg}$ 30– $g_{\rm avg}$ 31 relative to strong baselines and by $g_{\rm avg}$ 32 vs KFAC (Lu et al., 19 Aug 2025).

6. Extensions and broader significance

PhysGuard extends the projection idea beyond optimization stability and continual learning into sim-to-real adaptation for neural operators. Starting from pretrained parameters $g_{\rm avg}$ 33 and per-sample simulation gradients

$g_{\rm avg}$ 34

it forms the empirical Fisher

$g_{\rm avg}$ 35

where $g_{\rm avg}$ 36 stacks the sample gradients row-wise. Because direct eigendecomposition of the $g_{\rm avg}$ 37 Fisher is infeasible when $g_{\rm avg}$ 38 is large, the paper uses the Gram matrix

$g_{\rm avg}$ 39

whose nonzero eigenvalues coincide with those of $g_{\rm avg}$ 40. In practice this is done per layer: build $g_{\rm avg}$ 41, form $g_{\rm avg}$ 42, eigendecompose it, and recover Fisher eigenvectors as

$g_{\rm avg}$ 43
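The Gram-matrix route to the Fisher eigenvectors can be sketched generically. The snippet assumes the empirical Fisher is the averaged outer product of per-sample gradients stacked as rows of a matrix called `J` here; that name, the function `fisher_eigs_via_gram`, and the normalisation are illustrative choices rather than the paper's notation.

```python
import numpy as np

def fisher_eigs_via_gram(J, rel_tol=1e-10):
    """Eigenpairs of F = J^T J / N obtained from the N x N Gram matrix
    K = J J^T / N, where J holds N per-sample gradients as rows."""
    N = J.shape[0]
    K = J @ J.T / N
    lam, W = np.linalg.eigh(K)            # ascending order
    order = np.argsort(lam)[::-1]         # reorder to descending
    lam, W = lam[order], W[:, order]
    keep = lam > rel_tol * lam[0]         # drop numerically zero modes
    lam, W = lam[keep], W[:, keep]
    U = J.T @ W / np.sqrt(N * lam)        # lift to parameter space, unit norm
    return lam, U

rng = np.random.default_rng(4)
N, P = 5, 50                              # few samples, many parameters
J = rng.normal(size=(N, P))
lam, U = fisher_eigs_via_gram(J)

F = J.T @ J / N                           # formed here only to verify the result
print(np.allclose(F @ U, U * lam), np.allclose(U.T @ U, np.eye(len(lam))))
```

Only the small Gram matrix is ever decomposed; the full parameter-space matrix is built in the example purely as a check and would be avoided in practice.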

Only a compact leading subspace is protected: choose the minimal $g_{\rm avg}$ 44 such that

$g_{\rm avg}$ 45

with $g_{\rm avg}$ 46 in all experiments, then define $g_{\rm avg}$ 47 and project each fine-tuning gradient as

$g \;\mapsto\; \big(I - U U^\top\big)\, g.$
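Subspace selection and the online projection can be sketched together. The example assumes that the selection criterion is a cumulative-eigenvalue fraction with a threshold `tau`; both the form of the criterion and the value used below are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def minimal_rank(eigvals, tau):
    """Smallest k such that the k largest eigenvalues carry at least
    a fraction tau of the total eigenvalue mass."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    frac = np.cumsum(lam) / lam.sum()
    k = int(np.searchsorted(frac, tau)) + 1
    return min(k, lam.size)

def protected_basis(F, tau):
    """Leading eigenvectors of a symmetric matrix F selected by minimal_rank."""
    eigvals, eigvecs = np.linalg.eigh(F)
    order = np.argsort(eigvals)[::-1]
    k = minimal_rank(eigvals, tau)
    return eigvecs[:, order[:k]]

def project_gradient(g, U):
    """Remove the component of g lying in span(U); U has orthonormal columns."""
    return g - U @ (U.T @ g)

# Toy diagonal "Fisher" with eigenvalue fractions 0.6, 0.9, 1.0.
F = np.diag([6.0, 3.0, 1.0])
U = protected_basis(F, tau=0.85)     # two directions are protected
g = np.array([1.0, 2.0, 3.0])
g_proj = project_gradient(g, U)      # only the third coordinate survives
print(U.shape, g_proj)
```

In a layer-wise setting the same three steps would run once per layer offline, leaving only `project_gradient` in the fine-tuning loop.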

The reported complexity is $g_{\rm avg}$ 49 for gradient collection, $g_{\rm avg}$ 50 for Gram construction, $g_{\rm avg}$ 51 for the SVD of $g_{\rm avg}$ 52, $g_{\rm avg}$ 53 for recovering $g_{\rm avg}$ 54, and $g_{\rm avg}$ 55 per layer for each online projection. The paper states that all steps fit on a single GPU and take minutes offline for $g_{\rm avg}$ 56, with only a few milliseconds of per-step overhead during fine-tuning (Zhou et al., 15 Jun 2026).

PhysGuard also gives a domain-specific interpretation of Fisher-sensitive directions. Its spectral probe experiment states that the dominant Fisher directions are strongly associated with low-frequency output structures, and that projecting away from them preserves large-scale physics while still permitting adaptation in the remaining nullspace. Under severe domain shift, the paper reports that it reduces low-frequency error by up to $g_{\rm avg}$ 57 compared to standard fine-tuning while maintaining adaptability (Zhou et al., 15 Jun 2026).

Taken together, these works suggest that FOP is best understood not as one optimizer but as a reusable information-geometric design pattern. In one regime it revives natural-gradient curvature at extreme batch size; in another it preserves prior task outputs during sequential learning; in another it protects physics-critical subspaces during sim-to-real transfer. A plausible implication is that the central research question is not whether projection should be used, but which Fisher-defined subspace should be protected or amplified for a given learning regime.