Fisher-Orthogonal Projection (FOP)

Updated 4 July 2026
  • FOP is a family of projection methods that use the Fisher information metric to isolate meaningful gradient directions from redundant or task-preserving ones.
  • It is applied in large-batch training, continual learning, and sim-to-real neural operator tuning to maintain essential model structure while adapting.
  • By enforcing Fisher-orthogonality, FOP improves update stability and performance compared to standard Euclidean gradient methods.

Fisher-Orthogonal Projection (FOP) denotes a family of information-geometric projection procedures in which gradients or parameter updates are constrained to be orthogonal, under a Fisher-induced metric, to directions deemed redundant, protected, or task-preserving. In the literature provided here, the term appears in at least three closely related roles: as the core mechanism in the large-batch natural-gradient method "Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches" (Lu et al., 19 Aug 2025), as the central projection principle in the continual-learning optimizer "Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning" (Garg et al., 19 Jan 2026), and as the preservation mechanism used by PhysGuard for sim-to-real neural PDE surrogates (Zhou et al., 15 Jun 2026). Across these settings, FOP is used to preserve useful structure while still permitting adaptation or descent.

1. Conceptual scope and unifying idea

The common principle behind FOP is that not all gradient components should be treated equally. In standard first-order updates, directions are separated only by their Euclidean coordinates. FOP instead uses the Fisher information matrix to define which directions are meaningful, redundant, or dangerous to move along. In the large-batch setting, the projected object is a gradient-difference term extracted from two sub-batches; in continual learning, the projected object is the update for the current task relative to gradients associated with previous tasks; in PhysGuard, the protected directions are top Fisher eigendirections estimated from simulation data and removed from fine-tuning updates (Lu et al., 19 Aug 2025, Garg et al., 19 Jan 2026, Zhou et al., 15 Jun 2026).

| Setting | Projected quantity | Reference or protected directions |
|---|---|---|
| Large-batch natural gradient | $g_{\rm diff}$ | $g_{\rm avg}$ under the Fisher metric |
| Continual learning | Update $v$ or projected gradient | Previous-task gradients $G$ under $F_{\rm old}$ |
| PhysGuard | Fine-tuning gradient $g$ | Top Fisher eigendirections $U$ |

This shared geometry is important because each paper frames Euclidean orthogonality as insufficient for the relevant invariances. The continual-learning formulation explicitly contrasts Fisher-orthogonal constraints with methods that operate in Euclidean parameter space. The large-batch formulation uses Fisher-orthogonality to isolate intra-batch variability that is not already represented by the average gradient. PhysGuard uses Fisher-derived sensitive directions to preserve low-frequency physical structure during sim-to-real adaptation.

2. Information-geometric formulation

In the large-batch formulation, model parameters are $\theta \in \mathbb{R}^P$, the Fisher information matrix is

$$F \;=\; \mathbb{E}_{x\sim p_\text{data}}\Big[ \nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^\top\Big],$$

and the Fisher metric is

$$\langle u,v\rangle_F \;=\; u^\top F\,v.$$

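A large mini-batch $B$ is split into two disjoint sub-batches $B_1$ and $B_2$, yielding the sub-batch gradients

$$g_1 = \nabla_\theta L_{B_1}(\theta), \qquad g_2 = \nabla_\theta L_{B_2}(\theta),$$

with

$$g_{\rm avg} = \tfrac{1}{2}\,(g_1 + g_2), \qquad g_{\rm diff} = g_1 - g_2.$$

The Fisher-orthogonal projection removes from $g_{\rm diff}$ any component already present in $g_{\rm avg}$:

$$g_{\rm diff}^{\perp} = g_{\rm diff} - \frac{\langle g_{\rm diff},\, g_{\rm avg}\rangle_F}{\langle g_{\rm avg},\, g_{\rm avg}\rangle_F}\; g_{\rm avg},$$

so that $\langle g_{\rm diff}^{\perp},\, g_{\rm avg}\rangle_F = 0$. The corrected gradient is then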

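$$g_{\rm comb} = g_{\rm avg} + \alpha\, g_{\rm diff}^{\perp},$$

where $g_{\rm diff}^{\perp}$ is the Fisher-orthogonal component of the gradient difference, with the adaptive weight

$$\alpha = [...]$$

Using damping $\lambda$ and learning rate $\eta$, the final update is

$$\theta \leftarrow \theta - \eta\,(F + \lambda I)^{-1}\, g_{\rm comb}.$$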

This construction is presented as a variance-aware update direction that leverages gradients from two sub-batches and enhances the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric (Lu et al., 19 Aug 2025).
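
The projection step can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes $g_{\rm avg} = \tfrac12(g_1 + g_2)$ and $g_{\rm diff} = g_1 - g_2$, uses a small dense symmetric positive-definite matrix as a stand-in for the Fisher (a real pipeline would use a KFAC approximation), and treats the weight `alpha` as a fixed hypothetical constant in place of the paper's adaptive rule. All function names are illustrative.

```python
import numpy as np

def fisher_inner(u, v, F):
    """Fisher inner product <u, v>_F = u^T F v."""
    return float(u @ F @ v)

def fop_combined_gradient(g1, g2, F, alpha=1.0, eps=1e-12):
    """Combine two sub-batch gradients with a Fisher-orthogonal correction."""
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2
    # Coefficient of g_diff along g_avg, measured in the Fisher metric.
    coeff = fisher_inner(g_diff, g_avg, F) / (fisher_inner(g_avg, g_avg, F) + eps)
    g_perp = g_diff - coeff * g_avg          # Fisher-orthogonal to g_avg
    return g_avg + alpha * g_perp, g_perp, g_avg

def damped_natural_step(g, F, damping=1e-3):
    """Solve (F + damping * I) step = g for the preconditioned direction."""
    return np.linalg.solve(F + damping * np.eye(F.shape[0]), g)

rng = np.random.default_rng(0)
P = 5
A = rng.normal(size=(P, P))
F = A @ A.T / P + 0.1 * np.eye(P)            # SPD stand-in for the Fisher (assumption)
g1, g2 = rng.normal(size=P), rng.normal(size=P)

g_comb, g_perp, g_avg = fop_combined_gradient(g1, g2, F, alpha=0.5)
step = damped_natural_step(g_comb, F)        # would be used as theta -= lr * step
```

Because the correction is orthogonal to $g_{\rm avg}$ only in the Fisher metric, `g_perp @ g_avg` is generally nonzero while `fisher_inner(g_perp, g_avg, F)` vanishes up to rounding.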

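In continual learning, the formulation is explicitly task-sequential. Let $\theta \in \mathbb{R}^P$, let $L_t(\theta)$ denote the loss of task $t$, let $g = \nabla_\theta L_t(\theta)$, and let

$$F_{\rm new} = \mathbb{E}\big[\nabla_\theta \log p_\theta(y \mid x)\,\nabla_\theta \log p_\theta(y \mid x)^\top\big]$$

be the Fisher matrix estimated on current-task data. After tasks $1,\dots,t-1$, collect final gradients $g_1,\dots,g_k$ as columns of $G \in \mathbb{R}^{P\times k}$ and denote by $F_{\rm old}$ the Fisher matrix estimated on previous-task data. The Fisher-orthogonal complement is the set of all parameter increments $v$ satisfying

$$G^\top F_{\rm old}\, v = 0.$$

To project an arbitrary vector $u$, define

$$A = G^\top F_{\rm old}\, F_{\rm new}^{-1}\, F_{\rm old}\, G,$$

compute

$$c = A^{-1}\, G^\top F_{\rm old}\, F_{\rm new}^{-1}\, u,$$

and obtain a projected vector $Pu = u - F_{\rm old} G\, c$ satisfying $G^\top F_{\rm old} F_{\rm new}^{-1} Pu = 0$. The resulting projected natural-gradient update solves a constrained problem that maximizes progress on the new loss, stays within a trust-region in the new-task Fisher metric, and is Fisher-orthogonal to old tasks:

$$v^\star = \arg\min_{v}\; g^\top v \quad \text{s.t.}\quad v^\top F_{\rm new}\, v \le \epsilon^2, \quad G^\top F_{\rm old}\, v = 0.$$

If the orthogonality constraint is dropped, $v^\star \propto -F_{\rm new}^{-1} g$ and the method recovers ordinary natural gradient; if $F_{\rm new} = I$ and $F_{\rm old} = I$, it recovers Euclidean orthogonal-gradient descent (Garg et al., 19 Jan 2026).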
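
A minimal NumPy sketch of a projected natural-gradient direction follows. It is an illustration under stated assumptions, not the paper's code. Both Fishers are taken to be diagonal and strictly positive (stored as vectors). The direction comes from the Lagrangian conditions of minimizing $g^\top v$ subject to a trust region in the new-task Fisher metric and the constraint $G^\top F_{\rm old} v = 0$. The function name `fopng_direction` and the `ridge` parameter are hypothetical.

```python
import numpy as np

def fopng_direction(g, G, f_new, f_old, ridge=0.0):
    """Return an increment direction v with G^T diag(f_old) v = 0 (up to the ridge).

    g: (P,) new-task gradient; G: (P, k) stored old-task gradients;
    f_new, f_old: (P,) positive diagonals of the new- and old-task Fishers.
    """
    FG = f_old[:, None] * G                          # F_old G, shape (P, k)
    FinvFG = FG / f_new[:, None]                     # F_new^{-1} F_old G
    A = FG.T @ FinvFG + ridge * np.eye(G.shape[1])   # k x k system matrix
    c = np.linalg.solve(A, FinvFG.T @ g)             # constraint multipliers
    g_perp = g - FG @ c                              # projected gradient
    return -g_perp / f_new                           # preconditioned descent direction

rng = np.random.default_rng(0)
P, k = 50, 3
g = rng.normal(size=P)
G = rng.normal(size=(P, k))
f_new = rng.uniform(0.5, 2.0, size=P)                # synthetic diagonal Fishers
f_old = rng.uniform(0.5, 2.0, size=P)

v = fopng_direction(g, G, f_new, f_old)
constraint_residual = G.T @ (f_old * v)              # G^T F_old v, close to zero
v_euclid = fopng_direction(g, G, np.ones(P), np.ones(P))  # identity Fishers
```

With identity Fishers the same routine reduces to subtracting from $g$ its Euclidean projection onto the span of $G$, which is the orthogonal-gradient-descent special case described above.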

3. Geometric properties and interpretation

The continual-learning formulation states three geometric properties. First, the update $v$ is reparameterization-invariant because all norms and projections are defined via the Fisher information. Second, it guarantees descent in the Fisher metric. Third, by enforcing $G^\top F_{\rm old}\, v = 0$, the second-order change in KL divergence on each previous task is zero. The exposition further states that moving by such a $v$ causes, to second order, no change in the model's output distributions on prior tasks (Garg et al., 19 Jan 2026).

The large-batch formulation emphasizes a different, though related, geometric role. It is motivated by the claim that very large batches suppress gradient noise and that high damping can wash out the curvature information in KFAC. FOP is introduced specifically to restore useful curvature at large batch sizes by injecting a variance-aware correction into the natural-gradient step, while projecting out redundant directions under the Fisher metric so that only genuinely new curvature information is added (Lu et al., 19 Aug 2025).

A recurrent misconception is to equate FOP with ordinary Euclidean orthogonalization. The primary formulations above do not do that: they define orthogonality with respect to the Fisher metric. PhysGuard presents a useful contrast. There, the protected subspace is first identified from the empirical Fisher eigenspectrum, but the online projector is written in Euclidean form,

$$g_\perp = (I - U U^\top)\, g.$$

The paper explicitly notes that one may view this under the Fisher-induced inner product similarly, but in practice $I - UU^\top$ under the Euclidean inner product is equivalent once the columns of $U$ are Fisher eigenvectors scaled to unit $\ell_2$ norm (Zhou et al., 15 Jun 2026). This indicates that the decisive ingredient is often not the final algebraic appearance of the projector, but the Fisher-derived choice of the protected subspace.
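
The equivalence between the Euclidean and Fisher-metric projectors for an eigenvector basis can be checked numerically. The sketch below is illustrative only: it uses a random symmetric positive-definite matrix as a stand-in for the Fisher, and the helper names are hypothetical.

```python
import numpy as np

def euclidean_project_out(g, U):
    """Remove the span of U from g using the Euclidean inner product."""
    return g - U @ (U.T @ g)

def fisher_project_out(g, U, F):
    """Remove the span of U from g so the result is F-orthogonal to U."""
    coeffs = np.linalg.solve(U.T @ F @ U, U.T @ F @ g)
    return g - U @ coeffs

rng = np.random.default_rng(0)
P, k = 6, 2
A = rng.normal(size=(P, P))
F = A @ A.T + 0.5 * np.eye(P)                 # SPD stand-in for the Fisher (assumption)

eigvals, eigvecs = np.linalg.eigh(F)          # eigenvalues in ascending order
U = eigvecs[:, -k:]                           # top-k unit-norm Fisher eigenvectors
Q, _ = np.linalg.qr(rng.normal(size=(P, k)))  # orthonormal, but not eigenvectors

g = rng.normal(size=P)
same_for_eigvecs = np.allclose(euclidean_project_out(g, U),
                               fisher_project_out(g, U, F))
same_for_generic = np.allclose(euclidean_project_out(g, Q),
                               fisher_project_out(g, Q, F))
```

The two projectors agree for `U` because $U^\top F = \Lambda U^\top$ when the columns of $U$ are eigenvectors of $F$. For a generic orthonormal basis such as `Q` they differ, which is why the choice of subspace carries the Fisher information.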

4. Algorithms and computational profile

The continual-learning algorithm is presented with diagonal-Fisher approximations for both $F_{\rm new}$ and $F_{\rm old}$. The procedure is:

  • initialize $\theta$ and train on the first task normally;
  • estimate the diagonal Fisher $F_{\rm old}$ and store the last-task gradients as columns of $G$;
  • for each subsequent task, compute a diagonal $F_{\rm new}$ on a small random batch;
  • for each minibatch, compute the gradient $g$, build $A = G^\top F_{\rm old} F_{\rm new}^{-1} F_{\rm old} G + \lambda I$, solve $A c = G^\top F_{\rm old} F_{\rm new}^{-1} g$ via $k \times k$ inversion, project $g_\perp = g - F_{\rm old} G\, c$, set $v = -F_{\rm new}^{-1} g_\perp$, and update $\theta \leftarrow \theta + \eta\, v$;
  • after each task, update $F_{\rm old}$ with the new task's Fisher [...] and append new gradients to $G$ while keeping at most [...] columns.

Here $\lambda$ is a small ridge for numerical stability, [...] averages the old-task Fisher, and [...] is folded into the learning rate $\eta$. The cost of diagonal Fisher storage and multiplication is $O(P)$, storing $k$ old gradients costs $O(Pk)$, forming $A$ costs $O(Pk^2)$, inverting $A$ costs $O(k^3)$, and the total per-batch extra cost is $O(Pk^2 + k^3)$. The exposition states that in practice $k \approx$ [...]–[...] so $O(k^3)$ is negligible, and that a PreFisher variant stores [...] once per task to eliminate [...] in the inner loop (Garg et al., 19 Jan 2026).
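
The between-task bookkeeping can be sketched as follows. This is a hypothetical illustration, not the paper's procedure: the diagonal Fisher is estimated as the mean of squared per-sample gradients, the old-task Fisher is assumed to be a uniform running mean over finished tasks, and the gradient buffer is assumed to evict its oldest columns first. The class and function names are invented for the example.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal empirical Fisher: mean of squared per-sample gradients, shape (P,)."""
    return np.mean(per_sample_grads ** 2, axis=0)

class OldTaskMemory:
    """Stores old-task gradients (columns of G) and a running diagonal Fisher."""

    def __init__(self, num_params, max_columns):
        if max_columns < 1:
            raise ValueError("max_columns must be at least 1")
        self.G = np.zeros((num_params, 0))
        self.f_old = np.zeros(num_params)
        self.num_tasks = 0
        self.max_columns = max_columns

    def end_task(self, task_grads, f_task):
        """task_grads: (P, m) gradients saved at the end of the finished task."""
        self.num_tasks += 1
        # Assumed rule: uniform running mean of per-task diagonal Fishers.
        self.f_old += (f_task - self.f_old) / self.num_tasks
        # Assumed eviction policy: keep only the most recent columns.
        stacked = np.concatenate([self.G, task_grads], axis=1)
        self.G = stacked[:, -self.max_columns:]

rng = np.random.default_rng(0)
P = 10
memory = OldTaskMemory(num_params=P, max_columns=3)
f1 = diagonal_fisher(rng.normal(size=(32, P)))   # synthetic per-sample gradients
f2 = diagonal_fisher(rng.normal(size=(32, P)))
grads1 = rng.normal(size=(P, 2))
grads2 = rng.normal(size=(P, 2))
memory.end_task(grads1, f1)
memory.end_task(grads2, f2)
```

Capping the number of columns bounds both the $O(Pk)$ storage and the size of the $k \times k$ system solved in every minibatch.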

The large-batch algorithm is straightforward to add if a KFAC pipeline already exists. Each full batch is split into two equal halves, two gradients are computed, the Fisher-metric projection is applied, and the combined gradient is passed through the existing KFAC preconditioner in place of the mean gradient. The pseudocode states that the method requires two forward-backward passes per update, so approximately $2\times$ the cost of a single gradient step. KFAC factors cost [...] per layer, inversions cost [...] but are typically done every few steps, memory holds two gradient vectors plus KFAC factors and inverses, and distributed training computes the two global gradients in parallel over disjoint GPU groups with dual all-reduces. The paper further states that on modern multi-GPU nodes such as [...] MI300X, the method can sustain batches of [...] with no extra memory beyond standard KFAC (Lu et al., 19 Aug 2025).

These algorithmic descriptions show that FOP is not a monolithic implementation. It is a projection principle that can be instantiated with diagonal Fisher approximations, Kronecker-factored preconditioners, layer-wise subspace estimates, or offline Gram-matrix decompositions, depending on the regime.

5. Reported empirical behavior

The continual-learning and large-batch papers report distinct evaluation protocols, but both present FOP as competitive precisely where unprojected updates lose either memory or curvature information (Garg et al., 19 Jan 2026, Lu et al., 19 Aug 2025).

| Setting | Benchmark | Reported outcome |
|---|---|---|
| Continual learning | Split-MNIST & Rotated-MNIST | FOPNG achieved [...] vs OGD [...] and EWC [...] |
| Continual learning | Split-CIFAR10 | FOPNG [...] vs OGD [...], EWC [...] |
| Continual learning | Split-CIFAR100 | FOPNG [...] vs OGD [...], EWC [...] |
| Continual learning | Permuted-MNIST | FOPNG performed slightly worse than EWC |
| Large-batch training | CIFAR-10, BS=[...] | SGD needed [...] s, KFAC [...] s, FOP [...] s |
| Large-batch training | CIFAR-10, BS=[...] | only FOP reached [...] in [...] s |
| Large-batch training | CIFAR-10, BS=[...] | FOP in [...] s |
| Large-batch training | ImageNet-100, BS=[...] | KFAC [...] s, FOP [...] s |
| Large-batch training | ImageNet-1K, BS=[...] | only FOP hit [...] in [...] min |

For continual learning, the paper also states that performance is robust to the extra hyperparameters [...] and [...], with a good default of [...] and [...], while wall-clock overhead is [...]–[...] above EWC/OGD and is dominated by extra Fisher-vector multiplies (Garg et al., 19 Jan 2026). The same section notes an important caveat: on Permuted-MNIST, the method is slightly worse than EWC, likely because the random permutations create highly out-of-distribution tasks; on more realistic tasks with gradual shifts it consistently outperforms both baselines.

For large-batch training, the evaluation spans CIFAR-10 with ResNet-18, ImageNet-100 with T2T-ViT, ImageNet-1K with ResNet-50, and long-tailed CIFAR with ResNet-32. The paper further reports that on ImageNet-100 at BS=[...], AdamW required [...] s, KFAC [...] s, and FOP [...] s, whereas at BS=[...] KFAC required [...] s and FOP [...] s. On ImageNet-1K at BS=[...], SGD required [...] min, KFAC [...] min, and FOP [...] min. On long-tailed CIFAR, FOP reduces Top-1 error by [...]–[...] relative to strong baselines and by [...] vs KFAC (Lu et al., 19 Aug 2025).

6. Extensions and broader significance

PhysGuard extends the projection idea beyond optimization stability and continual learning into sim-to-real adaptation for neural operators. Starting from pretrained parameters $\theta_0$ and per-sample simulation gradients

$$g_i = \nabla_\theta\, \ell(x_i;\theta_0), \qquad i = 1,\dots,N,$$

it forms the empirical Fisher

$$\hat F = \frac{1}{N}\sum_{i=1}^{N} g_i g_i^\top = \frac{1}{N}\, J^\top J,$$

where $J \in \mathbb{R}^{N\times P}$ stacks the sample gradients row-wise. Because direct eigendecomposition of the $P \times P$ Fisher is infeasible when $P$ is large, the paper uses the Gram matrix

$$K = \frac{1}{N}\, J J^\top \in \mathbb{R}^{N\times N},$$

whose nonzero eigenvalues coincide with those of $\hat F$. In practice this is done per layer: build $J_\ell$, form $K_\ell$, eigendecompose it, and recover Fisher eigenvectors as

$$u_j = \frac{J_\ell^\top w_j}{\lVert J_\ell^\top w_j\rVert_2},$$

where $w_j$ is the $j$-th eigenvector of $K_\ell$ with eigenvalue $\lambda_j$. Only a compact leading subspace is protected: choose the minimal $k$ such that

$$\frac{\sum_{j\le k}\lambda_j}{\sum_{j}\lambda_j} \;\ge\; \tau,$$

with $\tau =$ [...] in all experiments, then define $U = [u_1,\dots,u_k]$ and project each fine-tuning gradient as

$$g \leftarrow g - U U^\top g.$$

With $N$ simulation samples, $P$ parameters per layer, and a protected subspace of dimension $k$, the reported complexity is [...] for gradient collection, $O(N^2 P)$ for Gram construction, $O(N^3)$ for the SVD of $K$, $O(NPk)$ for recovering $U$, and $O(Pk)$ per layer for each online projection. The paper states that all steps fit on a single GPU and take minutes offline for $N =$ [...], with only a few milliseconds of per-step overhead during fine-tuning (Zhou et al., 15 Jun 2026).
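
The Gram-matrix route to the leading Fisher eigenvectors can be sketched in NumPy. This is a minimal illustration under assumptions, not PhysGuard's code: the empirical Fisher is taken as $\tfrac1N J^\top J$ for a single layer, the subspace size is chosen by a cumulative-eigenvalue threshold, the value `energy=0.9` is a placeholder since the paper's threshold is not given here, and the function name is hypothetical.

```python
import numpy as np

def top_fisher_subspace(J, energy=0.9):
    """Leading eigenvectors of (1/N) J^T J via the N x N Gram matrix.

    J: (N, P) per-sample gradients stacked row-wise, with N much smaller than P.
    Returns U of shape (P, k) with orthonormal columns and the k eigenvalues.
    """
    N = J.shape[0]
    K = (J @ J.T) / N                          # Gram matrix, N x N
    lam, W = np.linalg.eigh(K)                 # ascending order
    lam, W = lam[::-1], W[:, ::-1]             # reorder to descending
    keep = lam > 1e-12 * lam[0]                # drop numerically zero modes
    lam, W = lam[keep], W[:, keep]
    ratio = np.cumsum(lam) / lam.sum()
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(lam))  # minimal k
    # J^T w has squared norm N * lam, so this scaling gives unit-norm columns.
    U = (J.T @ W[:, :k]) / np.sqrt(N * lam[:k])
    return U, lam[:k]

rng = np.random.default_rng(0)
N, P = 8, 40
J = rng.normal(size=(N, P))                    # synthetic per-sample gradients
U, lam = top_fisher_subspace(J, energy=0.9)

g = rng.normal(size=P)                         # a fine-tuning gradient
g_projected = g - U @ (U.T @ g)                # remove protected directions
```

Only the $N \times N$ matrix is ever eigendecomposed, so the $P \times P$ Fisher never has to be formed.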

PhysGuard also gives a domain-specific interpretation of Fisher-sensitive directions. Its spectral probe experiment states that the dominant Fisher directions are strongly associated with low-frequency output structures, and that projecting away from them preserves large-scale physics while still permitting adaptation in the remaining nullspace. Under severe domain shift, the paper reports that it reduces low-frequency error by up to [...] compared to standard fine-tuning while maintaining adaptability (Zhou et al., 15 Jun 2026).

Taken together, these works suggest that FOP is best understood not as one optimizer but as a reusable information-geometric design pattern. In one regime it revives natural-gradient curvature at extreme batch size; in another it preserves prior task outputs during sequential learning; in another it protects physics-critical subspaces during sim-to-real transfer. A plausible implication is that the central research question is not whether projection should be used, but which Fisher-defined subspace should be protected or amplified for a given learning regime.
