Fisher-Orthogonal Projection (FOP)
- FOP is a family of projection methods that use the Fisher information metric to isolate meaningful gradient directions from redundant or task-preserving ones.
- It is applied in large-batch training, continual learning, and sim-to-real neural operator tuning to maintain essential model structure while adapting.
- By enforcing Fisher-orthogonality, FOP improves update stability and performance compared to standard Euclidean gradient methods.
Fisher-Orthogonal Projection (FOP) denotes a family of information-geometric projection procedures in which gradients or parameter updates are constrained to be orthogonal, under a Fisher-induced metric, to directions deemed redundant, protected, or task-preserving. In the literature provided here, the term appears in at least three closely related roles: as the core mechanism in the large-batch natural-gradient method "Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches" (Lu et al., 19 Aug 2025), as the central projection principle in the continual-learning optimizer "Fisher-Orthogonal Projected Natural Gradient Descent for Continual Learning" (Garg et al., 19 Jan 2026), and as the preservation mechanism used by PhysGuard for sim-to-real neural PDE surrogates (Zhou et al., 15 Jun 2026). Across these settings, FOP is used to preserve useful structure while still permitting adaptation or descent.
1. Conceptual scope and unifying idea
The common principle behind FOP is that not all gradient components should be treated equally. In standard first-order updates, directions are separated only by their Euclidean coordinates. FOP instead uses the Fisher information matrix to define which directions are meaningful, redundant, or dangerous to move along. In the large-batch setting, the projected object is a gradient-difference term extracted from two sub-batches; in continual learning, the projected object is the update for the current task relative to gradients associated with previous tasks; in PhysGuard, the protected directions are top Fisher eigendirections estimated from simulation data and removed from fine-tuning updates (Lu et al., 19 Aug 2025, Garg et al., 19 Jan 2026, Zhou et al., 15 Jun 2026).
| Setting | Projected quantity | Reference or protected directions |
|---|---|---|
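| Large-batch natural gradient | Gradient difference $g_{\text{diff}}$ between two sub-batches | Average gradient $g_{\text{avg}}$, under the Fisher metric |
| Continual learning | Update or projected gradient | Previous-task gradients under $F_{\text{old}}$ |
| PhysGuard | Fine-tuning gradient | Top Fisher eigendirections $U_k$ |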
This shared geometry is important because each paper frames Euclidean orthogonality as insufficient for the relevant invariances. The continual-learning formulation explicitly contrasts Fisher-orthogonal constraints with methods that operate in Euclidean parameter space. The large-batch formulation uses Fisher-orthogonality to isolate intra-batch variability that is not already represented by the average gradient. PhysGuard uses Fisher-derived sensitive directions to preserve low-frequency physical structure during sim-to-real adaptation.
2. Information-geometric formulation
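In the large-batch formulation, model parameters are $\theta$, the Fisher information matrix is
$$F = \mathbb{E}\left[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top\right],$$
and the Fisher metric is
$$\langle u, v \rangle_F = u^\top F\, v.$$
A large mini-batch $\mathcal{B}$ is split into two disjoint sub-batches $\mathcal{B}_1, \mathcal{B}_2$, yielding
$$g_1 = \nabla_\theta L_{\mathcal{B}_1}(\theta), \qquad g_2 = \nabla_\theta L_{\mathcal{B}_2}(\theta),$$
with
$$g_{\text{avg}} = \tfrac{1}{2}(g_1 + g_2), \qquad g_{\text{diff}} = g_1 - g_2.$$
The Fisher-orthogonal projection removes from $g_{\text{diff}}$ any component already present in $g_{\text{avg}}$: $$g_{\text{diff}}^{\perp} = g_{\text{diff}} - \frac{\langle g_{\text{diff}}, g_{\text{avg}} \rangle_F}{\langle g_{\text{avg}}, g_{\text{avg}} \rangle_F}\, g_{\text{avg}},$$ so that $\langle g_{\text{diff}}^{\perp}, g_{\text{avg}} \rangle_F = 0$. The corrected gradient is then
$$g_{\text{comb}} = g_{\text{avg}} + \alpha\, g_{\text{diff}}^{\perp},$$
where $\alpha$ is an adaptive weight. Using damping $\lambda$ and step size $\eta$, the final update is
$$\theta \leftarrow \theta - \eta\, (F + \lambda I)^{-1} g_{\text{comb}}.$$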
This construction is presented as a variance-aware update direction that leverages gradients from two sub-batches and enhances the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric (Lu et al., 19 Aug 2025).
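The large-batch construction can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it uses an explicit dense Fisher matrix instead of KFAC factors, the function names are hypothetical, and the weight `alpha` is passed in as a fixed number because the adaptive rule is not reproduced here.

```python
import numpy as np

def fop_combined_gradient(g1, g2, fisher, alpha=0.1, eps=1e-12):
    """Average two sub-batch gradients and add a Fisher-orthogonal correction.

    alpha is a fixed illustrative weight (assumption), not the adaptive rule.
    """
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2
    # Fisher inner products <g_diff, g_avg>_F and <g_avg, g_avg>_F
    num = g_diff @ fisher @ g_avg
    den = g_avg @ fisher @ g_avg + eps
    # Remove from g_diff the part already represented by g_avg
    g_diff_perp = g_diff - (num / den) * g_avg
    return g_avg + alpha * g_diff_perp, g_avg, g_diff_perp

def damped_natural_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """One damped natural-gradient step: theta - lr * (F + damping*I)^{-1} grad."""
    precond = np.linalg.solve(fisher + damping * np.eye(len(theta)), grad)
    return theta - lr * precond

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
fisher = A @ A.T + 0.1 * np.eye(d)      # symmetric positive-definite stand-in
g1, g2 = rng.normal(size=d), rng.normal(size=d)

g_comb, g_avg, g_perp = fop_combined_gradient(g1, g2, fisher)
fisher_inner = g_perp @ fisher @ g_avg  # should vanish up to rounding
theta_new = damped_natural_step(np.zeros(d), g_comb, fisher)
```

In a real pipeline the dense solve would be replaced by the existing KFAC preconditioner; the dense matrix is used only so the orthogonality check is easy to verify.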
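In continual learning, the formulation is explicitly task-sequential. Let $\theta \in \mathbb{R}^d$, let $L_k(\theta)$ denote the loss of task $k$, let $g = \nabla_\theta L_k(\theta)$, and let
$$F_{\text{new}} = \mathbb{E}\left[\nabla_\theta \log p(y \mid x; \theta)\, \nabla_\theta \log p(y \mid x; \theta)^\top\right] \quad \text{(estimated on task-}k\text{ data)}.$$
After tasks $1, \dots, k-1$, collect final gradients $g_1, \dots, g_m$ as columns of $G \in \mathbb{R}^{d \times m}$ and denote by $F_{\text{old}}$ the Fisher matrix estimated on previous-task data. The Fisher-orthogonal complement is the set of all parameter increments $v$ satisfying
$$G^\top F_{\text{old}}\, v = 0.$$
To project an arbitrary vector $u$, define
$$M = G^\top F_{\text{old}}\, G,$$
compute
$$c = M^{-1} G^\top F_{\text{old}}\, u,$$
and obtain a projected vector $Pu = u - Gc$ satisfying $G^\top F_{\text{old}}\, Pu = 0$. The resulting projected natural-gradient update solves a constrained problem that maximizes progress on the new loss, stays within a trust-region in the new-task Fisher metric, and is Fisher-orthogonal to old tasks: $$\Delta\theta^\star = \arg\max_{v}\; -g^\top v \quad \text{s.t.} \quad v^\top F_{\text{new}}\, v \le \epsilon^2, \quad G^\top F_{\text{old}}\, v = 0.$$ If the orthogonality constraint is dropped, $\Delta\theta^\star \propto -F_{\text{new}}^{-1} g$ and the method recovers ordinary natural gradient; if $F_{\text{new}} = I$ and $F_{\text{old}} = I$, it recovers Euclidean orthogonal-gradient descent (Garg et al., 19 Jan 2026).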
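A small numerical sketch can make the Fisher-orthogonality constraint concrete. It assumes the projector that is orthogonal in the old-task Fisher metric, $Pu = u - G\,(G^\top F_{\text{old}} G)^{-1} G^\top F_{\text{old}}\, u$, which is one standard way to satisfy $G^\top F_{\text{old}}\, Pu = 0$; the function name, the dense Fisher matrix, and the tiny ridge term are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def fisher_orthogonal_project(u, G, F_old, ridge=1e-10):
    """Project u so that G^T F_old (result) is (numerically) zero.

    G     : (d, m) matrix whose columns are stored previous-task gradients
    F_old : (d, d) symmetric positive-definite Fisher estimate
    ridge : small stabilizer for the m x m solve (assumption)
    """
    m = G.shape[1]
    M = G.T @ F_old @ G + ridge * np.eye(m)
    c = np.linalg.solve(M, G.T @ (F_old @ u))
    return u - G @ c

rng = np.random.default_rng(0)
d, m = 8, 3
A = rng.normal(size=(d, d))
F_old = A @ A.T / d + 0.1 * np.eye(d)   # synthetic positive-definite Fisher
G = rng.normal(size=(d, m))
u = rng.normal(size=d)

v = fisher_orthogonal_project(u, G, F_old)
residual = G.T @ F_old @ v              # Fisher inner products with old gradients
```

Applying the projector a second time leaves `v` unchanged up to rounding, which is the defining property of a projection.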
3. Geometric properties and interpretation
The continual-learning formulation states three geometric properties. First, the update $\Delta\theta$ is reparameterization-invariant because all norms and projections are defined via the Fisher information. Second, it guarantees descent in the Fisher metric, with $g^\top \Delta\theta \le 0$. Third, by enforcing $G^\top F_{\text{old}}\, \Delta\theta = 0$, where the columns of $G$ are the stored previous-task gradients and $F_{\text{old}}$ is the previous-task Fisher matrix, the second-order change in KL divergence on each previous task is zero. The exposition further states that this ensures that moving by such a $\Delta\theta$ causes, to second order, no change in the model's output distributions on prior tasks (Garg et al., 19 Jan 2026).
The large-batch formulation emphasizes a different, though related, geometric role. It is motivated by the claim that very large batches suppress gradient noise and that high damping can wash out the curvature information in KFAC. FOP is introduced specifically to restore useful curvature at large batch sizes by injecting a variance-aware correction into the natural-gradient step, while projecting out redundant directions under the Fisher metric so that only genuinely new curvature information is added (Lu et al., 19 Aug 2025).
A recurrent misconception is to equate FOP with ordinary Euclidean orthogonalization. The primary formulations above do not do that: they define orthogonality with respect to the Fisher metric. PhysGuard presents a useful contrast. There, the protected subspace is first identified from the empirical Fisher eigenspectrum, but the online projector is written in Euclidean form,
$$g_{\perp} = (I - U_k U_k^\top)\, g.$$
The paper explicitly notes that one may view this under the Fisher-induced inner product similarly, but in practice $P = I - U_k U_k^\top$ under $\ell_2$ is equivalent once the columns of $U_k$ are Fisher eigenvectors scaled to unit $\ell_2$ norm (Zhou et al., 15 Jun 2026). This indicates that the decisive ingredient is often not the final algebraic appearance of the projector, but the Fisher-derived choice of the protected subspace.
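The equivalence can be checked directly: if the protected directions are eigenvectors of the Fisher matrix, a gradient that is Euclidean-orthogonal to them is also Fisher-orthogonal to them, because $u_i^\top F g = \lambda_i\, u_i^\top g$. The sketch below uses a synthetic positive semi-definite matrix as a stand-in for the Fisher matrix; the names `U_k` and `project_out` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
P, k = 10, 3
A = rng.normal(size=(P, P))
F = A @ A.T / P                       # synthetic symmetric PSD "Fisher" matrix

evals, evecs = np.linalg.eigh(F)      # eigenvalues returned in ascending order
U_k = evecs[:, -k:]                   # top-k eigenvectors, unit l2 norm

def project_out(g, U):
    """Euclidean projector (I - U U^T) g for U with orthonormal columns."""
    return g - U @ (U.T @ g)

g = rng.normal(size=P)
g_proj = project_out(g, U_k)

euclid_overlap = U_k.T @ g_proj       # Euclidean inner products with U_k
fisher_overlap = U_k.T @ F @ g_proj   # Fisher inner products with U_k
```

Both overlap vectors vanish up to rounding, so the simpler Euclidean projector can be used online without giving up Fisher-orthogonality to the protected eigendirections.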
4. Algorithms and computational profile
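The continual-learning algorithm is presented with diagonal-Fisher approximations for both $F_{\text{new}}$ and $F_{\text{old}}$. The procedure is: initialize $\theta$; train on task $1$ normally; estimate diagonal Fisher $F_{\text{old}}$; store last-task gradients $G$; for each subsequent task, compute a diagonal $F_{\text{new}}$ on a small random batch; for each minibatch compute $g$, build
$$A = G^\top F_{\text{old}}\, F_{\text{new}}^{-1} F_{\text{old}}\, G + \lambda I,$$
solve $A c = G^\top F_{\text{old}}\, F_{\text{new}}^{-1} g$ via $m \times m$ inversion, project
$$g_{\perp} = g - F_{\text{old}}\, G\, c,$$
set
$$v = F_{\text{new}}^{-1} g_{\perp},$$
and update $\theta \leftarrow \theta - \eta\, v$. After each task, update
$$F_{\text{old}} \leftarrow \beta\, F_{\text{old}} + (1 - \beta)\, F_{\text{new}},$$
and append new gradients to $G$ while keeping at most $m$ columns. Here $\lambda$ is a small ridge for numerical stability, $\beta$ averages the old-task Fisher, and $\epsilon$ is folded into the learning rate $\eta$. The cost of diagonal Fisher storage and multiplication is $O(d)$, storing $m$ old gradients costs $O(md)$, forming $A$ costs $O(m^2 d)$, inverting $A$ costs $O(m^3)$, and the total per-batch extra cost is $O(m^2 d + m^3)$. The exposition states that in practice $m$ is small, so the $O(m^3)$ term is negligible, and that a PreFisher variant stores $F_{\text{old}} G$ once per task to eliminate the $F_{\text{old}}$ multiplications in the inner loop (Garg et al., 19 Jan 2026).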
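The inner-loop step can be sketched with diagonal Fisher estimates stored as vectors. The sketch assumes the small system matrix is $G^\top F_{\text{old}} F_{\text{new}}^{-1} F_{\text{old}} G$ plus a ridge, which is what the trust-region problem with a Fisher-orthogonality constraint yields when solved in closed form; the function name `fopng_step`, the default values, and the synthetic data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fopng_step(theta, g, G, f_old, f_new, lr=0.1, ridge=1e-8):
    """One projected natural-gradient step with diagonal Fisher estimates.

    g     : (d,) current-task minibatch gradient
    G     : (d, m) stored previous-task gradients (columns)
    f_old : (d,) positive diagonal of the old-task Fisher
    f_new : (d,) positive diagonal of the new-task Fisher
    """
    B = f_old[:, None] * G                    # F_old G
    B_pre = B / f_new[:, None]                # F_new^{-1} F_old G
    A = B.T @ B_pre + ridge * np.eye(G.shape[1])   # m x m system
    c = np.linalg.solve(A, B_pre.T @ g)       # rhs is G^T F_old F_new^{-1} g
    g_perp = g - B @ c                        # remove protected component
    v = g_perp / f_new                        # natural-gradient preconditioning
    return theta - lr * v, v

rng = np.random.default_rng(0)
d, m = 8, 3
G = rng.normal(size=(d, m))
f_old = rng.uniform(0.5, 2.0, size=d)
f_new = rng.uniform(0.5, 2.0, size=d)
g = rng.normal(size=d)

theta_new, v = fopng_step(np.zeros(d), g, G, f_old, f_new)
old_task_overlap = G.T @ (f_old * v)          # G^T F_old v
descent = g @ v                               # positive means -v decreases the loss
```

Because every Fisher product is an elementwise multiply, the only dense work is forming and solving the $m \times m$ system, which matches the cost profile described above.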
The large-batch algorithm is correspondingly direct if a KFAC pipeline already exists. Each full batch is split into two equal halves, two gradients are computed, the Fisher-metric projection is applied, and the combined gradient is passed through the existing KFAC preconditioner instead of the mean gradient. The pseudocode states that the method requires two forward-backward passes per update, so approximately 4 the cost of a single gradient step. KFAC factors cost 5 per layer, inversions cost 6 but are typically done every few steps, memory stores two gradient vectors plus KFAC factors and inverses, and distributed training uses two global gradients in parallel over disjoint GPU groups with dual all-reduces. The paper further states that on modern multi-GPU nodes such as 7MI300X, the method can sustain batches of 8 with no extra memory beyond standard KFAC (Lu et al., 19 Aug 2025).
These algorithmic descriptions show that FOP is not a monolithic implementation. It is a projection principle that can be instantiated with diagonal Fisher approximations, Kronecker-factored preconditioners, layer-wise subspace estimates, or offline Gram-matrix decompositions, depending on the regime.
5. Reported empirical behavior
The continual-learning and large-batch papers report distinct evaluation protocols, but both present FOP as competitive precisely where unprojected updates lose either memory or curvature information (Garg et al., 19 Jan 2026, Lu et al., 19 Aug 2025).
| Setting | Benchmark | Reported outcome |
|---|---|---|
| Continual learning | Split-MNIST & Rotated-MNIST | FOPNG achieved 9 vs OGD 0 and EWC 1 |
| Continual learning | Split-CIFAR10 | FOPNG 2 vs OGD 3, EWC 4 |
| Continual learning | Split-CIFAR100 | FOPNG 5 vs OGD 6, EWC 7 |
| Continual learning | Permuted-MNIST | FOPNG performed slightly worse than EWC |
| Large-batch training | CIFAR-10, BS=8 | SGD needed 9 s, KFAC 00 s, FOP 01 s |
| Large-batch training | CIFAR-10, BS=02 | only FOP reached 03 in 04 s |
| Large-batch training | CIFAR-10, BS=05 | FOP in 06 s |
| Large-batch training | ImageNet-100, BS=07 | KFAC 08 s, FOP 09 s |
| Large-batch training | ImageNet-1K, BS=10 | only FOP hit 11 in 12 min |
For continual learning, the paper also states that performance is robust to the extra hyperparameters 13 and 14, with a good default of 15 and 16, while wall-clock overhead is 17–18 above EWC/OGD and is dominated by extra Fisher-vector multiplies (Garg et al., 19 Jan 2026). The same section notes an important caveat: on Permuted-MNIST, the method is slightly worse than EWC, likely because the random permutations create highly OOD tasks; on more realistic tasks with gradual shifts it consistently outperforms.
For large-batch training, the evaluation spans CIFAR-10 with ResNet-18, ImageNet-100 with T2T-ViT, ImageNet-1K with ResNet-50, and long-tailed CIFAR with ResNet-32. The paper further reports that on ImageNet-100 at BS=19, AdamW required 20 s, KFAC 21 s, and FOP 22 s, whereas at BS=23 KFAC required 24 s and FOP 25 s. On ImageNet-1K at BS=26, SGD required 27 min, KFAC 28 min, and FOP 29 min. On long-tailed CIFAR, FOP reduces Top-1 error by 30–31 relative to strong baselines and by 32 vs KFAC (Lu et al., 19 Aug 2025).
6. Extensions and broader significance
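PhysGuard extends the projection idea beyond optimization stability and continual learning into sim-to-real adaptation for neural operators. Starting from pretrained parameters $\theta_0$ and per-sample simulation gradients
$$g_i = \nabla_\theta\, \ell(x_i; \theta_0), \qquad i = 1, \dots, N,$$
it forms the empirical Fisher
$$F = \frac{1}{N} \sum_{i=1}^{N} g_i g_i^\top = \frac{1}{N} J^\top J,$$
where $J \in \mathbb{R}^{N \times P}$ stacks the sample gradients row-wise. Because direct eigendecomposition of the $P \times P$ Fisher is infeasible when $P$ is large, the paper uses the Gram matrix
$$K = \frac{1}{N} J J^\top \in \mathbb{R}^{N \times N},$$
whose nonzero eigenvalues coincide with those of $F$. In practice this is done per layer: build $J$, form $K$, eigendecompose it as $K v_i = \lambda_i v_i$, and recover Fisher eigenvectors as
$$u_i = \frac{J^\top v_i}{\sqrt{N \lambda_i}}.$$
Only a compact leading subspace is protected: choose the minimal $k$ such that
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i} \lambda_i} \ge \tau,$$
with a fixed threshold $\tau$ in all experiments, then define $U_k = [u_1, \dots, u_k]$ and project each fine-tuning gradient as
$$g_{\perp} = (I - U_k U_k^\top)\, g.$$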
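The Gram-matrix route to the protected subspace can be sketched as below. It assumes the empirical Fisher is $J^\top J / N$ with per-sample gradients stacked as rows of $J$, so that an eigenpair $(\lambda, v)$ of $J J^\top / N$ gives the Fisher eigenvector $J^\top v / \sqrt{N\lambda}$. The function name and the energy threshold `tau=0.9` are hypothetical; the value used in the paper is not reproduced here.

```python
import numpy as np

def fisher_subspace_from_gram(J, tau=0.9):
    """Leading Fisher eigenvectors from the N x N Gram matrix.

    J   : (N, P) per-sample gradients stacked row-wise
    tau : cumulative spectral-energy threshold (illustrative value)
    """
    N = J.shape[0]
    K = J @ J.T / N                           # Gram matrix, shares nonzero spectrum with F
    evals, V = np.linalg.eigh(K)              # ascending order
    evals, V = evals[::-1], V[:, ::-1]        # reorder to descending
    keep = evals > 1e-12 * evals[0]           # drop numerically zero modes
    evals, V = evals[keep], V[:, keep]
    energy = np.cumsum(evals) / evals.sum()
    k = min(int(np.searchsorted(energy, tau)) + 1, len(evals))  # minimal k with energy >= tau
    U = J.T @ V[:, :k] / np.sqrt(N * evals[:k])  # unit-norm Fisher eigenvectors
    return U, evals[:k]

rng = np.random.default_rng(0)
N, P = 6, 40
J = rng.normal(size=(N, P))                   # synthetic per-sample gradients
U, lam = fisher_subspace_from_gram(J, tau=0.9)

F = J.T @ J / N                               # explicit Fisher, feasible only for small P
g = rng.normal(size=P)
g_proj = g - U @ (U.T @ g)                    # online projection of a fine-tuning gradient
```

The explicit $P \times P$ matrix `F` is built only to verify the recovered eigenpairs on a toy problem; in the intended regime it would never be formed.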
The reported complexity is 49 for gradient collection, 50 for Gram construction, 51 for the SVD of 52, 53 for recovering 54, and 55 per layer for each online projection. The paper states that all steps fit on a single GPU and take minutes offline for 56, with only a few milliseconds of per-step overhead during fine-tuning (Zhou et al., 15 Jun 2026).
PhysGuard also gives a domain-specific interpretation of Fisher-sensitive directions. Its spectral probe experiment states that the dominant Fisher directions are strongly associated with low-frequency output structures, and that projecting away from them preserves large-scale physics while still permitting adaptation in the remaining nullspace. Under severe domain shift, the paper reports that it reduces low-frequency error by up to 57 compared to standard fine-tuning while maintaining adaptability (Zhou et al., 15 Jun 2026).
Taken together, these works suggest that FOP is best understood not as one optimizer but as a reusable information-geometric design pattern. In one regime it revives natural-gradient curvature at extreme batch size; in another it preserves prior task outputs during sequential learning; in another it protects physics-critical subspaces during sim-to-real transfer. A plausible implication is that the central research question is not whether projection should be used, but which Fisher-defined subspace should be protected or amplified for a given learning regime.