
Fisher-Guided Gradient Masking (FGGM)

Updated 2 February 2026
  • FGGM is a training strategy that leverages the diagonal Fisher Information Matrix to assess parameter sensitivity and selectively mask gradient updates.
  • It addresses the stability–plasticity trade-off and supports continual learning, parameter-efficient adaptation, sharpness-aware regularization, and machine unlearning.
  • Empirical studies show FGGM improves performance over baselines by reducing catastrophic forgetting and enhancing efficiency across diverse neural network architectures.

Fisher-Guided Gradient Masking (FGGM) refers to a class of training and adaptation strategies for neural networks that leverage the diagonal Fisher Information Matrix (FIM) to guide sparsification or masking of parameter updates. The core idea is to measure the importance of individual parameters by their Fisher information and use this importance to dynamically freeze, mask, or route gradient flow during fine-tuning or adaptation. FGGM has been developed and applied across continual learning, unlearning, parameter-efficient adaptation, and regularization of LLMs and vision architectures. The approach is motivated by the need to address stability–plasticity trade-offs, mitigate catastrophic forgetting, enable efficient adaptation, and facilitate data deletion with minimal loss of retained performance.

1. Theoretical Foundations and Motivation

FGGM exploits the fact that the Fisher information quantifies the sensitivity of a model’s likelihood to each parameter, measuring how much a small change in the parameter will affect the model’s predictions. The diagonal Fisher entry for parameter $\theta_i$ is given by:

F_{i,i} = \mathbb{E}_{(x,y)}\Bigl[\bigl(\partial_{\theta_i}\log p(y \mid x; \theta)\bigr)^2\Bigr]

In practice, this expectation is estimated as a sample average over the data. Large $F_{i,i}$ values indicate that $\theta_i$ is crucial for preserving the model’s performance on a given task or dataset; small values mark parameters whose change is likely less consequential. FGGM operates by constructing binary, and sometimes soft, masks indicating which parameters should be protected (frozen) or allowed to update, based on this per-parameter importance estimate. This principled approach stands in contrast to heuristic criteria such as magnitude-based masking, providing a theoretically justified mechanism for managing the stability–plasticity dilemma (Tan et al., 26 Jan 2026, Cao et al., 25 Nov 2025, Zhong et al., 2022, Liu et al., 2023).
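The estimator above can be sketched concretely. The following is a minimal numpy illustration for binary logistic regression, where the per-sample gradient of the log-likelihood is available in closed form; the model, data, and parameter values are all illustrative, not taken from the cited papers:

```python
import numpy as np

def diag_fisher(theta, X, y):
    """Monte Carlo estimate of the diagonal Fisher for binary logistic
    regression: average over samples of the squared per-parameter
    gradient of log p(y | x; theta)."""
    F = np.zeros_like(theta)
    for xj, yj in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xj @ theta))  # model probability of y = 1
        grad = (yj - p) * xj                   # d/dtheta of log p(yj | xj; theta)
        F += grad ** 2
    return F / len(X)

# Toy data drawn from the model itself (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ theta))).astype(float)

F_hat = diag_fisher(theta, X, y)  # one nonnegative importance score per parameter
```

Each entry of `F_hat` is an estimate of $F_{i,i}$; in FGGM these scores are the raw material for mask construction.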

2. Methodologies and Core Algorithms

Across applications, FGGM entails three key stages:

  1. Fisher Computation: The diagonal Fisher per parameter is computed, typically via a Monte Carlo average of squared gradients of the log-likelihood or loss for each parameter across a dataset relevant to the adaptation, forgetting, or fine-tuning task. For example, for data $(x_j, y_j)$ and parameter $\theta_i$:

\hat{F}_i = \frac{1}{M} \sum_{j=1}^{M} \left(\frac{\partial}{\partial \theta_i}\, \ell(f_\theta(x_j), y_j)\right)^2

  2. Thresholding and Mask Construction: The importance vector $\vec{I} = (\hat{F}_i)_{i=1}^{d}$ is thresholded. Two strategies are prevalent:
    • Quantile-based: parameters with importance above the $(1-\alpha)$ quantile are masked (frozen). A common value is $\alpha = 0.7$.
    • Moment-based: parameters above $\tau = \mu_I + \kappa\,\sigma_I$ are masked, with tunable $\kappa$. The binary mask $m_i$ is then

m_i = \begin{cases} 0 & I_i > \tau \\ 1 & \text{otherwise} \end{cases}

  3. Masked Gradient Step: During training, updates are applied only to parameters with $m_i = 1$. For gradients $g_i$, the masked update is

\Delta\theta_i = m_i \left(-\eta\, g_i\right)

This workflow is adapted in specialized contexts: for parameter-efficient tuning with only a few modules trainable (Cao et al., 25 Nov 2025), for generating sparse perturbations in optimization (e.g., Fisher-SAM) (Zhong et al., 2022), or to gate selective unlearning (Liu et al., 2023). FGGM pseudocode in these applications reflects module-level or per-parameter granularity, and mask update intervals (e.g., every 50–100 steps) are commonly introduced for computational efficiency.
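The thresholding and masked-update stages can be sketched end to end. In this minimal numpy sketch the quantile rule follows the text; the toy Fisher values, learning rate, and array sizes are illustrative:

```python
import numpy as np

def build_mask(fisher, alpha):
    """Quantile rule from the text: freeze (mask to 0) parameters whose
    Fisher importance lies above the (1 - alpha) quantile."""
    tau = np.quantile(fisher, 1.0 - alpha)
    return (fisher <= tau).astype(float)  # 1 = trainable, 0 = frozen

def masked_step(theta, grad, mask, lr=0.1):
    """Gradient update applied only where the mask is 1."""
    return theta - lr * mask * grad

# Toy importance scores: parameters 1 and 3 are the most important.
fisher = np.array([0.01, 0.5, 0.02, 0.9, 0.03])
mask = build_mask(fisher, alpha=0.4)

theta = np.zeros(5)
theta = masked_step(theta, np.ones(5), mask)  # frozen coordinates stay at 0
```

In practice the mask would be rebuilt only at the update interval mentioned above (e.g. every 50–100 steps) rather than every step, since recomputing the Fisher requires extra backward passes.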

3. Applications: Continual Learning, Adaptation, and Unlearning

FGGM has seen successful instantiations across several domains:

Continual Learning

FGGM addresses catastrophic forgetting in lifelong learning by protecting parameters critical to previous tasks while permitting plastic adaptation for new tasks. On benchmarks such as TRACE (comprising MMLU, BBH, TyDiQA, BoolQ, PIQA, GSM8K), FGGM demonstrably preserves past-task performance better than supervised fine-tuning (SFT) and magnitude-based methods (e.g., MIGU), yielding a 9.6% improvement in general (retained) capability over SFT and 4.4% over MIGU (Tan et al., 26 Jan 2026).

Parameter-efficient Adaptation

Within CrossEarth-Gate for remote sensing, FGGM guides adaptation by identifying and gating RS modules (spatial, semantic, frequency) with the highest Fisher importance, achieving parameter-efficiency and layer-specific specialization for foundation models. Top-$K$ modules by Fisher score are dynamically activated for gradient flow, outperforming static PEFT baselines on cross-domain semantic segmentation benchmarks (Cao et al., 25 Nov 2025).
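The module-level gating reduces to a top-$K$ selection over aggregated scores. A minimal sketch, assuming each module's per-parameter Fisher values have already been aggregated into a single score (module names here are illustrative, not the actual CrossEarth-Gate identifiers):

```python
# Hypothetical aggregated Fisher scores per module (illustrative values).
module_scores = {
    "spatial_3": 0.42,
    "semantic_1": 0.91,
    "freq_2": 0.15,
    "semantic_4": 0.67,
}

def top_k_modules(scores, k):
    """Activate gradient flow only for the k modules with the highest
    aggregated Fisher importance; all other modules stay frozen."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

active = top_k_modules(module_scores, k=2)  # {'semantic_1', 'semantic_4'}
```

During fine-tuning, only parameters belonging to modules in `active` would receive gradient updates, and the selection can be refreshed as Fisher scores drift across training.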

Sharpness-aware Regularization

FGGM (as “FSAM”) improves sharpness-aware minimization via Fisher-guided adversarial perturbation, focusing regularization on parameters with the greatest impact on generalization. This approach consistently improves results across GLUE, SuperGLUE, and generation tasks over vanilla SAM, especially under data-scarce regimes (Zhong et al., 2022).

Machine Unlearning

FGGM, as Fisher masking, is used for principled machine unlearning by constructing masks over the parameters identified as most important for the forgotten data (forget-set), so that forgetting targets exactly those parameters, minimizing information retained post-forgetting and reducing performance loss on data to be retained. This achieves nearly complete unlearning, driving forget-set accuracy to zero, while preserving remain-set accuracy and exhibiting high stability (Liu et al., 2023).
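A minimal sketch of the mask-and-zero mechanics, assuming a quantile-based selection of the most forget-relevant parameters (the fraction, weights, and Fisher values are illustrative; the zeroing variant is the one discussed under extensions below):

```python
import numpy as np

def fisher_unlearn_mask(fisher_forget, frac):
    """Select the given fraction of parameters with the highest Fisher
    importance on the forget-set; these carry the most information
    about the data to be erased."""
    tau = np.quantile(fisher_forget, 1.0 - frac)
    return fisher_forget >= tau  # True = parameter to erase

theta = np.array([0.5, -1.2, 2.0, 0.3])
fisher_forget = np.array([0.01, 0.8, 0.05, 0.02])  # Fisher on forget data only

erase = fisher_unlearn_mask(fisher_forget, frac=0.25)
theta_unlearned = np.where(erase, 0.0, theta)  # zero the forget-critical weights
```

After zeroing, a short fine-tuning pass on the remain-set (with the erased positions either free or re-masked) would recover retained performance.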

4. Quantitative Results and Empirical Insights

Experiments conducted on FGGM implementations reveal several robust findings:

| Domain/Task | FGGM Performance | Baselines | Relative Gain |
| --- | --- | --- | --- |
| LLM continual learning | 55.75% general (Qwen2-1.5B) | SFT 50.89%, MIGU 55.21% | +9.6% (SFT), +1.0% (MIGU) |
| RS domain generalization | +1.6 to +3.2 mIoU (CASID DG) | LoRA, AdaptFormer | +0.5 to +2.1 mIoU |
| NLU benchmarks | 81.19 (BERT-L dev) | Adam 79.35, SAM 79.85 | +1.34 (SAM) |
| Unlearning (CIFAR) | unlearn score ≈ 76.3 | Finetune ≈ 50 | +26.3 |

FGGM maintains a consistent trade-off: more aggressive masking (higher $\alpha$ or sparsity) enhances stability but impairs plasticity/flexibility. Model scaling further enhances FGGM's benefits, as demonstrated with Qwen2-7B, where FGGM outpaces MIGU by 1.91% in general capability and 2.11% in TRACE-OP (Tan et al., 26 Jan 2026). In unlearning, FGGM not only drives accuracy on the forget-set to zero but also minimizes performance degradation on the remain-set, and the unlearned model requires a similar number of epochs to relearn as full retraining, indicating completeness of unlearning (Liu et al., 2023).

Ablations indicate the importance of architectural aggregation (input-dimension aware mask construction) and of dynamic, data-driven mask updates. Removing dynamic selection or individual module types in CrossEarth-Gate degrades performance by up to 2.8 mIoU (Cao et al., 25 Nov 2025).

5. Theoretical Properties and Best Practices

The use of Fisher information as the importance metric is theoretically supported: it is the unique (unbiased, second-order) local metric of parameter sensitivity under mild regularity conditions, linked to the expected change in loss under infinitesimal parameter perturbations. This provides mathematical justification over magnitude-based or activation-based approaches (Tan et al., 26 Jan 2026, Cao et al., 25 Nov 2025).

FGGM preserves the convergence guarantees of the underlying optimizer (e.g., Adam, SAM), as the masking does not alter the overall convergence rate under bounded gradient assumptions (Zhong et al., 2022).

Best practices across domains include:

  • Computing the diagonal Fisher matrix per task (offline) or updating it online for efficiency.
  • Choosing masking hyperparameters ($\alpha$ for quantile-based, $\kappa$ for moment-based thresholds) to mediate the stability–plasticity trade-off; defaults of $\alpha = 0.7$ and $\kappa \in [0.5, 1.5]$ are robust.
  • Aggregating importance over functional units (input dimension for LLMs, module for PEFT) to ensure architectural relevance.
  • Employing hard thresholded masks for strict preservation needs and exploring soft masking for more gradual adaptation.
  • Selecting module-level masking intervals and batch sizes to balance computational cost and adaptation speed (Tan et al., 26 Jan 2026, Cao et al., 25 Nov 2025).
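The soft-masking variant mentioned above admits many concrete rules. One illustrative choice (an assumption for this sketch, not a rule taken from the cited papers) is to damp each update exponentially in its Fisher importance, so highly important parameters receive strongly attenuated but still nonzero updates:

```python
import numpy as np

def soft_mask(fisher, temperature=1.0):
    """Illustrative soft-masking rule: scale each parameter's update by
    exp(-F_i / T). T controls how sharply importance damps updates;
    T -> 0 approaches a hard mask, large T approaches no masking."""
    return np.exp(-fisher / temperature)

fisher = np.array([0.0, 1.0, 5.0])
m = soft_mask(fisher)  # monotonically decreasing in importance
```

The resulting `m` replaces the binary mask in the masked gradient step, trading strict preservation for more gradual adaptation.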

6. Extensions, Implementation Considerations, and Limitations

FGGM can be instantiated with both binary and soft masks and extended to groupwise, neuronwise, or continuous forms. In unlearning, zeroing masked weights may improve escape from pre-existing local minima. For domain adaptation, modular scoring enables specialization and interpretable adaptation paths, such as early semantic, mid spatial, and late frequency specialization in transformer architectures (Cao et al., 25 Nov 2025).

Limitations include the computational cost of Fisher estimation (requiring extra backward passes), potential access requirements to gradients on domains of interest (forget-set, remain-set), and the trade-off between preservation and adaptability—the more parameters masked, the greater the protection (or forgetting), at the cost of reduced adaptation capacity. Approximate low-rank or Hessian-vector methods, or differentially private noise, are plausible extensions (Liu et al., 2023).

A plausible implication is that as models increase in scale and continual/dynamic tasks become more prevalent, FGGM's combination of principled sparsity and adaptive protection will provide a preferred recipe for both continual learning and compliant unlearning.

FGGM is distinct in its explicit use of the Fisher Information for parameter selection:

  • Magnitude-based masking (e.g., MIGU): selects on activation magnitudes without theoretical sensitivity guarantees (Tan et al., 26 Jan 2026).
  • Optimization-based sparsification (e.g., vanilla SAM): applies uniform regularization, missing importance-based selectivity (Zhong et al., 2022).
  • EWC (Elastic Weight Consolidation): employs Fisher but typically assumes access to prior-task data; FGGM avoids this requirement for continual learning (Tan et al., 26 Jan 2026).
  • Parameter-efficient adaptation (PEFT) baselines: utilize static module selection; FGGM dynamically routes adaptation via Fisher scoring, yielding interpretability and efficiency (Cao et al., 25 Nov 2025).

Empirical comparisons indicate FGGM achieves state-of-the-art stability–plasticity and adaptation-forgetting-retention trade-offs across diverse benchmarks and architectures.
