Fisher–Taylor Sensitivity in Neural Networks

Updated 20 May 2026

Fisher–Taylor Sensitivity (FTS) is a metric that quantifies the influence of neural network parameters through Taylor series and Fisher information, bridging pruning and quantization.
FTS provides a unifying framework that connects classical pruning methods like Optimal Brain Damage with modern activation saliency and second-order quantization strategies.
FTS has demonstrated state-of-the-art performance in high-sparsity regimes, offering practical guidelines for effective model pruning at initialization.

Fisher–Taylor Sensitivity (FTS) is a quantitative criterion for measuring the structural importance of neural network parameters. It arises from the first- and second-order Taylor expansions of the loss function and is mathematically equivalent to the diagonal of the Fisher Information Matrix under specified conditions. FTS has been developed as a unifying theoretical foundation for several methodologies in model pruning and post-training quantization, including activation-aware criteria and Hessian- or Fisher-based channel importance metrics. By precisely characterizing the expected impact of parameter or channel perturbations on model loss, FTS bridges classical pruning approaches (e.g., Optimal Brain Damage), contemporary activation saliency measures, and second-order quantization assignment strategies (Xu, 15 Jan 2026, Navarrete et al., 17 Feb 2025).

1. Mathematical Definition and Derivation

Fisher–Taylor Sensitivity for a parameter or weight channel is derived from the Taylor series expansion of the model loss. For a model with weights $W$ and loss

$\mathcal{L}(W) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(f_W(x), y)],$

a small perturbation $\Delta W$ leads to the first-order approximation $\mathcal{L}(W + \Delta W) \approx \mathcal{L}(W) + \langle \nabla_W \mathcal{L}(W), \Delta W \rangle.$ Specializing to a single linear layer with layer output $y = W\,a(x)$ and activations $a(x) \in \mathbb{R}^{d_{\rm in}}$, the gradient of the loss with respect to $W$ is

$\nabla_W \mathcal{L} = \mathbb{E}_{x}[G(x)\, a(x)^\top],$

where $G(x) = \frac{\partial \ell}{\partial y} \in \mathbb{R}^{d_{\rm out}}$ is the downstream gradient. Perturbing the $j$-th column of $W$ gives a first-order loss change governed by

$\Delta \mathcal{L}_j \approx \langle \mathbb{E}_{x}[G(x)\, a_j(x)],\, \Delta W_{:,j} \rangle.$

The channel-wise sensitivity is then defined as

$s = \mathbb{E}_{x}\big[\|G(x)\|_2^2\, \big(a(x) \odot a(x)\big)\big], \qquad s_j = \mathbb{E}_{x}\big[\|G(x)\|_2^2\, a_j(x)^2\big],$

where $\odot$ denotes element-wise multiplication. This serves as a first-order metric for the expected squared effect of unit perturbations to a given channel (Xu, 15 Jan 2026).
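As a rough illustration, the channel-wise sensitivity can be estimated by Monte Carlo from paired samples of activations and downstream gradients. The sketch below assumes the sensitivity takes the form $s_j = \mathbb{E}_x[\|G(x)\|_2^2\, a_j(x)^2]$; the function name `channel_sensitivity`, the synthetic Gaussian data, and the per-channel `scales` are hypothetical and not taken from the cited papers.

```python
import random

def channel_sensitivity(G_samples, a_samples):
    """Monte Carlo estimate of s_j = E[ ||G(x)||^2 * a_j(x)^2 ] per input channel j."""
    n = len(a_samples)
    d_in = len(a_samples[0])
    s = [0.0] * d_in
    for G, a in zip(G_samples, a_samples):
        g_sq = sum(g * g for g in G)  # squared norm of the downstream gradient
        for j in range(d_in):
            s[j] += g_sq * a[j] * a[j] / n
    return s

random.seed(0)
n, d_in, d_out = 2000, 4, 3
scales = [0.1, 1.0, 2.0, 0.5]  # hypothetical activation scale of each channel
a_samples = [[random.gauss(0, sc) for sc in scales] for _ in range(n)]
G_samples = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(n)]

s = channel_sensitivity(G_samples, a_samples)
# Channels ordered from most to least sensitive
ranking = sorted(range(d_in), key=lambda j: s[j], reverse=True)
print(ranking)
```

Because the synthetic gradients here are independent of the activations, the ranking simply follows the activation scales; with real calibration data the gradient term can reorder channels.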

2. Relationship to the Fisher Information Matrix

For probabilistic models, the Fisher Information Matrix (FIM) is given by

$F(W) = \mathbb{E}_{x,\, y \sim p_W(\cdot \mid x)}\big[\nabla_W \log p_W(y \mid x)\, \nabla_W \log p_W(y \mid x)^\top\big].$

With loss $\ell = -\log p_W(y \mid x)$, the FIM reduces to

$F(W) = \mathbb{E}\big[\nabla_W \ell\; \nabla_W \ell^\top\big],$

and the diagonal entry for parameter $W_{ij}$ is

$F_{ij,ij} = \mathbb{E}\big[(\partial \ell / \partial W_{ij})^2\big].$

Given $\partial \ell / \partial W_{ij} = G_i(x)\, a_j(x)$ for a linear layer, the channel-wise FTS $s_j = \mathbb{E}_{x}[\|G(x)\|_2^2\, a_j(x)^2]$ equals $\sum_i F_{ij,ij}$. Thus, FTS is the diagonal Fisher information for that channel, summed over the output index, connecting first-order Taylor sensitivity, the Fisher Information, and data-dependent curvature (Xu, 15 Jan 2026, Navarrete et al., 17 Feb 2025).
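The identity between the channel-wise sensitivity and the summed Fisher diagonal can be checked numerically. The following is a minimal sketch assuming a single linear layer, where the per-sample gradient with respect to $W_{ij}$ is $G_i(x)\,a_j(x)$; the random data and variable names are hypothetical.

```python
import random

random.seed(1)
n, d_in, d_out = 500, 3, 2
a_samples = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
G_samples = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(n)]
pairs = list(zip(G_samples, a_samples))

# Empirical Fisher diagonal: mean squared per-sample gradient of each W_ij
fisher_diag = [[sum((G[i] * a[j]) ** 2 for G, a in pairs) / n
                for j in range(d_in)]
               for i in range(d_out)]

# Channel-wise sensitivity: mean of ||G||^2 * a_j^2
s = [sum(sum(g * g for g in G) * a[j] ** 2 for G, a in pairs) / n
     for j in range(d_in)]

# Summing the Fisher diagonal over output rows i should reproduce s_j
col_sums = [sum(fisher_diag[i][j] for i in range(d_out)) for j in range(d_in)]
max_gap = max(abs(c - v) for c, v in zip(col_sums, s))
print(max_gap)
```

The gap is at floating-point rounding level, since $\sum_i G_i^2 a_j^2 = \|G\|_2^2\, a_j^2$ holds sample by sample.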

3. FTS in Pruning and Post-Training Quantization Methodologies

The FTS metric underpins a variety of pruning and quantization methods via approximations to the channel-wise sensitivity $s_j = \mathbb{E}_{x}[\|G(x)\|_2^2\, a_j(x)^2]$:

AWQ (Activation-aware Weight Quantization) assumes isotropic downstream gradients, leading to an activation-magnitude proxy:

$s_j \propto \mathbb{E}_{x}[a_j(x)^2].$

This recapitulates the rationale of AWQ-type heuristics for channel importance.

GPTQ (Second-Order Quantization) retains input covariance but assumes uniform weighting, so

$H = \mathbb{E}_{x}[a(x)\, a(x)^\top] \approx \tfrac{1}{n} X X^\top,$

with $X \in \mathbb{R}^{d_{\rm in} \times n}$ as the activation matrix over $n$ calibration samples, matching GPTQ’s covariance-based quantization assignments.

Both paradigms are revealed as special cases or approximations of the full FTS metric under specific assumptions. The full Fisher–Taylor sensitivity is recovered only when the activation–gradient covariance structure is retained rather than approximated away (Xu, 15 Jan 2026).
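A small constructed example may help show when the activation-only proxies and the full sensitivity disagree. The sketch below assumes the forms $s_j = \mathbb{E}[\|G\|_2^2 a_j^2]$ for FTS, $\mathbb{E}[a_j^2]$ for the AWQ-style proxy, and the diagonal of $\tfrac{1}{n} X X^\top$ for the GPTQ-style statistic. The four hand-picked samples are hypothetical and chosen so that the downstream gradient is large only when the low-magnitude channel is active.

```python
# Each row is one calibration sample: activations a(x) and downstream gradient G(x)
a_samples = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
G_samples = [[0.1, 0.0], [3.0, 0.0], [0.1, 0.0], [3.0, 0.0]]
n = len(a_samples)
d_in = len(a_samples[0])

# Full channel-wise sensitivity: keeps the activation-gradient coupling
fts = [sum(sum(g * g for g in G) * a[j] ** 2
           for G, a in zip(G_samples, a_samples)) / n
       for j in range(d_in)]

# AWQ-style proxy: second moment of the activations only
awq_proxy = [sum(a[j] ** 2 for a in a_samples) / n for j in range(d_in)]

# GPTQ-style statistic: input covariance (1/n) X X^T, here read off its diagonal
cov = [[sum(a[j] * a[k] for a in a_samples) / n for k in range(d_in)]
       for j in range(d_in)]
gptq_diag = [cov[j][j] for j in range(d_in)]

def top_channel(values):
    """Index of the channel with the largest score."""
    return max(range(len(values)), key=lambda j: values[j])

print(top_channel(fts), top_channel(awq_proxy), top_channel(gptq_diag))
```

Here both activation-only statistics favor channel 0, while the full sensitivity favors channel 1, because the gradient weighting is exactly the information those proxies discard.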

4. FTS as a Unifying Framework for Structural Importance Measures

FTS provides conceptual unity across several approaches:

Gradient-Norm (Saliency) Pruning: The classical approach considers the squared gradient norm, corresponding to the first-order Taylor term alone.
Optimal Brain Damage/Surgeon (OBD/OBS): These Hessian-based methods use a second-order loss expansion, sometimes replacing the Hessian with diagonal or low-rank approximations.
Fisher-Based Criteria: These substitute the empirical FIM (often diagonalized) for the Hessian, yielding computationally tractable structural metrics.

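All are approximations to the full expansion for expected loss under parameter changes: $\mathbb{E}[\Delta \mathcal{L}] \approx \langle \nabla_W \mathcal{L}, \Delta W \rangle + \tfrac{1}{2}\, \mathrm{vec}(\Delta W)^\top\, \nabla_W^2 \mathcal{L}\;\, \mathrm{vec}(\Delta W).$ In effect, FTS identifies and formalizes the first-order term, which in many practical contexts aligns precisely with the channel-wise diagonal Fisher information (Xu, 15 Jan 2026).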

5. FTS-Based Pruning at Initialization

In one-shot pruning at initialization, FTS quantifies a parameter's importance as

$\mathrm{FTS}(\theta_i) = \big|\, g_i\, \theta_i - \tfrac{1}{2}\, F_{ii}\, \theta_i^2 \,\big|,$

where $g_i = \partial \mathcal{L} / \partial \theta_i$ and $F_{ii}$ is the empirical Fisher diagonal: $F_{ii} = \frac{1}{n} \sum_{k=1}^{n} \big(\partial \ell_k / \partial \theta_i\big)^2.$ This expression is the magnitude of the second-order Taylor estimate of the loss change when $\theta_i$ is set to zero, with the Hessian diagonal replaced by the Fisher diagonal. It combines first-order gradient and second-order Fisher curvature effects, extending pure magnitude or gradient-based pruning with additional curvature awareness (Navarrete et al., 17 Feb 2025).

A typical algorithm:

Compute $g_i$ and $F_{ii}$ over $N$ mini-batches at initialization.
Calculate $\mathrm{FTS}(\theta_i)$ for every parameter.
Apply a global sparsity threshold: retain the parameters with the highest $\mathrm{FTS}(\theta_i)$ values.
Prune remaining weights and train the resulting sparse network.

This procedure requires only $N$ forward–backward passes. Empirically, a small number of mini-batches is sufficient. In very deep networks, combining FTS with a brief “warm-up” epoch of dense training robustly prevents layer collapse at extreme sparsity (Navarrete et al., 17 Feb 2025).
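The steps above can be sketched on a toy problem. This is an illustrative sketch only: it assumes the score has the form $|g_i \theta_i - \tfrac{1}{2} F_{ii} \theta_i^2|$, uses a linear least-squares model in place of a neural network so that per-sample gradients can be written by hand, and estimates the Fisher diagonal from per-sample squared gradients. The names `fts_scores` and `global_mask` and all data are hypothetical.

```python
import random

def fts_scores(theta, batches):
    """Accumulate the mean gradient g_i and empirical Fisher diagonal F_ii, then score."""
    d = len(theta)
    g = [0.0] * d
    F = [0.0] * d
    count = 0
    for batch in batches:
        for x, y in batch:
            # Residual of the per-sample loss 0.5 * (theta . x - y)^2
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for i in range(d):
                gi = err * x[i]   # per-sample gradient w.r.t. theta_i
                g[i] += gi
                F[i] += gi * gi   # squared gradient feeds the Fisher diagonal
            count += 1
    g = [v / count for v in g]
    F = [v / count for v in F]
    # Assumed score: |g_i * theta_i - 0.5 * F_ii * theta_i^2|
    return [abs(g[i] * theta[i] - 0.5 * F[i] * theta[i] ** 2) for i in range(d)]

def global_mask(scores, sparsity):
    """Keep the top (1 - sparsity) fraction of parameters by score."""
    k = max(1, round(len(scores) * (1.0 - sparsity)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

random.seed(0)
d, n_batches, batch_size = 10, 4, 16
true_w = [1.0 if i < 3 else 0.0 for i in range(d)]   # hypothetical target weights
theta = [random.gauss(0, 0.5) for _ in range(d)]     # random initialization
batches = []
for _ in range(n_batches):
    batch = []
    for _ in range(batch_size):
        x = [random.gauss(0, 1) for _ in range(d)]
        y = sum(w * xi for w, xi in zip(true_w, x))
        batch.append((x, y))
    batches.append(batch)

scores = fts_scores(theta, batches)
mask = global_mask(scores, sparsity=0.8)
pruned_theta = [t * m for t, m in zip(theta, mask)]  # sparse weights to train next
print(mask)
```

In a real network the gradients would come from automatic differentiation, and the mask would be applied across all layers at once before sparse training begins.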

6. Experimental Significance and Practical Recommendations

Empirical results for FTS in pruning settings indicate state-of-the-art performance, particularly under high or extreme sparsity. On ResNet-18 with CIFAR-10, FTS stays within about 0.2 points of the best competing method (among SNIP, GraSP, FD, FP, magnitude, and random pruning) at 80% and 95% sparsity, and attains the highest accuracy at 90% and 99%. For VGG-19, FTS achieves superior accuracy at equivalent sparsities following a single warm-up epoch.

Example results for ResNet-18 on CIFAR-10 (accuracy in %):

Sparsity	SNIP	GraSP	Mag	Random	FD	FP	FTS
80%	90.74±0.10	87.18±0.51	91.10±0.12	90.78±0.08	90.95±0.11	91.08±0.06	90.94±0.22
90%	90.36±0.34	86.60±0.51	89.88±0.28	89.35±0.13	90.04±0.21	90.20±0.08	90.55±0.23
95%	89.31±0.17	86.50±0.05	89.23±0.19	87.59±0.11	88.61±0.28	89.50±0.18	89.47±0.32
99%	84.54±0.04	84.56±0.46	71.99±0.28	78.28±0.45	82.13±0.28	83.74±0.48	84.85±0.18

Recommendations for effective FTS application include:

Batch sizes $B$ of 16–64 balance quality and runtime.
A single epoch of dense SGD suffices to prevent layer collapse.
No hyperparameter tuning beyond dense model defaults is needed.
Computational cost is comparable to SNIP and significantly less than full Hessian-based approaches.

7. Broader Impact, Extensions, and Theoretical Implications

FTS unifies multiple strands of structural importance estimation under a single metric. It is extensible to:

Structured pruning (block, group) via aggregation of per-parameter $\mathrm{FTS}$ metrics.
Post-training quantization, recovering AWQ or GPTQ weighting in the limiting cases of their respective gradient and covariance assumptions.
Hybrid sparsification/quantization schedules.
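For the structured-pruning extension listed above, one simple way to aggregate is to sum per-parameter scores within each group and rank groups by the total. The sketch below assumes summation as the aggregation rule (a mean or maximum would be equally plausible choices); the scores, group names, and the helper `group_scores` are hypothetical.

```python
def group_scores(scores, groups):
    """Sum per-parameter scores inside each group (e.g. a channel or block)."""
    return {name: sum(scores[i] for i in idx) for name, idx in groups.items()}

# Hypothetical per-parameter FTS values and a partition of parameters into channels
scores = [0.5, 0.25, 2.0, 1.0, 0.125, 0.125]
groups = {"ch0": [0, 1], "ch1": [2, 3], "ch2": [4, 5]}

agg = group_scores(scores, groups)
# The group with the smallest aggregate score is the first pruning candidate
prune_first = min(agg, key=agg.get)
print(agg, prune_first)
```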

Mean-field theory and empirical findings suggest that the empirical Fisher retains sufficient geometric information even at random initialization, supporting the effectiveness of FTS outside maximum-likelihood settings. A plausible implication is that curvature-aware criteria can be integrated efficiently into automated compression pipelines for deep neural networks and LLMs, bypassing the scalability barriers of full Hessian computations (Xu, 15 Jan 2026, Navarrete et al., 17 Feb 2025).


References

Xu (2026). Activation Sensitivity as a Unifying Principle for Post-Training Quantization.

Navarrete et al. (2025). Fishing For Cheap And Efficient Pruners At Initialization.
