
Gradient-based Explanations

Updated 16 March 2026
  • Gradient-based explanations are post hoc interpretability techniques that compute gradients to quantify input influence, aiding model transparency.
  • They include methods like Vanilla, Integrated, and SmoothGrad that balance trade-offs between faithfulness, noise reduction, and computational efficiency.
  • These techniques are applied across vision, language, survival analysis, and more, offering both deterministic and stochastic insights for robust model auditing.

Gradient-based explanations are a class of post hoc model interpretability and feature attribution techniques that use the gradients of a function (typically a neural network's output) with respect to its inputs or internal representations to identify which features are most influential for a specific prediction. This paradigm encompasses a wide spectrum of technically rigorous methods, each making specific mathematical, algorithmic, and epistemic commitments about how to connect gradients to model logic. The field has evolved to include deterministic, stochastic, and manifold-aware approaches, and now underpins transparency efforts in domains ranging from vision, language, and survival analysis to model auditing and adversarial robustness.

1. Taxonomy, Core Principles, and Methodological Frameworks

Gradient-based explanations comprise several well-defined methodological families, each characterized by its approach to attributing importance scores to input features or internal representations:

  • Vanilla Gradient Methods: These compute the input gradient $\nabla_x f(x)$ or its absolute value, sometimes multiplied elementwise by the input itself ("gradient × input"). This lineage includes Saliency Maps, Guided Backpropagation, DeconvNet, and RectGrad (Wang et al., 2024).
  • Integrated Gradient Methods: These address gradient saturation and noise by integrating gradients along paths between a baseline $x'$ and the input $x$, as in Integrated Gradients (IG) and its many variants. For differentiable $f$, IG is given by:

\mathrm{IG}_i(x) = (x_i - x'_i)\int_{\alpha=0}^1 \frac{\partial f(x'+\alpha(x-x'))}{\partial x_i} \, d\alpha

(Wang et al., 2024, Seitz, 2022).

  • Bias-Gradient Methods: These separate output contributions into input-gradient and bias terms before mapping back to the input; FullGrad is a canonical example (Wang et al., 2024).
  • Post-processing/Denoising Methods: These methods, including SmoothGrad and VarGrad, apply noise perturbations and compute statistical aggregates (mean, variance) of gradient-based saliency to improve visual and statistical stability (Agarwal et al., 2021, Mehrpanah et al., 14 Aug 2025).
  • Class Activation Mapping (CAM) Variants: Methods such as Grad-CAM operate on feature maps within convolutional architectures, combining channelwise gradient information and activations to yield spatial heatmaps, with further advances integrating region-based and pixelwise fusion (Fusion-CAM (Dekdegue et al., 5 Mar 2026), ODAM (Zhao et al., 2023)).
  • Spectral/Probabilistic Methods: Recent analyses characterize gradient explanations as band-pass filters on the model's spectral density, formalizing how perturbations and smoothing impact explanation frequency content and faithfulness (Mehrpanah et al., 14 Aug 2025).
  • Taylor/Local Linear Expansion: T-Explainer frames attributions as finite-difference Taylor gradients, yielding deterministic, stable, additive explanations that approximately satisfy desiderata such as local accuracy and consistency (Ortigossa et al., 2024).
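As a concrete instance of the path-integral family, IG can be approximated by a Riemann sum over the straight-line path from baseline to input. The sketch below is illustrative only: it uses a hypothetical quadratic "model" with an analytic gradient and verifies the completeness axiom (attributions summing to $f(x) - f(x')$).

```python
import numpy as np

W = np.array([[2.0, 0.5], [0.5, 1.0]])  # hypothetical quadratic "model" f(x) = x^T W x + b^T x
b = np.array([1.0, -1.0])

def f(x):
    return float(x @ W @ x + b @ x)

def grad_f(x):
    return 2.0 * W @ x + b  # analytic gradient (W symmetric)

def integrated_gradients(x, baseline, n_steps=256):
    # Midpoint-rule approximation of IG_i(x) = (x_i - x'_i) * integral of df/dx_i along the path.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x, baseline = np.array([1.0, 2.0]), np.zeros(2)
ig = integrated_gradients(x, baseline)
# Completeness: the attributions sum to f(x) - f(baseline).
print(ig, ig.sum(), f(x) - f(baseline))
```

In practice the same loop runs over backward passes of a network rather than an analytic gradient; the number of steps trades accuracy of the path integral against compute.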

These methodologies are grounded in rigorous axioms—completeness, sensitivity, implementation invariance (Seitz, 2022), and consistency (Ortigossa et al., 2024)—and are evaluated along two principal axes: faithfulness (does the explanation reflect the true causal pathways in the model?) and interpretability (does it align with human intuition or downstream utility?) (Wang et al., 2024).

2. Mathematical Formulation, Algorithmic Realizations, and Spectral Analysis

The technical form of gradient-based explanations is contingent on the domain and specific method:

  • Vanilla Saliency (Vision): $S(x) = |\nabla_x f(x)|$
  • Gradient × Input: $S(x) = x \odot \nabla_x f(x)$
  • Integrated Gradients: (see above)
  • SmoothGrad: $S_{SG}(x) = \frac{1}{N} \sum_{k=1}^N \nabla_x f(x + \delta^{(k)})$, with $\delta^{(k)} \sim \mathcal{N}(0, \sigma^2 I)$ (Agarwal et al., 2021).
  • Grad-CAM: for a feature map $A^k$ at layer $l$,

\alpha^c_k = \frac{1}{Z} \sum_{i=1}^H \sum_{j=1}^W \frac{\partial y^c}{\partial A^k_{i,j}}

L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha^c_k A^k\right)

(Wang et al., 2024, Dekdegue et al., 5 Mar 2026).

  • Probabilistic & Spectral View: Many explanations can be written as $\mathbb{E}_{\epsilon \sim p(\cdot;\, x, \sigma)}[g(x + \epsilon)]$, with $g$ an explainer (gradient, squared gradient, etc.) and $p$ a noise or perturbation kernel. The spectral representation reveals that gradient-based maps act as band-pass filters, with smoothing (e.g., SmoothGrad) modulating the frequency content probed in the model's PSD. Squared-gradient quantities focus on the power spectrum, with the explicit formula:

E_{\epsilon}\left[(\nabla f(x+\epsilon))^2\right] = 4\pi^2 \int \|\omega\|^2 S_f(\omega) S_{\sqrt{p}}(\omega) \, d\omega

(Mehrpanah et al., 14 Aug 2025).
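The expectation-over-perturbations view can be sketched with a Monte-Carlo SmoothGrad estimator. The toy model, $\sigma$, and sample count below are illustrative choices, not taken from the cited work.

```python
import numpy as np

def grad_f(x):
    # Analytic gradient of a toy model f(x) = sin(x[0]) + x[1]**2.
    return np.array([np.cos(x[0]), 2.0 * x[1]])

def smoothgrad(x, sigma=0.1, n_samples=500, seed=0):
    # S_SG(x) = (1/N) * sum_k grad f(x + delta_k), delta_k ~ N(0, sigma^2 I).
    rng = np.random.default_rng(seed)
    deltas = rng.normal(0.0, sigma, size=(n_samples, x.size))
    return np.mean([grad_f(x + d) for d in deltas], axis=0)

x = np.array([0.5, 1.0])
print(smoothgrad(x), grad_f(x))  # for mild noise the smoothed estimate stays near the clean gradient
```

Increasing $\sigma$ widens the perturbation kernel, which in the spectral picture narrows the band of model frequencies the explanation responds to.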

  • Stability/Equivalence of SmoothGrad and LIME: In the infinite-sample limit, SmoothGrad and continuous-LIME yield closed-form linear attributions proportional to the covariance between perturbed inputs and outputs, and both are robust in a Lipschitz sense (Agarwal et al., 2021).
  • Taylor-based Additive Methods: T-Explainer computes per-feature attributions as finite-difference gradients,

\phi_i \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}

and builds an additive surrogate $g_x(z') = \phi_0 + \sum_{i=1}^n \phi_i z'_i$, accurate up to a second-order remainder (Ortigossa et al., 2024).
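A minimal sketch of such finite-difference attributions, assuming a hypothetical toy bilinear model and an illustrative step size $h$ (not T-Explainer's actual implementation):

```python
import numpy as np

def f(x):
    # Toy black-box model: f(x) = 3*x0 - 2*x1 + x0*x1.
    return float(3.0 * x[0] - 2.0 * x[1] + x[0] * x[1])

def fd_attributions(f, x, h=1e-4):
    # phi_i ~= (f(x + h e_i) - f(x - h e_i)) / (2h), a central finite difference.
    phi = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        phi[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return phi

x = np.array([1.0, 2.0])
phi = fd_attributions(f, x)
# Local additive surrogate around x: g(z) = f(x) + sum_i phi_i * (z_i - x_i).
print(phi)  # analytic gradient at x is [3 + x1, -2 + x0] = [5, -1]
```

Because only forward evaluations are needed, this style of attribution is deterministic and model-agnostic, at the cost of $2n$ queries per explanation.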

  • Alignment with Data Manifolds: Gradient attributions are projected onto the (VAE-estimated) tangent space $T_x\mathcal{M}$ of the data manifold, with the alignment score

\text{Align}(A, x) = \frac{\| \mathrm{proj}_{T_x\mathcal{M}} A(x) \|_2}{\|A(x)\|_2}

(Bordt et al., 2022).
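The alignment score is straightforward to compute once an orthonormal tangent basis is available; the basis below is a hypothetical stand-in for the VAE-estimated tangent space used in the cited work.

```python
import numpy as np

def alignment_score(attribution, tangent_basis):
    # ||proj_T a|| / ||a||, where tangent_basis columns form an orthonormal basis of T_x M.
    proj = tangent_basis @ (tangent_basis.T @ attribution)
    return float(np.linalg.norm(proj) / np.linalg.norm(attribution))

# Hypothetical 3-D example: tangent plane spanned by the first two coordinate axes.
T = np.eye(3)[:, :2]
a_in_plane = np.array([1.0, 2.0, 0.0])  # fully aligned with the tangent plane -> score 1
a_normal = np.array([0.0, 0.0, 5.0])    # orthogonal to it                      -> score 0
print(alignment_score(a_in_plane, T), alignment_score(a_normal, T))
```

A score near 1 indicates the attribution points along directions in which the data actually varies, which is the property the cited work links to perceptual plausibility.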

3. Quantitative and Qualitative Evaluation: Faithfulness, Robustness, Smoothness, and Trade-offs

A range of metrics and experimental setups are employed to assess the efficacy of gradient-based explanations:

  • Faithfulness Metrics: Insertion/deletion AUC (measuring output drops/rises as features are ablated/restored in order of importance) (Dekdegue et al., 5 Mar 2026, Mehrpanah et al., 14 Aug 2025), Average Drop/Increase, and ablation-based retrain (ROAR) (Wang et al., 2024).
  • Smoothness-Complexity: The Expected Frequency (EF) statistic quantifies the mean frequency of power in the explanation spectrum. Large EF signals noisy, high-frequency explanations; post hoc smoothing (e.g., SmoothGrad) reduces EF at the cost of faithfulness, creating an "explanation gap" $\Delta EF$ (Mehrpanah et al., 14 Aug 2025).
  • Uncertainty Quantification: Explanations are characterized via their empirical distribution under epistemic uncertainty (ensembles, MC-dropout), with mean, standard deviation, and the coefficient of variation (CV) used to flag high-variance or low-confidence regions (Mulye et al., 2024).
  • Instance-differentiation and Compactness: Methods like ODAM target per-prediction (instance-specific) localization, compactness, and discrimination between overlapping objects—criteria not captured by class-activation approaches (Zhao et al., 2023).
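A deletion-style faithfulness check can be sketched on a toy linear model, where the gradient ranking is provably faithful. The mean-of-curve score below is a simplified stand-in for the AUC used in the cited benchmarks, which ablate pixels and re-run the network at each step.

```python
import numpy as np

w = np.array([3.0, 2.0, 0.5, 0.1])  # toy linear model f(x) = w . x
f = lambda x: float(w @ x)

def deletion_score(x, attribution, baseline=0.0):
    # Zero out features in decreasing order of |attribution|, recording the output
    # after each step; the mean of the curve is a simple stand-in for deletion AUC.
    order = np.argsort(-np.abs(attribution))
    xs, curve = x.copy(), [f(x)]
    for i in order:
        xs[i] = baseline
        curve.append(f(xs))
    return float(np.mean(curve))

x = np.ones(4)
faithful = w.copy()                         # gradient = exact attribution for a linear model
scrambled = np.array([0.1, 0.2, 2.0, 3.0])  # deliberately wrong importance ranking
print(deletion_score(x, faithful), deletion_score(x, scrambled))
# A faithful ranking drives the output down faster, so its deletion score is lower.
```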

Empirical studies consistently indicate:

  • Path-integral (IG) and post-processed (SmoothGrad, VarGrad) methods yield less noisy and more human-interpretable attributions than vanilla gradients.
  • Denoising trades off faithfulness for visual conciseness, as measured by $\Delta EF$ and insertion/deletion AUC (Mehrpanah et al., 14 Aug 2025, Dekdegue et al., 5 Mar 2026).
  • On high-dimensional tabular and vision tasks, deterministic methods (T-Explainer) achieve high stability and local accuracy, outperforming black-box SHAP and LIME in both mean error and explanation reproducibility (Ortigossa et al., 2024).
  • Alignment with the manifold tangent space enhances perceptual plausibility of attributions; input-gradient, IG, and SmoothGrad all outperform raw gradients in alignment (Bordt et al., 2022).

4. Application Scope and Domain Extensions

Gradient-based explanations have been effectively extended beyond standard classification and regression to:

  • Language and Large Model Attribution: Grad-ELLM assigns per-token importance in decoder-only transformers by combining attention-layer gradients and value aggregation, outperforming standard input-agnostic methods in faithfulness as measured by $\pi$-Soft-NC/NS metrics (Huang et al., 6 Jan 2026).
  • Survival Analysis: Gradient attributions (GradSHAP(t)) provide interpretable, time-resolved explanations for Cox-type and general deep survival models, with efficiency and local completeness guarantees in time-dependent settings (Langbein et al., 7 Feb 2025).
  • Time-Series/Temporal Explanations: Frame-level saliency extraction in transformer-based video models is achieved by aggregating gradients over input sequences to score important timesteps (Lee, 2024).
  • 3D Geometric Data: Saliency methods are extended from images to 3D point clouds and voxel spaces, identifying geometrically meaningful structures (edges/corners vs. planes) as most influential (Gupta et al., 2020).
  • Black-box and Closed-source APIs: Zeroth-order (likelihood-ratio) estimators allow faithful approximation of gradients in settings lacking backward access, enabling saliency map extraction for models like GPT-Vision (Zhang et al., 2024).
  • Model Confidentiality: Access to input-gradients or even processed saliency explanations enables highly query-efficient model extraction or inversion attacks, reducing the security of proprietary systems by orders of magnitude compared to label-only querying (Milli et al., 2018).
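Zeroth-order estimation can be sketched with a generic Gaussian-smoothing estimator that needs only forward queries. This is a simplified stand-in for the likelihood-ratio estimators in the cited work, with illustrative $\sigma$ and sample counts.

```python
import numpy as np

def f(x):
    # Stand-in for a black-box model that only exposes forward evaluations.
    return float(x[0] ** 2 + 3.0 * x[1])

def zo_gradient(f, x, sigma=1e-3, n_samples=2000, seed=0):
    # E[(f(x + sigma*u) - f(x)) * u] / sigma ~= grad f(x) for u ~ N(0, I);
    # subtracting the baseline f(x) reduces the estimator's variance.
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_samples, x.size))
    diffs = np.array([f(x + sigma * ui) for ui in u]) - f(x)
    return (diffs[:, None] * u).mean(axis=0) / sigma

x = np.array([1.0, 2.0])
print(zo_gradient(f, x))  # analytic gradient is [2, 3]
```

Query cost scales with the number of samples and the input dimension, which is why blockwise and variance-reduced variants matter for high-dimensional inputs.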

5. Enhancements, Limitations, and Future Directions

Active methodological innovation and critical analysis continue to advance the field:

  • Spectral Regularization: Introducing smooth activations or penalizing high-frequency explanation power shifts the Pareto frontier of smoothness–faithfulness trade-off, allowing construction of inherently interpretable model architectures (Mehrpanah et al., 14 Aug 2025).
  • Noise and Instability: Choices of perturbation scale (e.g., $\sigma$ in SmoothGrad), number of samples, or integration steps can lead to inconsistent, non-robust explanations, a formal "Rashomon Effect" (Mehrpanah et al., 14 Aug 2025). Mitigation strategies include scale selection via PSD cosine similarity or multi-scale aggregation (SpectralLens).
  • Interpretability Enhancement: Frameworks such as GAD (Gradient Artificial Distancing) use model ensembling over artificially distanced logits to sharpen, denoise, and focus explanations on the truly discriminative support (Rodrigues et al., 2024).
  • Manifold Alignment: Ensuring that attributions lie in the estimated tangent space of the data manifold is proposed as a necessary criterion for perceptual and semantic alignment; adversarially robust training or manifold-aware explanation can promote this property (Bordt et al., 2022).
  • Uncertainty and Trust: Empirically, explanations for tabular and image models may show high variance under model or data uncertainty; methods such as Guided Backpropagation exhibit significantly lower uncertainty than, e.g., IG or LIME, and should be preferred in critical applications (Mulye et al., 2024).
  • Black-box gradient estimation: Likelihood-ratio and blockwise estimators provide unbiased, scalable mechanisms to restore the explanatory power of gradients for closed deep models, with demonstrated effectiveness even for multimodal and hard-label systems (Zhang et al., 2024).
  • Security and Privacy: The power of gradients as a learning primitive presents a fundamental tension between interpretability and confidentiality; selective obfuscation or coarser explanation strategies (class-activation maps, privacy mechanisms) are prominent mitigations (Milli et al., 2018, Wang et al., 2024).
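Explanation uncertainty over an ensemble can be sketched by computing the per-feature coefficient of variation; the toy "ensemble" below is synthetic, standing in for the deep ensembles or MC-dropout samples used in the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "ensemble": 10 attribution vectors for a 4-feature input,
# jittered around a shared mean (standing in for deep ensembles / MC-dropout).
base = np.array([2.0, -1.0, 0.3, 0.0])
attributions = base + rng.normal(0.0, 0.1, size=(10, 4))

mean = attributions.mean(axis=0)
std = attributions.std(axis=0)
cv = std / (np.abs(mean) + 1e-8)   # per-feature coefficient of variation
low_confidence = cv > 1.0          # flag features whose importance is unstable
print(mean.round(2), cv.round(2), low_confidence)
```

Features with near-zero mean attribution but non-trivial spread (like the last one here) are exactly the regions a practitioner should treat as low-confidence.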

Open lines of inquiry include adaptive, task-aware path integration; group- or interaction-based attributions; privacy-preserving explanations; and algorithmic frameworks for controlling explanation bandwidth or manifold alignment in a theoretically grounded manner.

6. Representative Methods and Quantitative Benchmarks

A distilled comparative table of representative gradient-based explanation methods and salient properties, synthesizing results across the cited literature:

| Method | Faithfulness Score | Smoothness | Uncertainty | Determinism | Applicability |
|--------|--------------------|------------|-------------|-------------|---------------|
| Vanilla Gradient | High (faithful) | Low (noisy) | Variable | Yes | White-box only |
| Integrated Grad. | High | Moderate | Variable | Yes | White-box, some black-box† |
| SmoothGrad | Moderate | High | Lower | No† | White-box, some black-box† |
| Grad-CAM | Moderate | Moderate | Lower† | Yes | Vision, class/region-level |
| T-Explainer | High† | Moderate | Lowest† | Yes | Model-agnostic, deterministic |
| ODAM/Fusion-CAM | Task-dependent | High | Not reported | Yes | Object detection, vision |
| Grad-ELLM | Highest (LLMs) | Moderate† | Not evaluated | Yes | Transformer LLMs |
| BLR/LR (black-box) | High (with tuning) | Variable | Controlled† | No | Black-box APIs |

† denotes strong performance or property established relative to named state-of-the-art or in specific benchmarking context (Ortigossa et al., 2024, Huang et al., 6 Jan 2026, zhang et al., 2024, Dekdegue et al., 5 Mar 2026, Zhao et al., 2023, Wang et al., 2024).

7. Conclusion

Gradient-based explanations form a mathematically, algorithmically, and empirically rich subfield of explainable AI, linking the analytical tractability of derivatives to diverse application areas in modern deep learning. The field is marked by theoretical advances in spectral and probabilistic characterization, by practical innovations in both white-box and black-box settings, and by ongoing progress in evaluating, stabilizing, and human-aligning feature attributions. The future trajectory will likely center on reconciling smoothness and faithfulness quantitatively, enforcing manifold congruence, managing explanatory uncertainty, and ensuring transparency does not inadvertently breach confidentiality (Wang et al., 2024, Mehrpanah et al., 14 Aug 2025, Bordt et al., 2022, Mulye et al., 2024, Milli et al., 2018).

