Feature Attribution
- Feature Attribution is the formal technique for quantifying each input's contribution to a model's prediction using principles from game theory and calculus.
- It encompasses a range of methods—including gradient-based, perturbation-based, and formal approaches—to explain and debug black-box models.
- Recent evaluations focus on metrics like deletion/insertion scores while addressing challenges such as computational scalability and baseline sensitivity.
Feature attribution is the formal quantification of how individual input features contribute to the output of a machine learning model. Attribution methods assign a numeric importance score to each input variable, allowing researchers and practitioners to interpret, explain, and potentially debug black-box models. Due to its implications for scientific insight, regulatory compliance, and trust in AI systems, feature attribution has become a central topic across model classes, modalities, and application domains.
1. Theoretical Foundations and Definitions
Feature attribution operates at the intersection of functional analysis, cooperative game theory, and statistical inference. Methods can be framed as producing an importance vector φ(x) = (φ_1, …, φ_d) for an input x ∈ ℝ^d, where component φ_i reflects the unique contribution, marginal effect, or interaction value of feature x_i towards the output f(x).
Axiomatic Characterization
Classic approaches, such as the Shapley value, derive attributions from axioms—local accuracy (completeness), consistency, missingness, additivity, and symmetry are the predominant desiderata (Lundberg et al., 2017). However, results show that imposing completeness, sensitivity, and linearity simultaneously forces methods to degenerate to Gradient×Input, motivating frameworks that relax or replace these axioms (Taimeskhanov et al., 30 May 2025). The Weighted Möbius Score further unifies a broad range of attributions by representing all set functions on the feature power-set as basis expansions over Harsanyi dividends, connecting cooperative game values (Shapley, Banzhaf, Owen, etc.) to attribution solutions (Jiang et al., 2023).
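The axiomatic Shapley value above can be computed exactly for small games by enumerating coalitions. A minimal sketch (the toy value function and per-feature effects are illustrative, not from the cited papers); on an additive game, the additivity and symmetry axioms force the Shapley value to recover each feature's fixed effect:

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Exact Shapley values for an n-player cooperative game.

    `value` maps a frozenset of player indices to a real payoff.
    Cost is O(2^n), so this is only feasible for small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k not containing i
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                S = frozenset(S)
                phi[i] += w * (value(S | {i}) - value(S))
    return phi

# Toy additive game: a coalition's value is the sum of fixed per-feature
# effects, so the Shapley value recovers each effect exactly.
effects = {0: 2.0, 1: -1.0, 2: 0.5}
v = lambda S: sum(effects[i] for i in S)
print(shapley_values(3, v))
```

For non-additive value functions the same routine spreads interaction effects across the participating features, which is exactly what the Harsanyi-dividend view of the Weighted Möbius Score formalizes.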
Formal feature attribution (FFA) sidesteps axioms entirely, instead defining importance via the fraction of minimal formal explanations (subsets of features sufficing for the prediction) in which each feature occurs, relying on logical sufficiency rather than smooth or linear approximations (Yu et al., 2023, Yu et al., 2023).
2. Methodological Approaches and Algorithms
A diverse taxonomy of feature attribution methods encompasses gradient-based, perturbation-based, game-theoretic, surrogate-model, and distributional strategies.
Gradient-Based and Path Methods
- Vanilla gradient/saliency: Compute the gradient of the output with respect to the input, ∇_x f(x), and use its components (or their magnitudes) as importance scores.
- Integrated Gradients (IG): Path-integrate gradients from a reference input x′ to the target input x [Sundararajan et al., 2017]: IG_i(x) = (x_i − x′_i) ∫₀¹ ∂f(x′ + α(x − x′))/∂x_i dα.
- Manifold Integrated Gradients (MIG): Replace straight-line paths with Riemannian geodesics on a learned data manifold to reduce noise and adversarial susceptibility (Zaher et al., 2024).
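A minimal sketch of the straight-line IG integral via a midpoint Riemann sum (the toy model and analytic gradient are illustrative assumptions, not from the cited papers); completeness requires the attributions to sum to f(x) − f(x′):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=256):
    """Midpoint Riemann-sum approximation of Integrated Gradients
    along the straight-line path from `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(x) = x0^2 + 3*x1 with its analytic gradient; by the
# completeness axiom, attributions should sum to f(x) - f(baseline).
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad = lambda x: np.array([2.0 * x[0], 3.0])
x, base = np.array([2.0, 1.0]), np.zeros(2)
ig = integrated_gradients(grad, x, base)
print(ig, ig.sum(), f(x) - f(base))
```

MIG keeps this integral but swaps the straight-line path for a geodesic on a learned manifold, which changes the intermediate points fed to `grad_f`.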
Perturbation and Surrogate-Model Methods
- Feature Ablation/Occlusion: Measure output differences when subsets of feature values are replaced by baseline values.
- LIME: Fit a local linear model based on synthetic perturbations to approximate the model's behavior near a chosen instance [Ribeiro et al.].
- Fourier Feature Attribution: Exploit DFT to attribute importance in the frequency domain, with rigorous game-theoretic deletion/insertion metrics (Liu et al., 2 Apr 2025).
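Single-feature ablation, the simplest perturbation method above, can be sketched in a few lines (the linear toy scorer is an illustrative assumption; real use replaces features in groups and averages over baselines):

```python
import numpy as np

def ablation_attribution(model, x, baseline):
    """Single-feature occlusion: the importance of feature i is the drop
    in the model output when x_i is replaced by its baseline value."""
    out = model(x)
    scores = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        x_abl = x.copy()
        x_abl[i] = baseline[i]       # occlude feature i only
        scores[i] = out - model(x_abl)
    return scores

model = lambda x: 4.0 * x[0] - 2.0 * x[2]   # toy linear scorer ignoring x1
x = np.array([1.0, 5.0, 1.0])
print(ablation_attribution(model, x, np.zeros(3)))  # feature 1 scores 0
```

LIME generalizes this idea: instead of one occlusion per feature, it samples many random masks and fits a weighted linear surrogate to the resulting outputs.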
Formal and Distributional Methods
- Formal Feature Attribution (FFA): For a prediction f(x) = c on input x, enumerate all minimal feature sets S (abductive explanations) such that fixing x_i for all i ∈ S guarantees the prediction c. The attribution for feature i is the fraction of such explanations including i (Yu et al., 2023, Yu et al., 2023).
- Distributional Feature Attribution (DFAX): Attribute based on the difference in kernel density estimates of each feature's value between the predicted class and alternative classes, using only the data distribution (Li et al., 12 Nov 2025).
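Once the minimal abductive explanations have been enumerated (the #P-hard step, done by a formal reasoner in the cited work), the FFA scores themselves are a simple counting exercise. A sketch with a hypothetical explanation set:

```python
from fractions import Fraction

def ffa(num_features, minimal_explanations):
    """Formal feature attribution: the importance of feature i is the
    fraction of minimal abductive explanations that contain i."""
    m = len(minimal_explanations)
    return [Fraction(sum(i in S for S in minimal_explanations), m)
            for i in range(num_features)]

# Hypothetical model over 4 features with three minimal explanations;
# feature 3 appears in none of them, so its attribution is exactly 0.
explanations = [{0, 1}, {0, 2}, {1, 2}]
print(ffa(4, explanations))
```

Note the sharp semantics: a feature outside every minimal explanation receives attribution exactly zero, with no smoothing or sampling noise.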
Specialized and Domain-Specific Methods
- Prospector Heads: Modular, encoder-compatible heads assigning attributions via concept quantization and context-aware convolutions over token (patch, sentence, or residue) graphs. Designed for efficient, parameter-light, and modality-agnostic localization (Machiraju et al., 2024).
- LAFA for NLP: Aggregates gradients over similar sentences (neighbors in embedding space) to robustly attribute tokens in NLP, circumventing difficulties of defining reference "null" text (Zhang et al., 2022).
- Multiscale/Inverse Occlusion: In anomaly detection/outlier analysis, invert occlusion by inserting suspect segments into clean baselines to compute attribution as the resultant change in outlier score, combining information over multiple spatial/frequency scales (Shen et al., 2023).
Faithfulness, Soundness, and Completeness
Rigorous evaluation frameworks have been advanced to assess the faithfulness of attributions, quantifying not just the order of feature importance but also whether all truly predictive features are included (completeness) and whether attributions avoid false positives (soundness). Efficient algorithms for these dual metrics enable fine-grained assessment and tradeoff visualization among attribution approaches (Li et al., 2023).
3. Evaluation Metrics, Benchmarks, and Sanity Checks
Methodological rigor in attribution evaluation requires both synthetic ground-truth environments and real-world proxy or task-based metrics.
Perturbation and Faithfulness Metrics
- Deletion/Insertion Score: Measure the model's output as highly ranked features are masked (deletion) or gradually restored (insertion); area-under-curve quantifies attribution faithfulness (Gevaert et al., 2022).
- Sensitivity-n and Infidelity: Correlate the sum of attributions for randomly selected feature subsets with the actual change in model output under those perturbations.
- Inter-Seed Agreement (ISA): For speech, measure the intersection of top-quantile attribution sets across random fine-tuning seeds to assess method reliability (Shen et al., 22 May 2025).
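The deletion score above can be sketched as a short loop (the toy additive scorer and hand-set attribution vectors are illustrative assumptions); a faithful ranking removes high-impact features first and therefore earns a lower area under the deletion curve:

```python
import numpy as np

def deletion_auc(model, x, attributions, baseline):
    """Deletion curve: mask features from most to least important and
    track the model output; a lower area under the curve means the
    ranking is more faithful (important features were removed first)."""
    order = np.argsort(-attributions)        # most important first
    x_cur = x.astype(float)                  # astype copies, x is untouched
    curve = [model(x_cur)]
    for i in order:
        x_cur[i] = baseline[i]
        curve.append(model(x_cur))
    # trapezoidal area, normalised by the number of deletion steps
    auc = sum((curve[k] + curve[k + 1]) / 2 for k in range(len(order)))
    return auc / len(order)

model = lambda x: x[0] + 0.1 * x[1]          # toy additive scorer
x, base = np.ones(2), np.zeros(2)
good = deletion_auc(model, x, np.array([1.0, 0.1]), base)  # faithful ranking
bad = deletion_auc(model, x, np.array([0.1, 1.0]), base)   # reversed ranking
print(good, bad)  # good < bad
```

The insertion score is the mirror image (start from the baseline, restore features in ranked order, prefer a higher AUC); benchmarks typically report both.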
Controlled/Synthetic Lab Environments
- AttributionLab: Pair hand-crafted models (with known ground-truth) and synthetic datasets to assess not only spatial alignment but sign and grouping accuracy; demonstrates strong dependence of faithfulness on baseline choice, segmentation, and overfitting to distributional shifts (Zhang et al., 2023).
- Formal Selection Frameworks: Generate distributions with oracle-accessible ground-truth support sets, making it feasible to benchmark both instance-wise (minimal) and relaxed, probabilistic attribution rules (Afchar et al., 2021).
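A toy version of the controlled-lab idea, sketched under illustrative assumptions (the hand-crafted model and its known support {0, 1} are not from the cited benchmarks): because the model provably ignores the decoy features, any faithful attribution method must rank the true support on top, which gives an oracle sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-crafted "lab" model: the ground-truth support is features {0, 1};
# features 2 and 3 are decoys the model provably ignores.
model = lambda x: 3.0 * x[0] - 2.0 * x[1]

def occlusion_scores(x, baseline):
    # Zero out one feature at a time and record the output change.
    return np.array([model(x) - model(np.where(np.arange(4) == i, baseline, x))
                     for i in range(4)])

hits = 0
for _ in range(100):                     # random instances from the "lab"
    x = rng.normal(size=4)
    top2 = set(np.argsort(-np.abs(occlusion_scores(x, np.zeros(4))))[:2])
    hits += top2 == {0, 1}
print(hits)  # the true support {0, 1} should be recovered on every instance
```

Real lab environments extend this scheme with sign and grouping checks, and with controlled distribution shifts that expose baseline sensitivity.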
Human-in-the-Loop and Downstream Utility
User studies, such as those evaluating GradCAM, Extremal Perturbation, and prototype-based explanations, show only modest or even negative impact of standard heatmap attributions on human-AI team accuracy, especially for fine-grained and adversarially perturbed tasks. Human usefulness often poorly correlates with automatic metrics such as Intersection over Union or Pointing Game (Nguyen et al., 2021).
4. Advances in Model Classes, Modalities, and Task Structures
Feature attribution algorithms must evolve with model architecture, domain, and inference scale:
- Tree Ensembles: SHAP values, uniquely determined by three axioms (local accuracy, missingness, consistency), are computed in polynomial time for decision forests via path-wise conditioning and recursive weight allocation (Lundberg et al., 2017).
- Ranking Models (Listwise): Extension of Shapley values to ranking/permutation outputs, with listwise coalition masking and evaluation against permutation-sensitive objectives (e.g., NDCG, Kendall's τ, group fairness), as in RankingSHAP (Heuss et al., 2024).
- Speech: Reliability collapses at fine temporal scales due to high redundancy and distributed acoustic cues; only word-aligned perturbations in word-centric tasks yield robust attributions. For speaker-identity or gender, no configuration achieves acceptable reliability (Shen et al., 22 May 2025).
- Biomedical Graph and Patch Data: Prospector heads generalize across sequence, image, and protein-graph inputs by encoding local and global concept associations, enabling accurate, interpretable, and data-efficient token localization in the presence of noisy or low-prevalence signals (Machiraju et al., 2024).
5. Limitations, Open Problems, and Practical Recommendations
Despite progress, feature attribution remains severely challenged by:
- Computational Barriers: Exact FFA is #P-hard; anytime dual-enumeration schemes provide high-quality approximations, but scalability is limited to moderate-dimensional, logically encodable models (Yu et al., 2023).
- Baseline and Masking Sensitivity: Choice of reference input (baseline) for IG/occlusion alters sign and localization. Baseline selection remains an open problem, with implications for both theoretical faithfulness and empirical stability (Zhang et al., 2023).
- Data Distribution and Support: Most sampling and surrogate-based methods violate the principle that explanations should respect the empirical distribution. Distribution-based methods (DFAX) address this by relying strictly on i.i.d. data, but interaction attributions remain unaddressed (Li et al., 12 Nov 2025).
- Synthetic Generalizability: Performance differences between methods and metrics do not consistently transfer across datasets or tasks, emphasizing the necessity of context-specific benchmarking pipelines and statistical tests (Gevaert et al., 2022).
- Overfitting and Inter-method Consistency: Many popular proxies (LIME, SHAP, gradients) fail necessary axioms and can provide wrong-instance-wise attributions, as evidenced by synthetic datasets with known ground truth (Afchar et al., 2021).
6. Emerging Directions and Generalization
Research is converging on several themes:
- Unification via First Principles: New frameworks construct attributions from atomic effects on indicator functions (building Block Riesz-Markov representations) and permit the derivation and optimization of new methods tuned to architectural or task-specific desiderata (Taimeskhanov et al., 30 May 2025).
- Flexible Parameterizations: Weighted Möbius Scores and context-aware argumentation frameworks provide families of attribution algorithms whose weights or context integrations can be adapted for interpretability, fairness, or causality (Jiang et al., 2023, Zhong et al., 2023).
- Principled Evaluation: Dual metrics (soundness/completeness) and perceptual or application-aligned criteria are now used alongside traditional perturbation metrics to assess the practical value and reliability of attributions in high-stakes domains (Li et al., 2023).
Overall, the technical maturity of feature attribution research now demands exacting standards of mathematical and computational rigor, task-aligned evaluation, domain-specific customization, and an ongoing critical re-examination of theoretical foundations and practical trade-offs underpinning interpretability in complex AI systems.