Feature Attribution Methods

Updated 7 March 2026
  • Feature attribution methods are techniques that assign importance scores to input features, clarifying model predictions for explainable AI.
  • They include gradient-based, perturbation-based, and game-theoretic approaches, each balancing theoretical rigor with practical trade-offs.
  • Evaluation methods like soundness, completeness, and sensitivity metrics are used to ensure reliable and faithful attributions.

Feature attribution methods aim to quantify and explain which input features are most responsible for a model’s prediction. These methods have become central in explainable AI (XAI), model auditing, and scientific discovery with black-box systems. Approaches range from gradient-based heuristics for deep neural networks to theoretically grounded frameworks drawing from cooperative game theory, statistical dependence, and causal inference. Differences in methodology, evaluation, and target domain have led to diverse classes of methods with varying practical, theoretical, and domain-specific trade-offs.

1. Conceptual Foundations and Taxonomy

The goal of feature attribution is to decompose a model’s prediction f(x) into contributions from each feature or group of features. At the most basic level, a feature attribution method produces a vector a ∈ ℝ^d, where a_i quantifies the importance of x_i for the output f(x). Most methods can be organized as follows:

  • Gradient-based methods: Compute attributions via partial derivatives of the output w.r.t. each feature—Saliency, Integrated Gradients (IG), Input×Gradient, SmoothGrad, DeepLIFT, LRP, and Grad-CAM belong to this class (Suh et al., 2022).
  • Perturbation-based methods: Estimate feature importance by occlusion or replacement and measure prediction change—feature ablation, LIME, KernelSHAP, and Extremal Perturbation fall here (Mothilal et al., 2020, Suh et al., 2022).
  • Game-theoretic approaches: Decompose model output using axiomatic or cooperative-game principles—especially the Shapley value, its variants (SHAP, KernelSHAP, DeepSHAP), and recent generalizations like the Weighted Möbius Score (Jiang et al., 2023).
  • Submodular function and selector-predictor methods: Learn submodular scoring functions to improve selectivity or train masking and prediction modules jointly (e.g., L2X, VIBI, REAL-X, DoRaR) (Manupriya et al., 2021, Qin et al., 2023).
  • Context-aware and argumentation-based frameworks: Integrate user context or explicit argumentative structure when attributing importance (Zhong et al., 2023).
  • Higher-order and interaction-based methods: Move beyond univariate explanations, quantifying not only main effects but also pairwise and higher-order interactions (e.g., Shapley-Taylor, higher-order IG, Möbius/interaction indices) (Butler et al., 7 Oct 2025, Jiang et al., 2023).

These foundations are connected by formal frameworks that link attribution to relaxed functional dependence, statistical interaction, and causal mediation (Afchar et al., 2021, Taimeskhanov et al., 30 May 2025, Jiang et al., 2023, Butler et al., 7 Oct 2025).
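As a concrete illustration of the gradient-based class above, the following is a minimal Input×Gradient sketch, using central finite differences in place of automatic differentiation; the linear model `f` and the example input are toy assumptions, not drawn from any cited paper:

```python
import numpy as np

def input_x_gradient(f, x, eps=1e-5):
    """Attribution a_i = x_i * df/dx_i, with the gradient estimated
    by central finite differences (autodiff would be used in practice)."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (f(xp) - f(xm)) / (2 * eps)
    return x * grad

# Toy linear model: the gradient is the weight vector, so
# Input×Gradient recovers each term's contribution w_i * x_i exactly.
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, 2.0])
a = input_x_gradient(f, x)
# a ≈ [2.0, -3.0, 1.0]
```

For deep networks one would compute the gradient with autodiff (e.g., `torch.autograd`) rather than finite differences; the attribution rule itself is unchanged.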

2. Formal Evaluation and Faithfulness Metrics

A persistent challenge is assessing whether an attribution method is “faithful” to what the model actually uses. Faithfulness-based evaluation strategies probe whether perturbing features with high or low attribution measurably impacts model predictions. Common approaches include:

  • Dual Soundness and Completeness Framework: Soundness measures what fraction of the attribution mass falls on truly predictive features; completeness measures what fraction of total predictive information is captured by the selected features. These can be computed via changes in model performance as features are masked, under the assumption that loss in performance reflects loss of predictive information (Li et al., 2023).
  • Order-Only Metrics: Insertion/Deletion (MoRF/LeRF), ROAD, ROAR, and adversarial patch-based coverage—these measure how model confidence changes as features are removed in attribution order but disregard the magnitude of attributions (Gevaert et al., 2022, Li et al., 2023).
  • Segmented and infidelity metrics: Sensitivity-n, Segment Sensitivity-n, Infidelity (quantifying deviation between attributions and actual model perturbation effects), Impact Coverage, and Max-Sensitivity (Gevaert et al., 2022).
  • Task-specific metrics: For ECG, custom metrics include localization (IoU with abnormal beats), the pointing game, and degradation (MoRF-LeRF area) (Suh et al., 2022).

Table: Metrics, Interpretations, and Domains

Metric Type                      | Example Metric(s)       | Key Domain or Use
Soundness/Completeness           | Soundness, Completeness | General, robust faithfulness (all domains)
Order-Only (Perturbation)        | MoRF/LeRF, Deletion     | Image, text, ECG practical evaluation
Segmented Sensitivity/Infidelity | SegSens-n, Infidelity   | High-dimensional inputs (images, time series)
Domain-Specific                  | IoU, Degradation Score  | ECG, speech, objects with ground-truth annotation

The dual perspective (soundness, completeness) is more sensitive to changes in attribution values and avoids the artifacts and retraining biases of many older metrics (Li et al., 2023).
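The order-only perturbation metrics above can be sketched as a deletion (MoRF) curve: features are masked in decreasing attribution order and the model output is recorded at each step. The linear model and zero baseline below are illustrative assumptions:

```python
import numpy as np

def deletion_curve(f, x, attributions, baseline=0.0):
    """Mask features Most-Relevant-First and record f after each removal."""
    order = np.argsort(-attributions)   # indices, highest attribution first
    x_cur = x.astype(float).copy()
    scores = [f(x_cur)]
    for i in order:
        x_cur[i] = baseline
        scores.append(f(x_cur))
    return np.array(scores)

# Toy linear model with a zero baseline (both are assumptions).
w = np.array([3.0, 1.0, 2.0])
f = lambda z: float(w @ z)
x = np.ones(3)
curve = deletion_curve(f, x, attributions=w * x)
# Masking most-attributed first: [6.0, 3.0, 1.0, 0.0]
```

A faithful attribution map yields a steeply dropping curve, and the area under it serves as a summary score; note that the choice of baseline replacement value is itself a known confounder, related to the out-of-distribution artifacts discussed in Section 4.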

3. Theoretical Frameworks and Mathematical Structure

Several recent works have established unifying mathematical frameworks for feature attribution.

  • Weighted Möbius Score: Generalizes existing and new attribution methods as linear combinations of “Möbius-transformed” pure contributions, corresponding to Harsanyi dividends in cooperative game theory. By varying the weight function, one recovers Shapley value, Shapley-Taylor, interaction indices, and causal mediation effects (Jiang et al., 2023).
  • Higher-Order Attributions: Systematic treatment of higher-order attributions via operator composition, generalizing IG and tying to ANOVA/Sobol classical statistics. These yield completeness properties not just for features but for higher-order interactions, and relate directly to simplicial topology (nodes, edges, triangles for 1st, 2nd, 3rd order) (Butler et al., 7 Oct 2025).
  • Constructivist Measure-Theoretic Approaches: Atomic attributions for indicator functions are extended via the Riesz–Markov theorem to general continuous models, unifying prior approaches (Integrated Gradients, Shapley, PDP) as integrals against measures (Taimeskhanov et al., 30 May 2025).
  • Relaxed Functional Dependence: Defines attributions in terms of subsets I of features that approximately determine Y, leading to instance-wise minimal explaining sets and structural properties such as dependence hierarchy and complementary dependence (Afchar et al., 2021).
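For small feature counts, the Shapley value underlying SHAP and the Möbius framework can be computed exactly by enumerating all coalitions. This brute-force sketch uses a toy additive model and a zero baseline, both assumptions for illustration:

```python
import math
from itertools import combinations

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions (O(2^d))."""
    d = len(x)
    phi = [0.0] * d
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        for k in range(d):
            for S in combinations(rest, k):
                # Standard Shapley weight |S|! (d-|S|-1)! / d!
                weight = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in range(d)]
                without = [x[j] if j in S else baseline[j] for j in range(d)]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

# Additive toy model: Shapley values recover each term's contribution,
# and their sum equals f(x) - f(baseline) (efficiency axiom).
f = lambda z: 2 * z[0] + 3 * z[1]
phi = shapley_values(f, x=[1.0, 1.0], baseline=[0.0, 0.0])
# phi ≈ [2.0, 3.0]
```

The exact computation is exponential in d, which is why practical estimators (KernelSHAP, DeepSHAP) rely on sampling or model-specific approximations.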

These frameworks clarify which aspects of attribution methods are axiomatic, which are artifacts of implementation, and point to trade-offs between completeness, computational tractability, and interpretability.

4. Open Challenges: Robustness, Reliability, and Limitations

Major limitations, trade-offs, and open issues have been identified:

  • Robustness and Output Similarity: Traditional robustness metrics based on ℓ_p-norm neighborhoods fail to separate instability due to the model from instability due to the attribution map itself. Output Similarity-based Robustness (OSR) addresses this by considering only perturbations that preserve class logits within a threshold, with generative adversarial networks used to explore the space of truly output-similar inputs (Kiourti et al., 7 Dec 2025).
  • Reliability in Domain-Specific Tasks: In domains like speech, feature attribution reliability is highly sensitive to aggregation granularity. For speech intent classification, only word-aligned perturbation-based methods achieve high inter-run agreement; for tasks relying on distributed cues (e.g., gender or speaker ID from voice), no tested method produces reliable attributions (Shen et al., 22 May 2025). In ECG analysis, activation-based methods (notably Grad-CAM) dominate due to the temporal localization of abnormalities (Suh et al., 2022).
  • Confirmation Bias and Semantic Match: Conventional attribution maps are susceptible to ungrounded interpretations. The semantic match framework formalizes the comparison between human-concept hypotheses and attribution outputs, offering median-distance (coherence) and discrimination (AUC) as quantitative metrics to guard against confirmation bias (Cinà et al., 2023).
  • Class-Dependence and Label Leakage: Standard methods such as SHAP, LIME, and Grad-CAM are class-dependent and may “leak” label information, biasing explanations towards the selected or true class and leading to inflated fidelity metrics. Distribution-aware methods (SHAP-KL, FastSHAP-KL) explicitly control the divergence from the full label distribution, thus avoiding leakage (Jethani et al., 2023).
  • Methodological Shortcomings: In user studies, attribution maps often fail to improve, and can even degrade, human-AI team performance relative to strong baselines such as nearest-neighbor prototype retrieval or raw confidence scores (Nguyen et al., 2021). No single metric or method is universally optimal, and the utility of a method is inherently use-case dependent (Gevaert et al., 2022).
  • Artifacts and Masking: Mask-and-predict approaches suffer from out-of-distribution artifacts; retrained selector-predictor methods may encode class information in the mask itself. The Double-sided Remove and Reconstruct (DoRaR) approach, via dual generative reconstructions, mitigates both pitfalls and empirically achieves lower information leakage (Qin et al., 2023).

5. Empirical Comparisons and Domain-Specific Insights

Systematic empirical studies reveal that method performance, faithfulness, and usefulness are highly context-dependent:

  • Image domain: Saliency, Grad-CAM, DeepSHAP, LIME, KernelSHAP, and region-based methods all have strengths on specific metrics and datasets. No method is strictly superior across MoRF, Sensitivity-n, Infidelity, or Impact Coverage; segmentation (e.g., SegSens-n) improves stability on large images (Gevaert et al., 2022).
  • Electrocardiogram (ECG) data: Grad-CAM surpasses all other methods by a substantial margin on IoU-based localization, pointing game, and MoRF/LeRF-based degradation scores. Saliency and LIME perform poorly for temporal localization (Suh et al., 2022).
  • Speech classification: Aggregation at linguistic or semantic units (words, phonemes) is essential for reliable attributions; otherwise, both gradient- and perturbation-based methods produce highly inconsistent results (Shen et al., 22 May 2025).
  • User studies: Visual saliency maps (Grad-CAM, Extremal Perturbation) often do not improve—and can inhibit—human performance in image classification tasks when compared to example-based (nearest neighbor) approaches or even confidence-only baselines, with weak-to-no correlation between human utility and standard proxy metrics (IoU, localization, pointing game) (Nguyen et al., 2021).

These results emphasize the necessity for benchmarking attribution methods in situ, using domain- and task-specific metrics, and incorporating human-comprehension tests when explanations are intended for end-user interpretation.

6. State-of-the-Art, Unified Views, and Future Directions

Recent contributions have unified diverse attribution methods—Shapley, IG, submodular-maximization, interaction indices, and context-aware attributions—within mathematically rigorous frameworks, and highlighted limitations inherent in both older and more recent XAI paradigms:

  • Unified kernel- and Möbius-based decomposition provides a vector-space view: all local and interaction-aware attributions are (weighted) sums of pure feature and joint-feature contributions, allowing transfer of techniques between cooperative game theory, mediation analysis, and interpretable ML (Jiang et al., 2023, Butler et al., 7 Oct 2025).
  • Higher-order attributions enable modelers to uncover not only individual feature effects but also synergistic, redundant, and antagonistic interactions among features, with connections to statistics and topological data analysis (Butler et al., 7 Oct 2025).
  • Submodular marginal-gain learning increases specificity/selectivity and guards against redundancy without sacrificing discriminative power (Manupriya et al., 2021).
  • Context integration and tripolar argumentation introduce frameworks where user, environment, and feature context all modulate attributions in transparent, structured ways (Zhong et al., 2023).
  • Faithfulness metrics are being refined—soundness/completeness, output-similarity-based robustness, semantic match, and distribution-aware evaluation set the new standard for practical, theoretically justified XAI validation (Li et al., 2023, Kiourti et al., 7 Dec 2025, Cinà et al., 2023, Jethani et al., 2023).
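A minimal numerical probe for the pairwise interactions discussed above is the mixed second partial derivative, estimated here by central finite differences; the toy model with one multiplicative pair is an assumption for illustration:

```python
import numpy as np

def pairwise_interaction(f, x, i, j, eps=1e-4):
    """Estimate the mixed partial d^2 f / dx_i dx_j by central differences.
    A nonzero value indicates a non-additive (interacting) feature pair."""
    def shift(di, dj):
        z = x.astype(float).copy()
        z[i] += di * eps
        z[j] += dj * eps
        return f(z)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * eps**2)

# Toy model: x0 and x1 interact multiplicatively; x2 enters additively.
f = lambda z: z[0] * z[1] + z[2]
x = np.array([1.0, 2.0, 3.0])
h01 = pairwise_interaction(f, x, 0, 1)  # ≈ 1.0 (interacting pair)
h02 = pairwise_interaction(f, x, 0, 2)  # ≈ 0.0 (no interaction)
```

Higher-order indices such as Shapley-Taylor generalize this idea, replacing infinitesimal derivatives with discrete coalitional differences over feature subsets.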

Open questions include: optimizing for both robustness and faithfulness, achieving reliable attributions in domains with distributed or high-dimensional features, quantifying higher-order effects efficiently, and grounding semantic match in more robust human-model communication protocols.

A plausible implication is that future progress will be driven by hybrid frameworks combining mathematically principled axioms, task-adaptive modeling, user- and context-awareness, and empirically validated metrics—extending the value of feature attribution from mere model auditing to actionable, trustworthy human-AI collaboration.
