Inference-Time Attribution
- Inference-time attribution is a suite of techniques that assigns quantitative importance to input features, training examples, or reference data at prediction time, supporting interpretability and control.
- It encompasses methods like feature, data, and causal attribution, leveraging gradients, perturbations, and counterfactuals to reveal the origins of model decisions.
- These methods enhance safety, compliance, and debugging in high-stakes applications, providing verifiable explanations across modalities such as vision, language, and generative models.
Inference-time attribution refers to the suite of techniques, frameworks, and mathematical formalisms developed to explain, quantify, and often verify how model predictions at inference are driven by specific input features, training examples, or explicit reference data. Unlike training-time attribution, which assesses how training data shaped learned parameters, inference-time attribution provides actionable explanations, provenance, and control over model outputs at the point of prediction. Recent research spans modalities including time series, vision, language, generative models, and even structured probabilistic systems; common to all is the objective of delivering reliable, interpretable, often verifiable explanations relevant at inference, addressing high-stakes scenarios such as safety, compliance, ethical generative content, and robust debugging.
1. Core Principles and Definitions
Inference-time attribution is formally defined as the process (or family of processes) that assigns, at the moment of test-time evaluation, quantitative or qualitative responsibility to model inputs, reference signals, or training instances for a specific output decision. Methods can be categorized along several axes:
- Feature Attribution: Quantifies how each input variable or segment (e.g., time point, pixel, token) contributes to the output, often via saliency vectors, gradients, or perturbation-based importance scores (Siddiqui et al., 2020, Schlegel et al., 2021, Mercier et al., 2022, Schlegel et al., 2023, Schneider et al., 17 Feb 2025, Bacha et al., 10 Oct 2025).
- Data Attribution: Identifies the influence of individual (or groups of) training examples on specific inference outcomes (Xie et al., 17 Jan 2024, Yolcu et al., 19 Feb 2024, Sun et al., 24 May 2025, Karchmer et al., 14 Aug 2025).
- Causal Attribution: Decomposes output predictions into causal contributions using formal intervention-based logic, including counterfactual tests and projections within structured models (Amin, 17 May 2025, West et al., 12 Sep 2025, Quan et al., 4 Oct 2025).
- Inference-Time Provenance: Embeds provenance at generation directly, ensuring all conditioning or reference sources used at inference are explicitly logged, as in music and creative systems (Morreale et al., 9 Oct 2025).
Fundamental goals include interpretability, faithfulness (does the attribution match the true causal or influential structure?), robustness (does the attribution resist spurious sensitivity?), and—where needed—verifiability under computational constraints (Karchmer et al., 14 Aug 2025).
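The faithfulness desideratum has a checkable form for some methods. As a minimal sketch (the toy quadratic model and all function names are ours), integrated gradients satisfies the completeness property: the attributions sum to the output difference between the input and the baseline.

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=50):
    """Approximate integrated gradients along the straight path baseline -> x."""
    alphas = (np.arange(steps) + 0.5) / steps          # midpoint rule
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy quadratic model so gradients are exact and completeness is checkable.
w = np.array([0.5, -1.0, 2.0])
f      = lambda z: float(z @ w + 0.5 * (z[0] ** 2))
grad_f = lambda z: w + np.array([z[0], 0.0, 0.0])

x, b = np.array([1.0, 2.0, -1.0]), np.zeros(3)
attr = integrated_gradients(f, grad_f, x, b)

# Completeness: attributions sum to f(x) - f(baseline).
assert abs(attr.sum() - (f(x) - f(b))) < 1e-6
```

For this quadratic model the midpoint rule integrates the path gradient exactly, so completeness holds to machine precision; for deep networks the residual of this identity is itself a diagnostic of attribution quality.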
2. Methodological Taxonomy
A diverse array of inference-time attribution approaches exists:
2.1 Feature-focused Attribution
- Auto-Encoder Augmentation: TSInsight attaches and fine-tunes a sparsity-regularized auto-encoder to a time-series classifier, generating sparse reconstructions that act as feature attribution maps. A representative formulation (notation ours) combines classification fidelity, reconstruction error, and a sparsity penalty on the reconstruction $\hat{x} = D(E(x))$:
$$\mathcal{L} = \mathcal{L}_{\text{cls}}\big(f(\hat{x}),\, y\big) + \gamma\,\lVert \hat{x} - x \rVert_2^2 + \beta\,\lVert \hat{x} \rVert_1.$$
During inference, the sparse reconstruction highlights the regions critical for the model's decision, at both the instance and model level (Siddiqui et al., 2020).
- Gradient and Perturbation-based Maps: Saliency, Integrated Gradients, Guided Backpropagation, Occlusion, and Shapley Value Sampling assign per-feature scores via either gradients or systematic ablations. The benchmark in (Mercier et al., 2022) demonstrates the trade-off: gradient-based methods are fast but may be noisy, while perturbation methods yield more continuous, human-consistent maps at higher computational cost.
- Regularized Contrastive Learning for Identifiability: xCEBRA combines contrastive learning with Jacobian regularization and proposes the "Inverted Neuron Gradient," computing the pseudoinverse of the encoder’s Jacobian after training. The resulting attribution map is theoretically guaranteed to reveal the true (zero vs. nonzero) causal connections in the generative process, a significant advance over prior saliency methods, which carry no such identifiability guarantee (Schneider et al., 17 Feb 2025).
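The perturbation-based family above can be sketched with a simple sliding-window occlusion map for a univariate series (the toy classifier, window size, and helper name are our assumptions, not any one paper's recipe):

```python
import numpy as np

def occlusion_attribution(predict, x, window=3, baseline=0.0):
    """Per-timestep importance: average prediction drop over all occluding
    windows that cover the timestep."""
    base_score = predict(x)
    attr = np.zeros_like(x, dtype=float)
    counts = np.zeros_like(x, dtype=float)
    for start in range(len(x) - window + 1):
        x_occ = x.copy()
        x_occ[start:start + window] = baseline   # suppress one window
        drop = base_score - predict(x_occ)
        attr[start:start + window] += drop
        counts[start:start + window] += 1
    return attr / np.maximum(counts, 1)

# Toy "classifier": the score depends only on timesteps 4..6.
predict = lambda s: float(s[4:7].sum())
x = np.ones(12)
attr = occlusion_attribution(predict, x)
# Timesteps inside the decisive region receive the largest scores.
assert attr[:2].max() < attr[4:7].min()
```

This is the high-cost end of the trade-off noted above: one forward pass per window, in exchange for smooth, human-consistent maps with no gradient access required.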
2.2 Data Attribution
- Surrogate Modeling and Dual Representations: DualView fits a multiclass SVM to the penultimate-layer features of a network, interprets the dual variables as data influences, and links test predictions to supporting or detracting training samples. In dual form (notation ours), the class score decomposes over training points as
$$s_c(x) = \sum_{j} \alpha_{j,c}\, \big\langle \phi(x_j), \phi(x) \big\rangle + b_c,$$
where $\phi$ denotes penultimate-layer features and $\alpha_{j,c}$ the learned dual coefficients, so each summand is the contribution of training point $x_j$. This enables fine-grained, efficient attribution compatible with pixel/feature heatmaps (Yolcu et al., 19 Feb 2024).
- Influence Functions, Representation-based Ranking: AirRep sidesteps expensive gradient/Hessian computations by learning encoder plus attention-based pooling optimized for attribution ranking, making groupwise influence estimation orders of magnitude faster than classical methods and scalable to LLMs (Sun et al., 24 May 2025).
- Uncertainty Estimation: Daunce attributes by measuring the covariance of per-example losses across multiple randomly perturbed models, aligning with influence-function intuitions while remaining feasible even for black-box models (Pan et al., 29 May 2025).
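A toy rendition of the uncertainty-based idea (the linear model, perturbation scheme, and all names are our assumptions, not Daunce's actual recipe): perturb the model many times, record per-example losses, and score each training point by the covariance of its loss with the test loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regressor and squared-error loss.
w0 = np.array([1.0, -2.0])
loss = lambda w, x, y: (x @ w - y) ** 2

X_train = rng.normal(size=(20, 2))
y_train = X_train @ w0
x_test = np.array([1.0, 1.0])
y_test = x_test @ w0

# Sample perturbed models and record per-example losses.
n_models, sigma = 200, 0.1
train_losses = np.empty((n_models, len(X_train)))
test_losses = np.empty(n_models)
for m in range(n_models):
    w = w0 + sigma * rng.normal(size=2)
    train_losses[m] = [loss(w, x, y) for x, y in zip(X_train, y_train)]
    test_losses[m] = loss(w, x_test, y_test)

# Influence score: covariance of each train loss with the test loss.
scores = np.array([np.cov(train_losses[:, i], test_losses)[0, 1]
                   for i in range(len(X_train))])
top = np.argsort(scores)[::-1][:3]   # most influential training points
```

Nothing here requires gradients or model internals: the perturbed models could equally be queried through an API, which is what makes this style of estimator black-box compatible.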
2.3 Causal and Counterfactual Frameworks
- Taylor Interactions Unification: Fourteen major attribution methods can be unified as distributing the independent and interaction effects identified via a Taylor expansion of the network output around a baseline, with each method characterized by a distinct weighting of higher-order interactions (Deng et al., 2023).
- Attribution Projection Calculus (AP-Calculus): Causal attribution is performed within a structured Bayesian network. The system assigns each label a deconfounder intermediate node and computes attribution via projected gradients, enabling fine-grained, label-specific, and interference-aware explanation. Additivity and preservation axioms ensure correct causal decomposition (Amin, 17 May 2025).
- Counterfactual Reasoning for Failure Attribution: In multi-agent and dialogue systems, A2P scaffolding enacts abduction, intervention, and outcome simulation, structuring failure attribution as explicit counterfactuals. Empirical results show marked improvements in step-level accuracy over corresponding baselines (West et al., 12 Sep 2025).
- Domain-adapted Statistical Causality: Real-time attack attribution in 6G networks leverages enhanced Granger causality with resource contention modeling, producing interpretable, statistically-guaranteed attributions tailored to dynamic, multi-tenant environments (Quan et al., 4 Oct 2025).
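The statistical-causality direction reduces, at its core, to a bivariate Granger test, sketched below with plain least squares (the lag count, toy data, and function name are assumptions; the 6G system layers resource-contention modeling on top of this core):

```python
import numpy as np

def granger_f_stat(x, y, lags=2):
    """F-statistic for 'x Granger-causes y': compare residuals of an AR model
    of y against an ARX model that also includes lagged x."""
    n = len(y)
    Y = y[lags:]
    Y_lags = np.column_stack([y[lags - k:n - k] for k in range(1, lags + 1)])
    X_lags = np.column_stack([x[lags - k:n - k] for k in range(1, lags + 1)])
    ones = np.ones((n - lags, 1))

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        r = Y - design @ beta
        return float(r @ r)

    rss_r = rss(np.hstack([ones, Y_lags]))            # restricted: y's past only
    rss_u = rss(np.hstack([ones, Y_lags, X_lags]))    # unrestricted: + x's past
    dof = n - lags - (1 + 2 * lags)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

# x drives y but not vice versa: the first statistic should dominate.
f_xy = granger_f_stat(x, y)   # large
f_yx = granger_f_stat(y, x)   # near 1 under the null
```

Comparing the F-statistic to an F(lags, dof) threshold yields the statistically guaranteed accept/reject decision that attack-attribution systems then wrap with domain-specific confidence scoring.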
3. Evaluation Protocols and Robustness Metrics
Evaluation of inference-time attribution encompasses both qualitative visualization (heatmaps, overlays, bar charts for time series) and quantitative metrics:
- Perturbation Analysis: Input suppression tests and occlusion games assess whether perturbing the most-attributed features substantially alters predictions (Siddiqui et al., 2020, Mercier et al., 2022).
- Stability and Trustworthiness: The Attribution Stability Indicator (ASI) aggregates class flips, prediction probability deviations (Jensen–Shannon distance), attribution similarity (via Pearson correlation), and controlled perturbation distance into a composite score; lower ASI values imply robust, interpretable explanations (Schlegel et al., 2023).
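The suppression idea can be made concrete with a small harness that zeroes the top-k attributed features and compares the prediction drop against random suppression (the toy linear model and helper names are ours): a faithful map should beat the random baseline.

```python
import numpy as np

def suppression_test(predict, x, attr, k, rng, trials=100):
    """Prediction drop when suppressing the top-k attributed features,
    versus the mean drop over random k-feature suppressions."""
    base = predict(x)
    top = np.argsort(attr)[::-1][:k]
    x_top = x.copy()
    x_top[top] = 0.0
    drop_top = base - predict(x_top)

    drops = []
    for _ in range(trials):
        idx = rng.choice(len(x), size=k, replace=False)
        x_r = x.copy()
        x_r[idx] = 0.0
        drops.append(base - predict(x_r))
    return drop_top, float(np.mean(drops))

# Toy model: only the first three features matter.
w = np.zeros(20)
w[:3] = 1.0
predict = lambda z: float(z @ w)
x = np.ones(20)
attr = w * x                       # exact attribution for a linear model
rng = np.random.default_rng(0)
drop_top, drop_rand = suppression_test(predict, x, attr, k=3, rng=rng)
assert drop_top > drop_rand        # faithful map beats random suppression
```

The gap between the two drops is the quantitative signal that occlusion-game and infidelity metrics summarize in the table below.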
| Metric | Assesses | Noted Property |
|---|---|---|
| Infidelity | Attribution match to output Δ | Lower is better (faithful) |
| Sensitivity | Attribution noise w/ input Δ | Lower is better (stable) |
| Occlusion game | Attribution ranking fidelity | Hard for gradient methods |
| Runtime | Feasibility for real-time use | Gradient ≪ perturbation-based |
- Verifiability: Efficiently verifiable protocols allow parties with limited compute to check $\varepsilon$-closeness of Prover-supplied attribution vectors to the optimum, via PAC-style challenge–response mechanisms using spot-check retraining and Fourier residual estimation. Verifier cost is independent of the dataset size, which marks a departure from traditional MSE-based checks or cryptographic proofs (Karchmer et al., 14 Aug 2025).
4. Trade-offs, Applications, and Interpretability
The diversity of inference-time attribution methods reflects differing trade-offs suited to target domains:
- Safety-Critical Domains: Preference for robust, continuous, and human-comprehensible attributions—perturbation-based and causal inference approaches are prominent (Siddiqui et al., 2020, Mercier et al., 2022, Quan et al., 4 Oct 2025).
- Scalability and Black-Box Models: Uncertainty-based (Daunce) and representation learning (AirRep) methods enable attribution in settings (such as LLM APIs) where gradients or internals are not accessible (Sun et al., 24 May 2025, Pan et al., 29 May 2025).
- Ethics, Rights, and Compensation: Attribution-by-design in generative systems, especially music, codifies provenance during inference—ensuring that artist-chosen reference works condition output and trigger verifiable, direct royalty mechanisms; this approach sidesteps the uncertainties of similarity-based TTA in creative domains (Morreale et al., 9 Oct 2025).
- Debugging and Model Forensics: Training Feature Attribution (TFA) links test-time mispredictions not just to at-risk inputs but to the precise regions in problematic training images, rooting out spurious correlations and patch-based shortcuts unobservable to classic saliency methods (Bacha et al., 10 Oct 2025).
- Real-Time Steerability: GrAInS exploits integrated gradients to create token-level (LLMs) or multimodal (VLMs) steering, altering model activations during inference for factuality/safety orientation, with empirical improvements over static interventions or fine-tuning (Nguyen et al., 24 Jul 2025).
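The activation-steering idea can be sketched in miniature (a toy two-layer net; all shapes, names, and the steering rule are our assumptions, not GrAInS's actual procedure): pick a direction along which the score's gradient points, here trivially the readout vector, and add a scaled copy to the hidden activation at inference.

```python
import numpy as np

# Toy two-layer network: h = tanh(W1 x); score = w2 . h.
rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(8, 4)), rng.normal(size=8)

def forward(x, steer=None):
    h = np.tanh(W1 @ x)
    if steer is not None:
        h = h + steer              # inference-time activation intervention
    return float(w2 @ h)

x = rng.normal(size=4)
# The gradient of the score w.r.t. the hidden activation is just w2 here,
# so steering along +w2 must raise the score (by alpha * ||w2||^2).
alpha = 0.1
steered = forward(x, steer=alpha * w2)
assert steered > forward(x)
```

Real systems derive the steering direction from attribution scores over tokens or modalities rather than from the readout vector, but the intervention point, adding to hidden activations at inference, is the same.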
5. Mathematical and Theoretical Underpinnings
Inference-time attribution methods are increasingly grounded in rigorous mathematical frameworks:
- Taylor Expansion and Interaction Formalism: Attribution as decomposition of output increments into sums of independent and interaction effects; fairness principles stipulate complete, non-leaking, and faithful allocation (Deng et al., 2023).
- Identifiability Guarantees: Regularized contrastive learning (xCEBRA) plus pseudo-inverse gradients reliably pinpoints connections in the Jacobian of the data-generating process; block-diagonality and minimal norm solutions underpin the approach (Schneider et al., 17 Feb 2025).
- Statistical Causality: Structured F-tests, controlling for resource allocation, and confidence scoring by blending statistical and domain-specific components allow for real-time, interpretable attributions with formal error bounds (Quan et al., 4 Oct 2025).
- Residual Estimation under PAC: Efficient residual estimation, via Boolean Fourier analysis, establishes error bounds on attributions in interactive verification protocols (Karchmer et al., 14 Aug 2025).
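As a hedged sketch of the Taylor-interaction view (symbols are ours, matching the description above rather than any paper's exact notation), the output increment from a baseline $b$ decomposes as

```latex
f(x) - f(b) \;=\; \sum_{i} \phi(i) \;+\; \sum_{|S| \ge 2} I(S),
\qquad
\phi_{\text{method}}(i) \;=\; \phi(i) \;+\; \sum_{S \ni i,\; |S| \ge 2} a_i(S)\, I(S),
```

where $\phi(i)$ collects the first-order (independent) terms of variable $i$, $I(S)$ collects the cross-derivative (interaction) terms over the subset $S$, and $a_i(S)$ are method-specific weights redistributing each interaction to its members. Under this reading, choosing an attribution method amounts to choosing the weighting scheme $a_i(S)$, and the fairness principles constrain admissible choices.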
6. Future Directions and Open Challenges
Key frontiers and challenges for inference-time attribution include:
- Instance vs Global Attribution: Methods like TSInsight and xCEBRA offer both instance-specific and global explanations. Developing methodologies that allow users to tune the locality/generalizability tradeoff remains an open research question (Siddiqui et al., 2020, Schneider et al., 17 Feb 2025).
- Complex Modalities and Structures: There is an increasing need to extend theoretical guarantees and robust methods into more complex generative, multi-agent, or high-dimensional probabilistic contexts (e.g., VAEs, structured LLMs, causal multi-agent frameworks) (West et al., 12 Sep 2025, Quan et al., 4 Oct 2025).
- Integration of Causal and Statistical Attribution: Hybrid approaches that combine formal causal intervention with robust statistical metrics (e.g., projection calculus with resource modeling) promise richer and more trustworthy attributions, especially for applications at regulatory or ethical frontiers (Amin, 17 May 2025, Quan et al., 4 Oct 2025).
- Scalability, Efficiency, and Verification: Further work on scalable, representation-based or black-box-compatible attribution, as well as on efficient verification for resource-limited parties, is central as deployment scales and legal/ethical scrutiny increases (Sun et al., 24 May 2025, Karchmer et al., 14 Aug 2025).
- Understanding Training Feature Attributions: Continued exploration of methods that associate test decisions back to explicit regions in training data, as in TFA, will likely yield deeper insights regarding shortcuts, bias, and trustworthiness in vision and sequential models (Bacha et al., 10 Oct 2025).
In summary, inference-time attribution has evolved into a mathematically rigorous, multimodal, and increasingly actionable field, delivering interpretable, controllable, and verifiable explanations critical for trustworthy and transparent AI deployment across domains.