
Training-Time Attribution

Updated 16 October 2025
  • Training-time attribution is a set of methodologies that trace, localize, and quantify how specific training data features influence model predictions.
  • Hybrid training-feature attribution combines instance and feature attribution using gradient-based techniques to pinpoint decisive features within influential training samples.
  • This approach improves model auditing by uncovering spurious correlations and artifacts, especially when validated with challenging datasets.

Training-time attribution is the set of methodologies and principles aimed at tracing, localizing, and quantifying how specific training data—often down to the level of features or regions within examples—influence the learned behaviors and test-time predictions of machine learning models. The goal is to bridge the explanatory gap between influential training points (instance attribution) and their decisive features (feature attribution) in order to reveal spurious correlations or annotation artifacts that may degrade model generalization, especially in settings with noisy or biased data.

1. Attribution Methodologies

Approaches to training-time attribution can be categorized along two axes: feature attribution and instance attribution. Feature attribution methods, often realized as “saliency maps,” quantify the contribution of input features (such as tokens in text or pixels in images) to a model’s output. Gradient-based techniques—such as simple input gradients or Integrated Gradients (IG)—are prototypical, assigning higher importance to features with larger pre-softmax gradient magnitudes.
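
As a concrete illustration, the sketch below implements Integrated Gradients for a toy two-layer scoring function (the network shape, weights, and step count are illustrative assumptions, not details from any cited paper). IG's completeness property, that attributions sum to the difference in output between the input and the baseline, serves as a sanity check.

```python
import numpy as np

# Toy two-layer scoring function; weights are illustrative only.
rng = np.random.default_rng(0)
W1 = 0.5 * rng.normal(size=(4, 6))
w2 = 0.5 * rng.normal(size=4)

def logit(x):
    """Pre-softmax score of the (single) class of interest."""
    return w2 @ np.tanh(W1 @ x)

def grad_logit(x):
    """Analytic gradient of the logit w.r.t. the input features."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def integrated_gradients(x, baseline, steps=300):
    """Midpoint-rule approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_logit(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = rng.normal(size=6)
baseline = np.zeros(6)
ig = integrated_gradients(x, baseline)
# Completeness check: ig.sum() should be close to logit(x) - logit(baseline).
```

Simple input gradients correspond to the `grad_logit` call alone; IG averages that gradient along the straight-line path from the baseline, which reduces saturation effects.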

Instance attribution, in contrast, estimates the influence of individual training examples on a given prediction. This is typically formalized via influence functions, representer methods, or heuristics like distance in feature space. For a test instance z_t and training instance z_i, influence scores I(z_t, z_i) approximate the impact of upweighting z_i on the prediction for z_t, retracing the model's sensitivity back to specific training points.
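
For intuition, a minimal influence-function computation for a logistic-regression model is sketched below (the toy dataset, damping value, and training loop are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y):
    """Gradient of the log-loss for one example; labels y are in {-1, +1}."""
    return -y * x * sigmoid(-y * (w @ x))

def hessian(w, X, damping=1e-2):
    """Damped Hessian of the mean log-loss (damping keeps it invertible)."""
    s = sigmoid(X @ w)
    H = (X * (s * (1.0 - s))[:, None]).T @ X / len(X)
    return H + damping * np.eye(X.shape[1])

def influence(w, H_inv, x_t, y_t, x_i, y_i):
    """I(z_t, z_i) ~ -grad L(z_t)^T H^{-1} grad L(z_i)."""
    return -grad_loss(w, x_t, y_t) @ H_inv @ grad_loss(w, x_i, y_i)

# Toy data and a few gradient-descent steps (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
w = np.zeros(3)
for _ in range(200):
    w -= 0.5 * np.mean([grad_loss(w, X[j], y[j]) for j in range(20)], axis=0)

H_inv = np.linalg.inv(hessian(w, X))
# A negative score means upweighting z_i would decrease the loss on z_t.
scores = [influence(w, H_inv, X[0], y[0], X[j], y[j]) for j in range(20)]
```

Ranking training points by these scores surfaces the most helpful and most harmful examples for a given test prediction; note that a point's influence on itself is always negative because the damped Hessian is positive definite.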

The key limitation in deploying each of these in isolation is their explanatory incompleteness: feature attribution only illuminates test-time saliency, while instance attribution, though critical for identifying harmful or beneficial training samples, rarely pinpoints the problematic region or feature.

2. Hybrid Training-Feature Attribution

The principal methodological innovation is the synthesis of feature and instance attributions, referred to as “training-feature attribution” (TFA) (Pezeshkpour et al., 2021). TFA combines instance retrieval with feature localization to expose not just which training examples are influential, but also which features (tokens) within those instances drive predictions. Specifically, for a test instance z_t and influential training instance z_i, TFA computes the gradient of the influence score with respect to the input features of z_i:

TFA(z_t, z_i) = ∇_{x_i} I(z_t, z_i)

This yields a heatmap over the input of the training sample, highlighting decisive features. Aggregating top features across the highest-influence training instances can further accelerate identification of systematic artifact patterns (such as repeated punctuation or spurious ratings in NLP classification). In practice, scalar importance scores per feature are usually derived by averaging gradients over the input embedding dimensions.
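
As a minimal sketch, assuming a logistic-regression influence score and using an identity matrix as a stand-in for the damped inverse Hessian (all values below are illustrative), the TFA gradient can be taken numerically with central differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y):
    """Gradient of the log-loss for one example; labels y are in {-1, +1}."""
    return -y * x * sigmoid(-y * (w @ x))

def influence(w, H_inv, x_t, y_t, x_i, y_i):
    """Instance-level influence score between a test and a training point."""
    return -grad_loss(w, x_t, y_t) @ H_inv @ grad_loss(w, x_i, y_i)

def tfa(w, H_inv, x_t, y_t, x_i, y_i, eps=1e-5):
    """TFA(z_t, z_i) = grad_{x_i} I(z_t, z_i), via central differences."""
    heat = np.zeros_like(x_i)
    for k in range(x_i.size):
        e = np.zeros_like(x_i)
        e[k] = eps
        heat[k] = (influence(w, H_inv, x_t, y_t, x_i + e, y_i)
                   - influence(w, H_inv, x_t, y_t, x_i - e, y_i)) / (2 * eps)
    return heat

rng = np.random.default_rng(1)
w = rng.normal(size=4)
H_inv = np.eye(4)  # stand-in for a damped inverse Hessian
x_t, x_i = rng.normal(size=4), rng.normal(size=4)
heatmap = tfa(w, H_inv, x_t, 1.0, x_i, -1.0)
ranked = np.argsort(-np.abs(heatmap))  # most decisive training features first
```

For text models, `x_i` would be the token embeddings of the training example, and the per-token score is the average of `heatmap` over each token's embedding dimensions.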

Training-feature attribution operationalizes “localization of influence,” allowing precise tracing of spurious correlations (artifacts) to their origins in the training data—an explanatory capability not achievable by feature or instance attribution alone.

3. Validation Set Considerations

A critical requirement for effective training-time attribution is the availability of a “challenging” validation set. Artifact detection through TFA is most effective when the validation set comprises hard or anomalous examples, forcing models to leverage features or shortcuts in the data that may not generalize. When the validation set contains such difficult instances, spurious correlations (e.g., extraneous numerical ratings or unusual punctuation in sentiment analysis) exhibit amplified influence signals, facilitating their discovery through gradient-based TFA methods (Pezeshkpour et al., 2021).

Thus, the choice and construction of the validation set are central: standard or “clean” test sets may obscure or dilute artifact signals, while a challenging validation set sharpens the attribution results and increases interpretive value.

4. User Study Evidence

Empirical studies support the practical value of training-feature attribution. In a user study with NLP/ML graduate students (Pezeshkpour et al., 2021), different visualization modalities (instance attribution, feature attribution, and hybrid TFA) were used during debugging tasks. Users employing TFA were more accurate in identifying synthetic artifacts. While TFA sometimes led to higher cognitive and computational effort (due to its granularity), it consistently resulted in more successful artifact detection than instance attribution alone, which often required exhaustive manual inspection.

These findings establish hybrid TFA as the preferred debugging and audit tool for systematically surfacing and localizing data artifacts.

5. Technical Implementation and Limitations

The implementation of training-feature attribution relies on gradient computations with respect to either pre-softmax logits or influence scores between test and training pairs, aggregated (for text) over token embeddings. A variant of TFA further trains a discriminator—such as a logistic regression on bag-of-words representations of top-versus-bottom influential samples—to highlight tokens whose high weights indicate artifact status.
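
The discriminator variant might look like the following sketch, which trains a plain gradient-descent logistic regression on bag-of-words features to separate top- from bottom-influence samples (the tiny corpus and the "!!" artifact token are fabricated purely for illustration):

```python
import numpy as np

# Fabricated corpus: the top-influence docs all carry a "!!" artifact token.
top = ["great movie !!", "loved it !!", "fine acting !!"]
bottom = ["great movie", "dull plot", "fine acting"]

vocab = sorted({tok for doc in top + bottom for tok in doc.split()})
idx = {tok: k for k, tok in enumerate(vocab)}

def bow(doc):
    """Bag-of-words count vector for one document."""
    v = np.zeros(len(vocab))
    for tok in doc.split():
        v[idx[tok]] += 1.0
    return v

X = np.stack([bow(d) for d in top + bottom])
y = np.array([1.0] * len(top) + [0.0] * len(bottom))  # 1 = top influence

# Plain gradient-descent logistic regression as the discriminator.
w = np.zeros(len(vocab))
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)

# Tokens with the highest weights are artifact candidates.
artifact = vocab[int(np.argmax(w))]
```

Because the artifact token is the only feature that separates otherwise-identical top and bottom documents, its discriminator weight dominates, which is exactly the signal used to flag it.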

While TFA is effective across various benchmarks (e.g., IMDB, HANS, DWMW17, BoolQ), its performance and interpretability can be sensitive to the choice of gradient-based attribution method and the scale of the model. Furthermore, the interpretability of TFA diminishes if the validation set lacks difficult cases or if the artifacts are subtle enough to evade influence-based signals.

The codebase implementing these methods, including feature attribution, instance attribution, and TFA, is publicly available (Pezeshkpour et al., 2021), enabling reproducibility and further development.

6. Applications and Future Directions

Training-time attribution, and in particular hybrid TFA, plays a central role in:

  • Uncovering annotation artifacts and spurious shortcuts in large, automatically or crowd-sourced datasets.
  • Auditing models for reliance on non-causal, non-generalizable features prior to deployment.
  • Guiding targeted data curation or filtering, as TFA reveals precisely which features require remediation for improved out-of-distribution robustness.

Potential future directions include advancing the granularity of TFA (e.g., towards phrase-level or higher semantic concept attribution), integrating with automated data cleaning systems, and further formalizing the relationship between influence signals and out-of-distribution generalization failure modes.

7. Summary Table

| Attribution Type | Target of Attribution | Strengths |
| --- | --- | --- |
| Feature attribution | Input features (tokens) | Highlights salient features |
| Instance attribution | Training instances | Identifies influential data |
| Training-feature attribution | Training features/tokens | Localizes data artifacts, spurious features |

Training-feature attribution thus constitutes a potent diagnostic and interpretability tool, allowing rigorous tracing of model decisions back not just to individual data points, but to the specific features within those points responsible for learned artifacts. Its integration into the model development lifecycle enables more robust, accountable, and transparent NLP model construction.
