Source-level Counterfactual Attribution

Updated 4 July 2026

Source-level Counterfactual Attribution is a framework that defines causal responsibility by comparing actual outcomes with counterfactual ones through targeted source interventions.
It employs methodologies such as trace re-execution, RL checkpoint comparisons, and evidence deletion to precisely gauge the influence of individual source elements.
Empirical evaluations demonstrate its potential in enhancing model accuracy, generating actionable repairs, and refining dataset quality across diverse applications.

Source-level Counterfactual Attribution (SCA) is a family of attribution frameworks that localize causal responsibility to a specified “source” and evaluate that responsibility by comparing actual outcomes with counterfactual outcomes under source-specific interventions. Across recent work, the source can be a step in an LLM agent trace, an atomic source in a Reinforcement Learning from Verifiable Rewards (RLVR) dataset, a training example or source corpus in training-data attribution, or a retrieved document or evidence cluster in Retrieval-Augmented Generation (RAG). What unifies these formulations is the counterfactual question: what would change if this source were altered, removed, reweighted, or relabeled, with all other relevant conditions held fixed or re-executed according to the causal structure of the system (Bonagiri et al., 25 May 2026).

1. Conceptual scope and defining intuition

In the agent setting, SCA is described as taking a concrete execution trace of an agent and asking, for each step, “If this step had been different in this way, would the overall outcome have changed from failure to success?” A failed trace is written as

$\tau = (s_1, s_2, \dots, s_T),$

with a task-specific verifier

$\mathcal{V}(y(\tau), x) \in \{0,1\},$

and source-level attribution identifies which step $s_i$ is causally responsible for failure because replacing it appropriately and re-running the downstream computation flips the outcome to success (Bonagiri et al., 25 May 2026).

In the RLVR data-lineage setting, SCA is defined at the granularity of atomic sources rather than individual samples. The counterfactual comparison is between a shared base model $\theta_0$ and a checkpoint $\theta_s$ obtained by training only on source $s$. The resulting difference is used as a per-source marginal utility estimate and also to label each instance with a learnability category (Huang et al., 26 May 2026).

In influence-style training-data attribution, the same counterfactual intuition appears as the first-order effect of up- or down-weighting a training example or source on a test functional. The classical target is

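$\tau_{\mathrm{IF}}(b \mid q) := -\,\frac{1}{N}\, g_q^\top\, H^{-1}\, g_b,$

where $g_q$ and $g_b$ are the gradients of the query functional and of the training loss on example $b$, $H$ is the Hessian of the training objective, and $N$ is the number of training examples. For a source $S$, the source-level effect is the sum over all examples in $S$ (Ma et al., 25 Nov 2025).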

In RAG systems, SCA is instantiated as counterfactual deletion of retrieved evidence. Given a question $q$, retrieved evidences $E = \{e_1, \dots, e_n\}$, and an answer $a$, the system removes one evidence cluster at a time, regenerates a counterfactual answer, and compares it with the original answer. The more the answer changes, the more causal influence is attributed to that source evidence or source URL (Roy et al., 2024).

A plausible implication is that SCA is better treated as a design pattern than as a single algorithm. The recurring structure is source identification, counterfactual intervention, outcome comparison, and aggregation into an attribution or credit signal.

2. Core formalizations of source-level counterfactuals

The most explicit formalizations in the literature differ by intervention target, but they all define attribution through a counterfactual change in behavior rather than through similarity alone.

CausalFlow models LLM-agent execution as a sequential causal chain and defines a counterfactual trace

$\tau^{(i)}_{s_i'} = (s_1, \dots, s_{i-1}, s_i', \tilde{s}_{i+1}, \dots, \tilde{s}_T)$

by replacing step $s_i$ with $s_i'$ and recomputing all subsequent steps $\tilde{s}_{i+1}, \dots, \tilde{s}_T$. The outcome variable is the verifier $\mathcal{V}(y(\tau), x)$, and the central attribution statistic is the Causal Responsibility Score (CRS): $\mathrm{CRS}(s_i) = \mathbb{1}\big[\exists\, s_i' : \mathcal{V}(y(\tau^{(i)}_{s_i'}), x) = 1\big].$ Thus $\mathrm{CRS}(s_i) = 1$ exactly when at least one local intervention at step $i$ flips the final verifier from failure to success (Bonagiri et al., 25 May 2026).
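The replace-and-re-execute loop behind CRS can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names (`causal_responsibility`, `propose`, `rerun_from`, `verify`) and the three-step arithmetic "agent" are hypothetical and are not taken from CausalFlow. The sketch only shows the shape of the computation: swap one step, recompute everything downstream, and check the verifier.

```python
from typing import Callable, List, Sequence


def causal_responsibility(
    trace: Sequence[int],
    propose: Callable[[Sequence[int], int], List[int]],
    rerun_from: Callable[[Sequence[int]], List[int]],
    verify: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return one 0/1 score per step: 1 if some single-step replacement,
    followed by re-execution of all later steps, makes the verifier pass."""
    scores = []
    for i in range(len(trace)):
        flipped = 0
        for candidate in propose(trace, i):
            # Keep steps before i, replace step i, recompute the rest.
            counterfactual = rerun_from(list(trace[:i]) + [candidate])
            if verify(counterfactual) == 1:
                flipped = 1
                break
        scores.append(flipped)
    return scores


# Hypothetical toy agent: step 0 parses a number, step 1 doubles it,
# step 2 adds one. The last step is the final answer.
OPS = [None, lambda v: v * 2, lambda v: v + 1]


def rerun_from(prefix: Sequence[int]) -> List[int]:
    steps = list(prefix)
    while len(steps) < len(OPS):
        steps.append(OPS[len(steps)](steps[-1]))
    return steps


def propose(trace: Sequence[int], i: int) -> List[int]:
    # Step 0 has no predecessor, so try alternative parses; later steps
    # are recomputed from their predecessor. Unchanged values are dropped.
    candidates = [3, 5] if i == 0 else [OPS[i](trace[i - 1])]
    return [c for c in candidates if c != trace[i]]


def verify(trace: Sequence[int]) -> int:
    return int(trace[-1] == 9)


failed_trace = [4, 7, 8]  # step 1 should have produced 8, not 7
scores = causal_responsibility(failed_trace, propose, rerun_from, verify)
print(scores)
```

In this toy setting only the faulty doubling step receives a score of 1, because it is the only step whose replacement, once propagated downstream, makes the verifier pass. A real system would draw candidate replacements from a model and would re-execute tools or LLM calls instead of pure functions.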

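ATLAS-based SCA for RLVR defines correctness indicators $c_0(x) \in \{0,1\}$ for the base model and $c_s(x) \in \{0,1\}$ for the source-specific RL checkpoint, then partitions instances into four categories: 00, 01, 10, and 11. For a source $s$ with instance set $D_s$, the category proportions are

$p_{ab}(s) = \frac{1}{|D_s|} \sum_{x \in D_s} \mathbb{1}\big[c_0(x) = a,\ c_s(x) = b\big], \quad a, b \in \{0,1\},$

and the SCA-based learnability score is a weighted aggregate of these proportions,

$\mathrm{Learn}(s) = \sum_{a,b \in \{0,1\}} w_{ab}\, p_{ab}(s).$

This treats the source as the intervention unit and the change from $\theta_0$ to $\theta_s$ as the counterfactual treatment (Huang et al., 26 May 2026).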
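As a concrete illustration of the four-way partition, the following Python sketch computes the category proportions from two lists of per-instance correctness flags (base model and source-specific checkpoint) and combines them with a weight per category. The function names and the weight values are assumptions chosen for the example; the paper's actual scale-dependent weights are not reproduced here.

```python
from collections import Counter
from typing import Dict, Sequence

CATEGORIES = ("00", "01", "10", "11")


def category_proportions(
    base_correct: Sequence[int], ckpt_correct: Sequence[int]
) -> Dict[str, float]:
    """Fraction of instances in each (base, checkpoint) correctness category."""
    if len(base_correct) != len(ckpt_correct) or not base_correct:
        raise ValueError("need two non-empty lists of equal length")
    counts = Counter(f"{int(b)}{int(c)}" for b, c in zip(base_correct, ckpt_correct))
    n = len(base_correct)
    return {k: counts.get(k, 0) / n for k in CATEGORIES}


def learnability(props: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted aggregate of the category proportions."""
    return sum(weights[k] * props[k] for k in CATEGORIES)


# Hypothetical correctness flags for eight instances of one source.
base = [0, 0, 1, 1, 0, 1, 0, 0]
ckpt = [0, 1, 1, 1, 1, 0, 0, 1]

# Illustrative weights (an assumption): reward 01, penalize 10.
weights = {"00": 0.0, "01": 1.0, "10": -1.0, "11": 0.0}

props = category_proportions(base, ckpt)
score = learnability(props, weights)
print(props, score)
```

With these inputs, three of eight instances move from incorrect to correct (01) and one moves from correct to incorrect (10), so the illustrative score is 0.375 - 0.125 = 0.25.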

Forward-only influence-style attribution preserves the same first-order counterfactual target as classical influence functions, but estimates it by short-horizon gradient propagation and test-time forward evaluation. In the limit analyzed in the paper, the estimate converges to the classical target

$-\,\frac{1}{N}\, g_q^\top\, H^{-1}\, g_b,$

and source-level attribution is obtained by summing over all examples in the source (Ma et al., 25 Nov 2025).
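To make the classical target concrete, the sketch below evaluates $-\frac{1}{N} g_q^\top H^{-1} g_b$ exactly for a two-parameter ridge regression and then sums the per-example scores by source. This is the classical influence-function quantity, not the forward-only estimator from the paper. The model, the data, the regularization strength, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float]


def solve2(A: Sequence[Sequence[float]], b: Sequence[float]) -> Vec:
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    (a, c), (d, e) = A
    det = a * e - c * d
    return ((e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det)


def hessian(xs: Sequence[Vec], lam: float) -> List[List[float]]:
    """Hessian of (1/N) * sum 0.5*(w.x - y)^2 + (lam/2)*|w|^2."""
    n = len(xs)
    h = [[lam, 0.0], [0.0, lam]]
    for x in xs:
        for i in range(2):
            for j in range(2):
                h[i][j] += x[i] * x[j] / n
    return h


def fit(xs: Sequence[Vec], ys: Sequence[float], lam: float) -> Vec:
    n = len(xs)
    rhs = [sum(x[i] * y for x, y in zip(xs, ys)) / n for i in range(2)]
    return solve2(hessian(xs, lam), rhs)


def grad(w: Vec, x: Vec, y: float) -> Vec:
    """Gradient of the squared-error loss 0.5*(w.x - y)^2 with respect to w."""
    r = w[0] * x[0] + w[1] * x[1] - y
    return (r * x[0], r * x[1])


def influence_scores(xs, ys, lam, xq, yq) -> List[float]:
    """Per-example scores -(1/N) * g_q^T H^{-1} g_b at the fitted parameters."""
    n = len(xs)
    w, H = fit(xs, ys, lam), hessian(xs, lam)
    gq = grad(w, xq, yq)
    out = []
    for x, y in zip(xs, ys):
        v = solve2(H, grad(w, x, y))  # H^{-1} g_b
        out.append(-(gq[0] * v[0] + gq[1] * v[1]) / n)
    return out


def source_totals(scores: Sequence[float], source_ids: Sequence[str]) -> Dict[str, float]:
    """Source-level effect: sum of the scores of the examples in each source."""
    totals: Dict[str, float] = {}
    for s, sid in zip(scores, source_ids):
        totals[sid] = totals.get(sid, 0.0) + s
    return totals


# Hypothetical data: a bias feature plus one input, two sources of two examples.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [0.1, 0.9, 2.2, 2.8]
sources = ["a", "a", "b", "b"]
lam = 0.1
xq, yq = (1.0, 4.0), 4.5

scores = influence_scores(xs, ys, lam, xq, yq)
totals = source_totals(scores, sources)
print(scores, totals)
```

Under the usual convention, each score approximates the first-order change in the query loss when the corresponding training example is up-weighted by $1/N$. The explicit inverse is feasible only because the model has two parameters; the approximations discussed in this section exist because this step does not scale.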

Approximate unrolled differentiation in Source defines attribution through the change in final parameters after reweighting a training source during training, then projects that parameter change onto a query functional. Its segment-based formula approximates finite-time counterfactual retraining effects and is explicitly designed for non-converged models and multi-stage pipelines (Bae et al., 2024).

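In RAG, the formal intervention is evidence removal. After clustering redundant evidences into clusters $C_1, \dots, C_K$, the counterfactual context for cluster $C_k$ is

$E_{-k} = E \setminus C_k,$

where $E$ is the full set of retrieved evidences. The model generates a counterfactual answer $a_{-k}$ from $E_{-k}$, computes a similarity $\mathrm{sim}(a, a_{-k})$ to the original answer $a$, averages over Monte Carlo samples, and converts the resulting scores into an attribution distribution by softmax (Roy et al., 2024).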
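The deletion procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RAGONITE implementation: the `generate` callable stands in for the LLM, token-level Jaccard overlap stands in for the answer-similarity measure, and scoring each cluster by one minus the mean similarity before the softmax is an assumption chosen so that larger answer changes receive more attribution.

```python
import math
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers (stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deletion_attribution(
    question: str,
    clusters: Dict[str, List[str]],
    generate: Callable[[str, List[str]], str],
    similarity: Callable[[str, str], float] = jaccard,
    n_samples: int = 1,
    temperature: float = 1.0,
) -> Dict[str, float]:
    """Remove one evidence cluster at a time, regenerate the answer, and
    turn the average answer change into a softmax distribution."""
    if not clusters:
        raise ValueError("need at least one evidence cluster")
    everything = [e for members in clusters.values() for e in members]
    original = generate(question, everything)
    change = {}
    for name in clusters:
        kept = [e for other, members in clusters.items() if other != name for e in members]
        sims = [similarity(original, generate(question, kept)) for _ in range(n_samples)]
        change[name] = 1.0 - sum(sims) / len(sims)
    top = max(change.values())  # subtract the max for numerical stability
    exps = {k: math.exp((v - top) / temperature) for k, v in change.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


# Hypothetical clusters keyed by source URL, and a toy deterministic "generator".
clusters = {
    "url/hr": ["vacation policy grants 30 days"],
    "url/it": ["laptops are replaced every 3 years"],
    "url/travel": ["flights must be booked early"],
}


def toy_generate(question: str, evidences: List[str]) -> str:
    hits = [e for e in evidences if "vacation" in e]
    return hits[0] if hits else "no answer found"


attribution = deletion_attribution("How many vacation days?", clusters, toy_generate)
print(attribution)
```

Because the toy generator is deterministic, one sample per cluster suffices. With a sampling LLM, `n_samples` would be raised so that the averaged similarity reflects the answer distribution and not a single draw.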

3. Operational paradigms and source granularities

The recent literature uses SCA at several distinct granularities. The following summary stays close to the terminology used in the papers.

Paradigm	Source unit	Counterfactual operation
LLM agents	Step $s_i$ in a trace	Replace step and sequentially re-execute descendants
RLVR dataset curation	Atomic source $s$	Train $\theta_s$ from $\theta_0$ using only $s$
Training-data attribution	Example $b$ or source $S$	Up/down-weight in training objective
Enterprise RAG	Evidence cluster or source URL	Remove cluster from retrieved context
Autoregressive credit attribution	Document $d$ in deployment-time dataset	Compare factual output with output under source removal, conditioned on non-credit

In CausalFlow, step types are explicitly logged, including REASONING, TOOL_CALL, TOOL_RESPONSE, LLM_RESPONSE, MEMORY_ACCESS, and FINAL_ANSWER. This typed trace structure provides precise intervention points and explicit dependencies for re-execution (Bonagiri et al., 25 May 2026).

In ATLAS, the source unit is an atomic source such as olympiads, stack_exchange, gsm8k, or synthetic_math. The purpose is to avoid provenance collapse and to make per-source RL interventions meaningful and comparable, because all $\theta_s$ checkpoints start from the same base model and use the same RL algorithm and hyperparameters (Huang et al., 26 May 2026).

In RAGONITE, the source unit is an evidence derived from a heterogeneous corpus: passages, lists, entire tables, and verbalized table rows. Contextualization augments each evidence with page title, previous heading, the evidence before, and the evidence after. This makes the evidence self-contained and source-aware, so attribution can be reported at the level of page URL, section, or table row (Roy et al., 2024).

In the autoregressive credit-attribution literature, the source is a record $d$ in a deployment-time dataset $D$, and a credit-attributing algorithm returns both an output $y$ and a credit set $C$. Counterfactual Credit Attribution (CCA) requires that if $d$ is not credited, then the output distribution conditioned on non-crediting of $d$ must be indistinguishable from the output distribution when $d$ is removed from the dataset (Cohen et al., 2 May 2026).

This suggests that “source level” is not tied to one canonical unit. It is instead the lowest granularity at which an intervention is judged meaningful, computationally feasible, and semantically interpretable within the application.

4. Attribution scores, repairs, and derived supervision

Several SCA systems do not stop at attribution; they use counterfactuals to produce repairs, labels, or quality scores.

CausalFlow uses CRS only as the first stage. Once a step $s_i$ is judged causally responsible, it selects a minimal counterfactual repair by maximizing a token-level minimality score $M(s_i, s_i')$ over candidate replacements $s_i'$, subject to successful verification of the repaired trace. The selected repair

$s_i^\star = \arg\max_{s_i' \,:\, \mathcal{V}(y(\tau'), x) = 1} M(s_i, s_i'),$

where $\tau'$ is the trace re-executed with $s_i'$ in place of $s_i$, yields a validated contrastive pair $(s_i, s_i^\star)$, which the paper proposes for offline preference optimization, reward modeling, or domain-specific fine-tuning (Bonagiri et al., 25 May 2026).
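The selection rule, keeping only repairs that pass verification and then preferring the one closest to the original step, can be sketched as follows. The minimality measure used here, the token-level `difflib.SequenceMatcher` ratio, is a stand-in chosen for illustration and is not the paper's score. The `passes` lookup is a hypothetical substitute for re-executing the trace and running the verifier.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence, Tuple


def minimality(original: str, candidate: str) -> float:
    """Token-level closeness in [0, 1]; 1.0 means the steps are identical.
    Stand-in measure, not the score defined in the paper."""
    return SequenceMatcher(None, original.split(), candidate.split()).ratio()


def select_minimal_repair(
    original: str,
    candidates: Sequence[str],
    passes: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (original, repair) for the verified candidate that changes the
    fewest tokens, or None if no candidate passes verification."""
    verified = [c for c in candidates if passes(c)]
    if not verified:
        return None
    best = max(verified, key=lambda c: minimality(original, c))
    return (original, best)


# Hypothetical faulty step and candidate repairs.
original_step = "total = price * 3"
candidates = [
    "total = price * 5",
    "total = 4 * price + 0",
    "total = price * 4",
]

# Hypothetical verifier outcomes, standing in for re-execution plus verification.
outcomes = {
    "total = price * 5": False,
    "total = 4 * price + 0": True,
    "total = price * 4": True,
}

pair = select_minimal_repair(original_step, candidates, lambda c: outcomes[c])
print(pair)
```

Two candidates pass verification here, and the one that edits a single token is preferred over the one that rewrites the expression. The returned tuple is the kind of rejected/accepted pair that could feed preference-style training.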

ATLAS converts source-level counterfactual outcomes into dataset curation signals. The four categories 00, 01, 10, and 11 are interpreted as unsolved, genuinely learnable, degraded, and overly easy cases respectively. Aggregating these with scale-dependent weights produces a per-source learnability score, which then enters the composite dataset quality score as a static learnability term derived directly from SCA (Huang et al., 26 May 2026).

In RAGONITE, attribution over evidence clusters is mapped to source URLs. Evaluation uses the source URL of the highest-scoring cluster, and the system can expose the full attribution distribution in a user interface. Because cluster members retain metadata such as URL and table or row identity, the attribution remains source-grounded rather than merely text-similarity-based (Roy et al., 2024).

In the faithfulness-evaluation literature for autoregressive LLMs, counterfactual editing is used to assess whether an attribution method correctly identifies source tokens whose modification flips the model’s label while keeping inputs fluent and in-distribution. The protocol is contrastive, and it ranks attribution methods by the mean percentage of tokens that must be edited to flip the prediction (Kamahi et al., 2024).
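The ranking statistic itself is simple arithmetic, as the short Python sketch below shows. The method names and edit counts are invented for illustration, and the reading that a lower mean percentage indicates a more faithful method (fewer edits were needed because the right tokens were targeted first) is an interpretation of the protocol as described here, not a quotation from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_by_edit_percentage(
    edits: Dict[str, List[Tuple[int, int]]]
) -> List[Tuple[str, float]]:
    """edits maps a method name to (tokens_edited, total_tokens) pairs,
    one per example. Returns methods sorted by mean percent edited."""
    pct = {
        method: mean([100.0 * edited / total for edited, total in pairs])
        for method, pairs in edits.items()
    }
    return sorted(pct.items(), key=lambda kv: kv[1])


# Hypothetical per-example edit counts needed to flip the prediction.
edits = {
    "method_b": [(5, 20), (3, 10)],
    "method_a": [(2, 20), (1, 10)],
}

ranking = rank_by_edit_percentage(edits)
print(ranking)
```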

A plausible implication is that SCA often functions as a supervision generator. In agents it produces corrected trace fragments; in RLVR it produces learnability labels for instances and sources; in RAG it produces source-level explanation distributions; and in training-data attribution it produces marginal utility scores for examples or corpora.

5. Empirical evaluation and observed behavior

The empirical literature evaluates SCA with markedly different metrics, but a common pattern is comparison against heuristic or non-causal alternatives.

CausalFlow evaluates on GSM8K, MBPP, SealQA Hard, and MedBrowseComp. Its reported metrics include Repair Rate, Post-Repair Accuracy, Minimality Score, CRS Precision, and an optional Consensus score based on multi-agent validation. The paper reports that test-time repair converts 42.7% of failed executions into successes on average, improves accuracy by +30.8 percentage points on MedBrowseComp and +12.6 on SealQA Hard, and achieves average minimality scores 0.79–0.87 in most benchmarks. Reported CRS Precision ranges from 0.68 to 0.84 across tasks (Bonagiri et al., 25 May 2026).

ATLAS reports that its lineage analysis attributes over 99.7% of 1.45M instances to 20 atomic sources. For the composite quality score, the paper reports Pearson and Spearman correlations with downstream RLVR performance for both Qwen3-1.7B and Qwen3-8B. In benchmark results, DAPO++ reaches approximately 15.7 Average* at 1.7B and approximately 29.6 at 8B, while on GPQA the reported Qwen3-8B + DAPO++ overall Mean@N is 55.4 (Huang et al., 26 May 2026).

RAGONITE evaluates on ConfQuestions, which contains 300 hand-created conversational questions, each in English and German, for 600 total questions, grounded in 215 public Confluence pages. With full contextualization (+ALL), retrieval Precision@1 rises from 0.440 to 0.523 and answer relevance from 0.435 to 0.585. Attribution accuracy at URL level is reported as 78.9% over 360 questions where gold URL is in top-10, with 80.6% on simple questions, 77.4% on complex questions, 77.7% on passage answers, 82.6% on list answers, 76.6% on table answers, 77.9% in English, and 80.0% in German (Roy et al., 2024).

Forward-only attribution is evaluated on the Dattri MNIST–MLP benchmark with 0.11M parameters, using LOO and LDS. The reported performance is LOO ≈ 0.022 and LDS ≈ 0.49, and the paper states that these scores match or exceed TRAK while offering forward-only inference (Ma et al., 25 Nov 2025).

Source is evaluated with LDS and subset-removal counterfactual evaluation across regression, image classification, text classification, and language modeling. The paper reports that Source outperforms existing TDA techniques in counterfactual prediction, especially for non-converged models and multi-stage training pipelines (Bae et al., 2024).

These results support a narrow but consistent empirical claim: when the intervention is aligned with the causal structure of the source unit, SCA tends to produce more localized or more predictive signals than methods that score sources without explicit counterfactual testing.

6. Robustness, bias, and theoretical barriers

The literature also emphasizes that source-level counterfactual reasoning is fragile when the intervention or crediting rule is poorly aligned with the generative process.

In RAG, authorship metadata can alter document attribution even when content is fixed. The attribution-bias study quantifies this effect with two counterfactual metrics: Counterfactually-estimated Attribution Sensitivity (CAS) and Counterfactually-estimated Attribution Bias (CAB).

The reported results show that adding authorship information can change attribution quality by 3% to 18%, and the measured CAB values are consistently positive, indicating a bias toward explicit human authorship (Abolghasemi et al., 2024).

For autoregressive LLMs, counterfactual interventions must remain in-distribution. The faithfulness-evaluation paper argues that token removal or corruption produces out-of-distribution inputs for autoregressive models, and proposes counterfactual generation instead. Using NLL-based OOD detection, it reports that editor-generated counterfactuals are approximately 1–5% OOD for the instruct-tuned predictor in one setting, while naive replacements such as <unk> or <mask> can be far more OOD (Kamahi et al., 2024).

The strongest negative result comes from the study of Counterfactual Credit Attribution for autoregressive models. It proves that CCA does not compose autoregressively: a credit-attributing next-token predictor can satisfy CCA while its credit-attributing rollout does not. It also proves that black-box CCA-Retrofit can require query complexity that grows exponentially with the output length (Cohen et al., 2 May 2026).

These results rule out two natural simplifications. First, token-level credit guarantees do not automatically yield sequence-level credit guarantees. Second, strong sequence-level crediting cannot in general be retrofitted efficiently onto an arbitrary autoregressive model using only black-box access.

A plausible implication is that practical SCA systems for autoregressive generation will need relaxed crediting guarantees, approximate augmentation, non-black-box access, or architectural designs in which source influence is explicit at the sequence level rather than inferred post hoc.

7. Open methodological tensions and research directions

Recent work identifies several recurring tensions in SCA.

One tension is granularity versus tractability. CausalFlow focuses on single-step interventions because combinatorial multi-step interventions are expensive, even though some failures may involve interacting steps (Bonagiri et al., 25 May 2026). ATLAS avoids per-instance RL attribution by operating at the source level because full RL attribution is “global and highly entangled” (Huang et al., 26 May 2026). Training-data attribution similarly moves from exact leave-one-out retraining to first-order reweighting or approximate unrolling because exact counterfactuals are too costly (Bae et al., 2024).

A second tension is causal fidelity versus computational budget. CausalFlow requires multiple intervention proposals per step plus downstream re-execution (Bonagiri et al., 25 May 2026), and ATLAS requires one RL run per atomic source; in the reported setup, this means 20 atomic sources and multi-day runs per source (Huang et al., 26 May 2026). Forward-only influence-style attribution shifts computation from inference to simulation, explicitly targeting deployment regimes where attribution must be served for many queries over a fixed set of sources (Ma et al., 25 Nov 2025).

A third tension is locality versus completeness. CausalFlow emphasizes minimal repairs and localized edits; RAGONITE removes one cluster at a time rather than exploring all subsets; and the faithfulness-evaluation work measures how small a token-level intervention can still flip the prediction (Bonagiri et al., 25 May 2026). This suggests that many SCA methods are optimized for identifying compact, actionable counterfactual sources rather than for exhaustively decomposing all interacting causes.

A fourth tension is formal credit guarantees versus usable systems. The CCA results for autoregressive models show that worst-case guarantees can be non-compositional and black-box retrofitting can be exponentially hard (Cohen et al., 2 May 2026). At the same time, empirical systems such as RAGONITE and CausalFlow demonstrate that restricted, operational counterfactual procedures can still deliver useful source-level explanations or repairs in practice (Roy et al., 2024).

Taken together, the current literature presents SCA as a technically heterogeneous but conceptually coherent field. Its central commitment is interventionist: attribution should be assigned to a source only when an explicit counterfactual change to that source produces a meaningful change in outcome. The remaining research problem is not whether this principle is useful, but how to realize it at scale, at the appropriate granularity, and with guarantees that survive the sequential and stochastic structure of modern generative systems.