
Upstream Attribution (Path Patching)

Updated 22 June 2026
  • Upstream attribution (path patching) is a method that systematically intervenes in neural network pathways to measure causal effects.
  • It generalizes beyond single-site interventions to reveal distributed, multi-step causal circuits and quantifies the proportion explained by specific paths.
  • Techniques like AtP, RelP, DPA, and APP provide efficient approximations and robust metrics for circuit discovery in complex models.

Upstream Attribution (Path Patching) refers to a class of mechanistic interpretability techniques for neural networks, especially transformers, that quantify the causal impact of information flow along specific upstream computational pathways or modules. The method systematically intervenes in the internals of the model—often by replacing activations along paths or at nodes—and measures the effect on outputs. Path patching generalizes beyond single-site interventions to reveal how distributed, multi-step pathways mediate specific behaviors, thereby enabling fine-grained causal circuit discovery.

1. Definition, Formalization, and Basic Protocol

The foundational operation in upstream attribution by path patching is a causal intervention along candidate paths in a computational graph. Consider a model $G$ with directed acyclic computational graph $\mathcal{G}$, mapping inputs $x$ to outputs $y$. A “path” $p$ is a root-to-leaf sequence in $\mathcal{G}$, typically corresponding to a semantic unit (e.g., all computations through a specific attention head across time, or an MLP block at a token position) (Goldowsky-Dill et al., 2023).

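Patch operator: For a hypothesis $H=(\mathcal{G},\delta,P,D)$, where $P$ is a set of candidate “important” paths, $\delta$ a metric on outputs, and $D$ a distribution over (reference, counterfactual) input pairs $(x_{\mathrm{ref}}, x_{\mathrm{cf}})$, the patched forward pass keeps the reference activations on every path in $P$ and substitutes counterfactual activations on all other paths. Formally,

$$G_H(x_{\mathrm{ref}}, x_{\mathrm{cf}}) = \mathrm{Treeify}(G)\big(\{x_p\}_{p}\big), \qquad x_p = \begin{cases} x_{\mathrm{ref}} & \text{if } p \in P \\ x_{\mathrm{cf}} & \text{otherwise} \end{cases}$$

where $\mathrm{Treeify}(G)$ is the treeified forward function exposing each path as an independent “leaf” (Goldowsky-Dill et al., 2023).

The causal effect of retaining $P$ is measured by, for example, the average unexplained effect (AUE),

$$\mathrm{AUE}(H) = \mathbb{E}_{(x_{\mathrm{ref}}, x_{\mathrm{cf}}) \sim D}\Big[\delta\big(G(x_{\mathrm{ref}}),\, G_H(x_{\mathrm{ref}}, x_{\mathrm{cf}})\big)\Big]$$

with corresponding proportion explained metric:

$$\text{proportion explained} = 1 - \frac{\mathrm{AUE}(H)}{\mathrm{ATE}}$$

where $\mathrm{ATE} = \mathbb{E}_{(x_{\mathrm{ref}}, x_{\mathrm{cf}}) \sim D}\big[\delta\big(G(x_{\mathrm{ref}}), G(x_{\mathrm{cf}})\big)\big]$ is the effect of ablating all paths (Goldowsky-Dill et al., 2023).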

Algorithmic steps include: treeification of $\mathcal{G}$, selection of candidate paths, efficient caching and intervention, and statistical testing via Monte Carlo sampling.
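The protocol above can be sketched on a toy model. The following is a minimal illustration, not the cited authors' implementation: the two-path model, the path names `a` and `b`, the absolute-difference metric, and the Gaussian input distribution are all assumptions chosen so that the treeified form is trivial (each path is already an independent leaf).

```python
import random

# Hypothetical toy model: two independent paths whose outputs are summed.
# Because the paths share nothing, the model is already "treeified".
PATHS = {
    "a": lambda x: 2.0 * x[0],  # strong path
    "b": lambda x: 0.1 * x[1],  # weak path
}

def forward(x):
    """Unpatched forward pass."""
    return sum(f(x) for f in PATHS.values())

def patched_forward(x_ref, x_cf, important):
    """Paths in `important` see the reference input; all others see the counterfactual."""
    return sum(f(x_ref if name in important else x_cf) for name, f in PATHS.items())

def delta(y1, y2):
    """Output metric (assumed here: absolute difference)."""
    return abs(y1 - y2)

def sample_pairs(n, seed=0):
    """Monte Carlo sample of (reference, counterfactual) input pairs."""
    rng = random.Random(seed)
    draw = lambda: [rng.gauss(0, 1), rng.gauss(0, 1)]
    return [(draw(), draw()) for _ in range(n)]

def aue(pairs, important):
    """Average unexplained effect of the hypothesis that `important` suffices."""
    total = sum(delta(forward(xr), patched_forward(xr, xc, important)) for xr, xc in pairs)
    return total / len(pairs)

def proportion_explained(pairs, important):
    """1 - AUE / (effect of ablating all paths)."""
    return 1.0 - aue(pairs, important) / aue(pairs, set())

pairs = sample_pairs(1000)
pe_a = proportion_explained(pairs, {"a"})
pe_b = proportion_explained(pairs, {"b"})
pe_all = proportion_explained(pairs, {"a", "b"})
print(f"a: {pe_a:.3f}  b: {pe_b:.3f}  a+b: {pe_all:.3f}")
```

In this sketch, retaining only the strong path explains most of the behavior, retaining only the weak path explains almost none, and retaining both explains all of it by construction.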

2. Field-Theoretic and Linear Response Framework

Recent formalizations extend path patching to a continuous-depth, field-theoretic setting. The residual stream is treated as a field over depth and token position, and the transformer is described by a depth-evolution partial differential equation (PDE) whose right-hand side encodes the blockwise update (Olivieri et al., 24 May 2026).

Intervening via patching corresponds to adding a localized source term to this evolution equation. The source is instantiated as a delta function at the patch site, projecting the source-run difference into the targeted subspace (e.g., head or token).

First-order linear response predicts patch effects:

  • Sensitivity field: the functional derivative of the observable with respect to the residual field.
  • Predicted patch effect: to first order, the sensitivity field contracted with the source term.
  • Green function for downstream propagation: the field shift at a downstream site induced by the source is obtained by integrating the Green function against that source.

This linearized regime allows efficient, predictive simulation of upstream interventions and is empirically validated to be accurate within perturbative bounds (Olivieri et al., 24 May 2026).
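For orientation, the linear-response relations described in this section can be written in generic notation. Every symbol below is an assumption introduced for illustration and is not necessarily the notation of the cited work: $h(\ell,t)$ is the residual field at depth $\ell$ and token position $t$, $F$ the blockwise update, $J$ the patch source, $\Pi$ a projector onto the targeted subspace, $(\ell_0, t_0)$ the patch site, $O$ a scalar observable, $\lambda$ the sensitivity field, and $\Gamma$ the Green function.

```latex
% Generic notation, assumed for illustration only.
\begin{align}
\partial_\ell h(\ell,t) &= F[h](\ell,t)
  && \text{(unpatched depth evolution)} \\
\partial_\ell h(\ell,t) &= F[h](\ell,t) + J(\ell,t)
  && \text{(patched evolution)} \\
J(\ell,t) &= \delta(\ell-\ell_0)\,\delta_{t,t_0}\,
  \Pi\big(h^{\mathrm{src}}(\ell_0,t_0)-h(\ell_0,t_0)\big)
  && \text{(localized source)} \\
\lambda(\ell,t) &= \frac{\delta O}{\delta h(\ell,t)}
  && \text{(sensitivity field)} \\
\Delta O &\approx \sum_t \int \mathrm{d}\ell\;\lambda(\ell,t)^{\top} J(\ell,t)
  && \text{(predicted patch effect)} \\
\delta h(\ell',t') &= \sum_t \int \mathrm{d}\ell\;\Gamma(\ell',t';\ell,t)\,J(\ell,t)
  && \text{(downstream field shift, } \ell' > \ell\text{)}
\end{align}
```

With a delta-function source, the predicted patch effect reduces to the sensitivity at the patch site contracted with the projected source-run difference, which is the same first-order structure used by gradient-based attribution.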

3. Relation to Attribution, Relevance, and Efficient Methods

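Exhaustive activation patching is faithful but computationally expensive (cost linear in the number of candidate sites), so various gradient-based approximations have been developed.

Attribution patching (AtP) estimates the effect of patching a node $n$ to first order. Writing $n(x)$ for the activation of $n$ on input $x$ and $\mathcal{L}$ for the output metric,

$$\hat{c}_{\mathrm{AtP}}(n) = \big(n(x_{\mathrm{cf}}) - n(x_{\mathrm{ref}})\big)^{\top} \left.\frac{\partial \mathcal{L}}{\partial n}\right|_{n = n(x_{\mathrm{ref}})}$$

This is efficient (2 forward passes + 1 backward pass) but suffers from failure modes such as attention softmax saturation and cancellation (Kramár et al., 2024).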
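The trade-off between a first-order attribution estimate and exact activation patching can be illustrated on a toy model. This is a sketch under stated assumptions, not the cited method's code: the weights `U` and `W` are hypothetical, the gradient is written analytically rather than obtained from a backward pass, and `tanh` saturation is used as a simple stand-in for attention softmax saturation.

```python
import math

U = [0.5, 1.0, 4.0]   # hypothetical input weights, one per node
W = [1.0, -2.0, 1.5]  # hypothetical readout weights

def nodes(x):
    """Node activations n_i(x) = u_i * x for a scalar input x."""
    return [u * x for u in U]

def metric_from_nodes(n):
    """Scalar metric L = sum_i w_i * tanh(n_i)."""
    return sum(w * math.tanh(v) for w, v in zip(W, n))

def grad_wrt_nodes(n):
    """Analytic dL/dn_i = w_i * (1 - tanh(n_i)^2), standing in for a backward pass."""
    return [w * (1.0 - math.tanh(v) ** 2) for w, v in zip(W, n)]

def activation_patching(x_ref, x_cf):
    """Exact effect of swapping each node, one at a time (one extra evaluation per node)."""
    n_ref, n_cf = nodes(x_ref), nodes(x_cf)
    base = metric_from_nodes(n_ref)
    effects = []
    for i in range(len(n_ref)):
        patched = list(n_ref)
        patched[i] = n_cf[i]
        effects.append(metric_from_nodes(patched) - base)
    return effects

def attribution_patching(x_ref, x_cf):
    """First-order estimate: (n(x_cf) - n(x_ref)) * dL/dn evaluated on the reference run."""
    n_ref, n_cf = nodes(x_ref), nodes(x_cf)
    grads = grad_wrt_nodes(n_ref)
    return [(c - r) * g for c, r, g in zip(n_cf, n_ref, grads)]

# Small perturbation: the linear estimate tracks the exact effect.
exact_small = activation_patching(0.10, 0.12)
approx_small = attribution_patching(0.10, 0.12)

# Saturated regime: the gradient at the reference point is nearly zero,
# so the estimate misses a large true effect on the third node.
exact_sat = activation_patching(2.0, -2.0)
approx_sat = attribution_patching(2.0, -2.0)
print(exact_small, approx_small)
print(exact_sat[2], approx_sat[2])
```

The saturated case shows a false negative of the kind the failure modes above describe: the exact effect on the third node is large while the first-order estimate is close to zero.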

  • AtP* augments AtP with (i) QK-fix for nonlinearity at attention queries/keys, and (ii) GradDrop to mitigate direct/indirect path cancellations. Subset sampling is used to bound residual false negatives (Kramár et al., 2024).
  • Relevance patching (RelP) replaces raw gradients in AtP with propagation coefficients from Layer-wise Relevance Propagation (LRP), which redistributes output relevance backward through the network according to architecture-specific rules, ensuring conservation and improving faithfulness, especially for MLPs and deep models:

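$$\hat{c}_{\mathrm{RelP}}(n) = \big(n(x_{\mathrm{cf}}) - n(x_{\mathrm{ref}})\big)^{\top} \rho_{\mathrm{LRP}}(n)$$

where $n(x)$ is the activation of node $n$ on input $x$ and $\rho_{\mathrm{LRP}}(n)$ is the LRP coefficient at $n(x_{\mathrm{ref}})$ (Jafari et al., 28 Aug 2025).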

  • Dual Path Attribution (DPA) analytically linearizes all major computational paths (attention heads and SwiGLU-FFN neurons) and propagates the unembedding target vector via pathwise inversion, attaining $O(1)$ cost per component and outperforming earlier counterfactual methods in efficiency and accuracy. No counterfactuals are required (Jantsch et al., 20 Mar 2026).
  • Accelerated Path Patching (APP) employs causal-mediation-style pruning (Contrastive-FLAP) before path patching, drastically reducing the search among heads. It reports a substantial runtime reduction compared to naive path patching while recovering near-identical causal circuits (Andersen et al., 7 Nov 2025).

4. Application Domains and Empirical Findings

Upstream attribution (path patching) is central in mechanistic circuit discovery for language modeling, e.g., in indirect object identification (IOI), syntactic agreement, number heuristics, or instruction-following tasks.

  • In circuit recovery, methods such as Edge Attribution Patching (EAP), AtP*, and RelP outperform prior brute-force or KL-divergence-based methods in area under the curve (AUC) for recall of “ground-truth” circuits, at substantially lower computational cost (Syed et al., 2023, Kramár et al., 2024, Jafari et al., 28 Aug 2025).
  • For multi-module LLM agents, path patching highlights the “diagnosis is not prescription” paradox: the module with highest causal blame is not always the optimal intervention site, due to downstream adaptation and linguistic co-adaptation (Jeonghun et al., 21 May 2026).
  • Cross-patching between pre-trained and instruction-tuned checkpoints at the first divergence token reveals the interaction between upstream (early-layer) state and late-layer readout; most functional effect of late stacks is realized only when reading their own upstream state (Zhou, 8 May 2026).

5. Causal Interpretation, Statistical Metrics, and Circuit Minimality

Path patching provides rigorous, quantitative metrics for the sufficiency of hypothesized pathways or modules:

  • Sufficiency (Average Unexplained Effect): The minimal set of paths $P$ can be tested for sufficiency in mediating behavior via AUE or “proportion explained” (Goldowsky-Dill et al., 2023).
  • Faithfulness: Ability of an approximation (e.g., RelP, AtP, APP) to recover the same ranking or circuit as full activation patching is measured by Pearson correlation, AUC, and ablation fidelity (Jafari et al., 28 Aug 2025, Syed et al., 2023, Andersen et al., 7 Nov 2025).
  • Minimality: Circuits are constrained so that removal of any included head or component would decrease performance below a prespecified threshold, enforced during iterative path patching (Andersen et al., 7 Nov 2025).

Statistical controls include confidence intervals for AUE and subset sampling to bound residual false negatives in approximate methods (Goldowsky-Dill et al., 2023, Kramár et al., 2024).
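The metrics in this section can be sketched with standard-library Python. This is an illustrative sketch only: the scores, the per-head contributions, the additive `score` function, the threshold, and the normal-approximation interval are all assumptions, not values or procedures taken from the cited papers.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between approximate and exact attribution scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mean_ci(samples, z=1.96):
    """Normal-approximation confidence interval for a Monte Carlo mean (e.g. AUE)."""
    m = statistics.mean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return m - half, m + half

def is_minimal(circuit, score, threshold):
    """True if removing any single component drops the score below the threshold."""
    return all(score(circuit - {c}) < threshold for c in circuit)

# Hypothetical per-component scores: exact patching vs. a cheaper approximation.
exact = [0.90, 0.10, -0.40, 0.05]
approx = [0.80, 0.15, -0.35, 0.00]
r = pearson_r(exact, approx)

# Hypothetical per-sample unexplained effects from a Monte Carlo run.
lo, hi = mean_ci([0.10, 0.12, 0.08, 0.11, 0.09])

# Hypothetical additive contributions of three heads to task performance.
contrib = {"h1": 0.60, "h2": 0.30, "h3": 0.02}
score = lambda circuit: sum(contrib[c] for c in circuit)
minimal_small = is_minimal({"h1", "h2"}, score, threshold=0.85)
minimal_large = is_minimal({"h1", "h2", "h3"}, score, threshold=0.85)
print(round(r, 3), (round(lo, 3), round(hi, 3)), minimal_small, minimal_large)
```

In this example the two-head circuit passes the minimality check, while adding the third head fails it because that head can be removed without falling below the threshold.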

6. Limitations, Assumptions, and Extensions

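Known limitations, as noted in the sections above, include the cost of exhaustive activation patching (linear in the number of candidate sites), AtP failure modes under attention softmax saturation and direct/indirect path cancellation, and the restriction of linear-response predictions to the perturbative regime.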

Proposed extensions include: integrating with automated hypothesis search, adversarial example generation, distributional shift detection, and scaling to very large model families with architecture-specific analytic inversion (Goldowsky-Dill et al., 2023, Andersen et al., 7 Nov 2025, Jantsch et al., 20 Mar 2026).

7. Cross-Domain Path Patching and Generalizations

While most upstream attribution via path patching has focused on mechanistic interpretability in neural LLMs, parallel formalizations exist in other domains. For example, in marketing attribution, the “removal effect” is defined analogously, quantifying the direct and indirect conversion effect of touchpoint/path removal within a learned Granger-causality graph of events (Tao et al., 2023). This indicates the generality of path-based causal attribution frameworks, whether instantiated as node/edge interventions in deep networks or “thinning” events in point-process models of customer behavior.

Table: Core Upstream Attribution Methods and Properties

| Method | Computational Cost | Faithfulness (Pearson r) | Distinctive Feature |
| --- | --- | --- | --- |
| Activation Patching | $O(N)$ forwards ($N$ = number of candidate sites) | 1.0 (ground truth) | Actual activation swap/intervention |
| Attribution Patching (AtP) | 2 forwards, 1 backward | 0.006–0.74 (MLP–resid) (Jafari et al., 28 Aug 2025) | First-order gradient approx. |
| RelP (Relevance Patching) | 2 forwards, 1 backward | 0.853–0.965 (Jafari et al., 28 Aug 2025) | Layer-wise relevance propagation |
| DPA (Dual Path Attribution) | […] forward, […] backward | […] (Jantsch et al., 20 Mar 2026) | Analytic linearization, $O(1)$ |
| APP (Accelerated Path Patching) | Pruning + PP on […]45% of heads | […] overlap (Andersen et al., 7 Nov 2025) | Contrastive pruning + iterative PP |
