Upstream Attribution (Path Patching)
- Upstream attribution (path patching) is a method that systematically intervenes in neural network pathways to measure causal effects.
- It generalizes beyond single-site interventions to reveal distributed, multi-step causal circuits and quantifies the proportion explained by specific paths.
- Techniques like AtP, RelP, DPA, and APP provide efficient approximations and robust metrics for circuit discovery in complex models.
Upstream Attribution (Path Patching) refers to a class of mechanistic interpretability techniques for neural networks, especially transformers, that quantify the causal impact of information flow along specific upstream computational pathways or modules. The method systematically intervenes in the internals of the model—often by replacing activations along paths or at nodes—and measures the effect on outputs. Path patching generalizes beyond single-site interventions to reveal how distributed, multi-step pathways mediate specific behaviors, thereby enabling fine-grained causal circuit discovery.
1. Definition, Formalization, and Basic Protocol
The foundational operation in upstream attribution by path patching is a causal intervention along candidate paths in a computational graph. Consider a model whose computation forms a directed acyclic graph $\mathcal{G}$ mapping inputs to outputs. A “path” is a root-to-leaf sequence in $\mathcal{G}$, typically corresponding to a semantic unit (e.g., all computations through a specific attention head across time, or an MLP block at a token position) (Goldowsky-Dill et al., 2023).
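Patch operator: Let a hypothesis be $H = (P, m, \mathcal{D})$, where $P$ is a set of candidate “important” paths, $m$ is a metric on outputs, and $\mathcal{D}$ is a distribution over (reference, counterfactual) input pairs $(x_r, x_c)$. The patched forward pass keeps the reference activations on every path in $P$ and substitutes counterfactual activations on all other paths. Formally,
$$\mathcal{G}_P(x_r, x_c) = \mathrm{Treeify}(\mathcal{G})\big(\{x_p\}_{p \in \mathrm{paths}(\mathcal{G})}\big), \qquad x_p = \begin{cases} x_r & \text{if } p \in P \\ x_c & \text{otherwise} \end{cases}$$
where $\mathrm{Treeify}(\mathcal{G})$ is the treeified forward function exposing each path as an independent “leaf” (Goldowsky-Dill et al., 2023).
The causal effect of retaining $P$ is measured by, for example, the average unexplained effect (AUE),
$$\mathrm{AUE}(P) = \mathbb{E}_{(x_r, x_c) \sim \mathcal{D}}\Big[\big|\, m(\mathcal{G}(x_r)) - m(\mathcal{G}_P(x_r, x_c)) \,\big|\Big]$$
with corresponding proportion explained metric:
$$\text{proportion explained} = 1 - \frac{\mathrm{AUE}(P)}{\mathrm{AUE}(\emptyset)}$$
where $\mathrm{AUE}(\emptyset)$ is the effect of ablating all paths (Goldowsky-Dill et al., 2023).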
Algorithmic steps include: treeification of $\mathcal{G}$, selection of candidate paths, efficient caching and intervention, and statistical testing via Monte Carlo sampling.
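The protocol above can be illustrated on a toy graph. The following is a minimal sketch, not the implementation of Goldowsky-Dill et al.: the two-path model, the path names `via_a` and `via_b`, the identity metric, and the Gaussian input pairs are all assumptions made for illustration. Paths in the hypothesis set see the reference input, all other paths see the counterfactual input, and the proportion explained is taken as one minus the ratio of the unexplained effect to the effect of patching every path.

```python
import random

PATHS = ("via_a", "via_b")  # hypothetical path names for a toy two-path graph


def treeified_forward(inputs):
    """Toy treeified model: each path reads its own copy of the input."""
    a = 2.0 * inputs["via_a"][0]  # strong path, reads feature 0
    b = 0.1 * inputs["via_b"][1]  # weak path, reads feature 1
    return a + b


def forward(x):
    """Ordinary forward pass: every path sees the same input."""
    return treeified_forward({p: x for p in PATHS})


def patched_forward(x_ref, x_cf, important):
    """Paths in `important` keep the reference input; the rest get the counterfactual."""
    return treeified_forward({p: (x_ref if p in important else x_cf) for p in PATHS})


def average_unexplained_effect(important, pairs, metric=lambda y: y):
    """Mean |m(G(x_ref)) - m(G_P(x_ref, x_cf))| over sampled input pairs."""
    gaps = [
        abs(metric(forward(r)) - metric(patched_forward(r, c, important)))
        for r, c in pairs
    ]
    return sum(gaps) / len(gaps)


def proportion_explained(important, pairs):
    """1 - AUE(P) / AUE(empty set); the denominator patches every path."""
    total = average_unexplained_effect(set(), pairs)
    return 1.0 - average_unexplained_effect(important, pairs) / total


def sample_pairs(n, seed=0):
    """Monte Carlo sample of (reference, counterfactual) input pairs."""
    rng = random.Random(seed)

    def draw():
        return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

    return [(draw(), draw()) for _ in range(n)]


if __name__ == "__main__":
    pairs = sample_pairs(1000)
    for hypothesis in ({"via_a"}, {"via_b"}, set(PATHS)):
        print(sorted(hypothesis), round(proportion_explained(hypothesis, pairs), 3))
```

In this toy setting the hypothesis containing only the strong path explains most of the effect, the weak path alone explains almost none, and the full path set explains all of it by construction.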
2. Field-Theoretic and Linear Response Framework
Recent formalizations extend path patching to a continuous-depth, field-theoretic setting. The residual stream is treated as a field $h(\ell, t)$ over depth ($\ell$) and token position ($t$). The transformer is described by a depth-evolution partial differential equation (PDE):
$$\partial_\ell h(\ell, t) = F[h](\ell, t)$$
where $F$ encodes the blockwise update (Olivieri et al., 24 May 2026).
Intervening via patching corresponds to introducing a localized source term $J(\ell, t)$:
$$\partial_\ell h(\ell, t) = F[h](\ell, t) + J(\ell, t)$$
with $J$ instantiated as a delta function at the patch site, projecting the source-run difference into the targeted subspace (e.g., head or token).
First-order linear response predicts patch effects:
- Sensitivity field $\chi(\ell, t) = \delta O / \delta h(\ell, t)$ (functional derivative of observable $O$)
- Predicted patch effect: $\Delta O \approx \sum_t \int d\ell \; \chi(\ell, t)^\top J(\ell, t)$
- Green's function $G(\ell, t; \ell', t')$ for downstream propagation. The field shift at $(\ell, t)$ induced by $J$ is
$$\delta h(\ell, t) = \sum_{t'} \int d\ell' \; G(\ell, t; \ell', t') \, J(\ell', t')$$
This linearized regime allows efficient, predictive simulation of upstream interventions and is reported to be accurate within perturbative bounds (Olivieri et al., 24 May 2026).
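The linear-response idea can be checked numerically on a discrete-depth analogue. The sketch below is an illustration under stated assumptions, not the construction of Olivieri et al.: the scalar residual stream, the `tanh` block update, the step size `DT`, and the weights are all hypothetical. A patch is modeled as a source added to the state entering one layer; the product of per-layer Jacobians plays the role of the Green's function to the final depth, which for the observable "final state" is also the sensitivity.

```python
import math

DT = 0.5                                    # hypothetical depth step size
WEIGHTS = [0.8, -0.5, 1.2, 0.3, -0.9, 0.7]  # hypothetical per-layer weights


def step(h, w):
    """One residual block: h <- h + DT * tanh(w * h)."""
    return h + DT * math.tanh(w * h)


def step_jacobian(h, w):
    """Derivative of one residual block with respect to its input state."""
    return 1.0 + DT * w * (1.0 - math.tanh(w * h) ** 2)


def run(h0, source_layer=None, source=0.0):
    """Forward pass; optionally add a source to the state entering one layer."""
    h, states = h0, []
    for layer, w in enumerate(WEIGHTS):
        if layer == source_layer:
            h += source
        states.append(h)  # state entering this layer
        h = step(h, w)
    return h, states


def propagator(states, start):
    """Discrete Green's function from layer `start` to the output."""
    g = 1.0
    for layer in range(start, len(WEIGHTS)):
        g *= step_jacobian(states[layer], WEIGHTS[layer])
    return g


def predicted_effect(h0, source_layer, source):
    """First-order prediction: sensitivity on the unpatched run times the source."""
    _, states = run(h0)
    return propagator(states, source_layer) * source


def actual_effect(h0, source_layer, source):
    """Ground truth: rerun the model with the source injected."""
    base, _ = run(h0)
    patched, _ = run(h0, source_layer, source)
    return patched - base


def relative_error(h0, source_layer, source):
    actual = actual_effect(h0, source_layer, source)
    return abs(predicted_effect(h0, source_layer, source) - actual) / abs(actual)


if __name__ == "__main__":
    for eps in (0.01, 0.1, 1.0):
        print(eps, relative_error(0.4, 2, eps))
```

The relative error of the first-order prediction grows with the size of the injected source, which is the behavior expected once a patch leaves the perturbative regime.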
3. Relation to Attribution, Relevance, and Efficient Methods
Exhaustive activation patching is faithful but computationally expensive (its cost is linear in the number of candidate sites), so various gradient-based approximations have been developed.
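- Attribution patching (AtP) uses a first-order Taylor approximation:
$$\Delta m \approx (a_{\mathrm{cf}} - a_{\mathrm{ref}})^\top \left. \frac{\partial m}{\partial a} \right|_{a = a_{\mathrm{ref}}}$$
where $a_{\mathrm{ref}}$ and $a_{\mathrm{cf}}$ are a node's activations on the reference and counterfactual inputs and $m$ is the output metric. This is efficient (two forward passes and one backward pass cover all nodes) but suffers from failure modes such as attention softmax saturation and cancellation (Kramár et al., 2024).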
- AtP* augments AtP with (i) QK-fix for nonlinearity at attention queries/keys, and (ii) GradDrop to mitigate direct/indirect path cancellations. Subset sampling is used to bound residual false negatives (Kramár et al., 2024).
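- Relevance patching (RelP) replaces the raw gradients in AtP with propagation coefficients from Layer-wise Relevance Propagation (LRP), which redistributes output relevance backward through the network according to architecture-specific rules. This enforces conservation of relevance and improves faithfulness, especially for MLPs and deep models:
$$\Delta m \approx (a_{\mathrm{cf}} - a_{\mathrm{ref}})^\top \rho(a_{\mathrm{ref}})$$
where $\rho(a_{\mathrm{ref}})$ is the LRP coefficient at the reference activation $a_{\mathrm{ref}}$ and $a_{\mathrm{cf}}$ is the corresponding counterfactual activation (Jafari et al., 28 Aug 2025).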
- Dual Path Attribution (DPA) analytically linearizes all major computational paths (attention heads and SwiGLU-FFN neurons) and propagates the unembedding target vector via pathwise inversion, attaining $O(1)$ cost per component and outperforming earlier counterfactual methods in efficiency and accuracy. No counterfactuals are required (Jantsch et al., 20 Mar 2026).
- Accelerated Path Patching (APP) employs causal-mediation-style pruning (Contrastive-FLAP) before path patching, sharply narrowing the search over attention heads. This yields a large runtime reduction compared to naive path patching while recovering near-identical causal circuits (Andersen et al., 7 Nov 2025).
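The gap between exact activation patching and its first-order approximation can be seen on a two-node toy metric. This is a hedged sketch rather than the setup of Kramár et al.: the metric, its hand-written gradient, and the activation values are assumptions, and a steep sigmoid stands in for a saturated attention softmax. Node 0 enters the metric linearly, so AtP is exact there; node 1 sits in a saturated region on the reference run, so its gradient is nearly zero even though swapping in the counterfactual activation changes the output substantially.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def metric(acts):
    """Toy output metric: node 0 is linear, node 1 passes through a steep sigmoid."""
    return 1.5 * acts[0] + sigmoid(6.0 * acts[1])


def metric_grad(acts):
    """Hand-written gradient of `metric` (stands in for one backward pass)."""
    s = sigmoid(6.0 * acts[1])
    return [1.5, 6.0 * s * (1.0 - s)]


def activation_patching(ref, cf):
    """Exact effect per node: one extra forward pass with that node swapped."""
    base = metric(ref)
    effects = []
    for i in range(len(ref)):
        patched = list(ref)
        patched[i] = cf[i]
        effects.append(metric(patched) - base)
    return effects


def attribution_patching(ref, cf):
    """First-order estimate: (a_cf - a_ref) times the gradient at a_ref."""
    grad = metric_grad(ref)
    return [(c - r) * g for r, c, g in zip(ref, cf, grad)]


REF = [0.2, -2.0]  # hypothetical reference-run activations
CF = [1.0, 2.0]    # hypothetical counterfactual-run activations

if __name__ == "__main__":
    print("activation patching: ", activation_patching(REF, CF))
    print("attribution patching:", attribution_patching(REF, CF))
```

A ranking based only on the gradient estimate would discard node 1 here, which is the kind of false negative that corrections such as the QK-fix and subset sampling are meant to catch.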
4. Application Domains and Empirical Findings
Upstream attribution (path patching) is central in mechanistic circuit discovery for language modeling, e.g., in indirect object identification (IOI), syntactic agreement, number heuristics, or instruction-following tasks.
- In circuit recovery, methods such as Edge Attribution Patching (EAP), AtP*, and RelP outperform prior brute-force or KL-divergence-based methods in area under the curve (AUC) for recall of “ground-truth” circuits, at much lower computational cost (Syed et al., 2023, Kramár et al., 2024, Jafari et al., 28 Aug 2025).
- For multi-module LLM agents, path patching highlights the “diagnosis is not prescription” paradox: the module with highest causal blame is not always the optimal intervention site, due to downstream adaptation and linguistic co-adaptation (Jeonghun et al., 21 May 2026).
- Cross-patching between pre-trained and instruction-tuned checkpoints at the first divergence token reveals the interaction between upstream (early-layer) state and late-layer readout: most of the functional effect of the late stack is realized only when it reads its own upstream state (Zhou, 8 May 2026).
5. Causal Interpretation, Statistical Metrics, and Circuit Minimality
Path patching provides rigorous, quantitative metrics for the sufficiency of hypothesized pathways or modules:
- Sufficiency (Average Unexplained Effect): A hypothesized minimal set of paths can be tested for sufficiency in mediating behavior via AUE or “proportion explained” (Goldowsky-Dill et al., 2023).
- Faithfulness: Ability of an approximation (e.g., RelP, AtP, APP) to recover the same ranking or circuit as full activation patching is measured by Pearson correlation, AUC, and ablation fidelity (Jafari et al., 28 Aug 2025, Syed et al., 2023, Andersen et al., 7 Nov 2025).
- Minimality: Circuits are constrained so that removal of any included head or component would decrease performance below a prespecified threshold, enforced during iterative path patching (Andersen et al., 7 Nov 2025).
Statistical controls include confidence intervals for AUE and subset sampling to bound residual false negatives in approximate methods (Goldowsky-Dill et al., 2023, Kramár et al., 2024).
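Two of the quantities above are simple to compute once per-component scores and per-sample effects are available. The sketch below is illustrative only: the score lists are synthetic placeholders rather than results from any cited paper, and the normal-approximation interval is an assumption about how a confidence interval for AUE might be formed, not necessarily the procedure used in the cited work.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation interval for a mean (z=1.96 gives roughly 95%)."""
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Synthetic per-component scores: exact patching vs. a cheaper approximation.
    patching_scores = [0.90, 0.05, 0.40, 0.00, 0.75]
    approx_scores = [0.80, 0.10, 0.35, 0.02, 0.70]
    print("faithfulness (Pearson r):", pearson_r(patching_scores, approx_scores))

    # Synthetic per-sample unexplained effects for one hypothesis.
    unexplained = [0.12, 0.08, 0.15, 0.10, 0.09, 0.11]
    print("AUE interval:", mean_confidence_interval(unexplained))
```

A bootstrap interval would be a reasonable alternative when the per-sample effects are far from normally distributed.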
6. Limitations, Assumptions, and Extensions
Known limitations include:
- Local linear approximation errors in highly nonlinear modules (e.g., attention softmax, deep residual mixing) (Kramár et al., 2024, Olivieri et al., 24 May 2026).
- Computation/memory cost for full intervention sweeps in large models if insufficiently pruned (Andersen et al., 7 Nov 2025).
- Approximations miss latent “zero-gradient” components whose effects are revealed only by larger-than-infinitesimal perturbations (Goldowsky-Dill et al., 2023, Syed et al., 2023).
- Assumption of correctly specified perturbative regime; outside the local linear band, error rates increase (Olivieri et al., 24 May 2026).
Proposed extensions include: integrating with automated hypothesis search, adversarial example generation, distributional shift detection, and scaling to very large model families with architecture-specific analytic inversion (Goldowsky-Dill et al., 2023, Andersen et al., 7 Nov 2025, Jantsch et al., 20 Mar 2026).
7. Cross-Domain Path Patching and Generalizations
While most upstream attribution via path patching has focused on mechanistic interpretability in neural LLMs, parallel formalizations exist in other domains. For example, in marketing attribution, the “removal effect” is defined analogously, quantifying the direct and indirect conversion effect of touchpoint/path removal within a learned Granger-causality graph of events (Tao et al., 2023). This indicates the generality of path-based causal attribution frameworks, whether instantiated as node/edge interventions in deep networks or “thinning” events in point-process models of customer behavior.
Table: Core Upstream Attribution Methods and Properties
| Method | Computational Cost | Faithfulness (Pearson r) | Distinctive Feature |
|---|---|---|---|
| Activation Patching | $O(N)$ forward passes (one per candidate site) | 1.0 (ground truth) | Actual activation swap/intervention |
| Attribution Patching (AtP) | 2 forward, 1 backward | 0.006–0.74 (MLP to residual stream; Jafari et al., 28 Aug 2025) | First-order gradient approx. |
| RelP (Relevance Patching) | 2 forward, 1 backward | 0.853–0.965 (Jafari et al., 28 Aug 2025) | Layer-wise relevance propagation |
| DPA (Dual Path Attribution) | 1 forward, 1 backward | (Jantsch et al., 20 Mar 2026) | Analytic linearization, $O(1)$ per component |
| APP (Accelerated Path Patching) | Pruning + PP on the retained heads | Near-identical circuits (Andersen et al., 7 Nov 2025) | Contrastive pruning + iterative PP |
References
- "Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability" (Olivieri et al., 24 May 2026)
- "RelP: Faithful and Efficient Circuit Discovery via Relevance Patching" (Jafari et al., 28 Aug 2025)
- "Localizing Model Behavior with Path Patching" (Goldowsky-Dill et al., 2023)
- "APP: Accelerated Path Patching with Task-Specific Pruning" (Andersen et al., 7 Nov 2025)
- "AtP*: An efficient and scalable method for localizing LLM behaviour to components" (Kramár et al., 2024)
- "Attribution Patching Outperforms Automated Circuit Discovery" (Syed et al., 2023)
- "Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation" (Jantsch et al., 20 Mar 2026)
- "A Graphical Point Process Framework for Understanding Removal Effects in Multi-Touch Attribution" (Tao et al., 2023)
- "Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic" (Zhou, 8 May 2026)
- "Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines" (Jeonghun et al., 21 May 2026)