
Activation Patching & Automated Circuit Discovery

Updated 12 March 2026
  • Activation patching and ACDC are methodologies that use targeted causal interventions to identify and extract sparse, task-critical circuits in transformer models.
  • Edge attribution techniques like EAP, EAP-IG, and EAP-GP balance computational efficiency with faithfulness, measured by metrics such as NFS and sparsity levels.
  • Innovative extensions such as position-aware patching, mixed-precision acceleration, and formal robustness guarantees advance circuit discovery for scalable mechanistic interpretability.

Activation patching and Automated Circuit Discovery (ACDC) are central methodologies in mechanistic interpretability for reverse-engineering the internal mechanisms of large neural networks, particularly transformer-based LLMs. Activation patching performs targeted causal interventions to identify functionally critical subgraphs, while ACDC systematizes and automates subgraph discovery to yield sparse, task-relevant "circuits"—minimal subnetworks that preserve key model behaviors. Extensions include algorithmic accelerations, position-aware variants, provable robustness guarantees, and integration with gradient- and relevance-based attribution methods, collectively supporting the scaling of mechanistic insights to ever-larger models and more complex reasoning phenomena.

1. Foundations: Activation Patching and Circuit Discovery

Activation patching is a causal intervention technique for mechanistic interpretability. Given two inputs, a "clean" example $x_{clean}$ and a "corrupted" example $x_{corr}$, activation patching evaluates the effect of replacing the activation $e_{clean}$ at a specified component or edge $E$ with $e_{corr}$ from the corrupted input, quantifying the change in a downstream metric $L$ (e.g., logit difference or loss):

$$\text{Activation-Patching score}(E) = \bigl| L(\mathrm{do}(E = e_{corr})) - L(\mathrm{do}(E = e_{clean})) \bigr|$$

Automated Circuit Discovery (ACDC) aims to identify a minimal subgraph (the "circuit") $C \subseteq G$ of a model's computation graph $G$ such that $C$ suffices to preserve task-relevant behaviors under patching interventions. The core ACDC algorithm prunes edges greedily: for each edge $(w \to v)$, the change in $L$ upon patching is measured, and edges with impact below a threshold $\tau$ are pruned. The result is a sparse circuit consistent with the original model's behavior on designated tasks (Conmy et al., 2023).
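
To make these definitions concrete, here is a minimal, self-contained PyTorch sketch: a forward hook caches a component's activation on the corrupted input and splices it into a clean run, and components whose patching score falls below a threshold are dropped, ACDC-style. The toy model, metric, and threshold are hypothetical stand-ins, not the reference ACDC implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Stand-in for a transformer: two parallel 'components' feeding a readout."""
    def __init__(self, d=8):
        super().__init__()
        self.comp_a = nn.Linear(d, d)
        self.comp_b = nn.Linear(d, d)
        self.readout = nn.Linear(d, 2)

    def forward(self, x):
        return self.readout(self.comp_a(x) + self.comp_b(x))

model = ToyModel()
x_clean, x_corr = torch.randn(1, 8), torch.randn(1, 8)

def metric(logits):
    # Downstream metric L, here a logit difference between two answers.
    return (logits[0, 0] - logits[0, 1]).item()

def patched_metric(component, x_run, x_source):
    """Run on x_run with `component`'s activation replaced by its value on x_source."""
    cached = {}
    def cache(mod, inp, out): cached["act"] = out.detach()   # store, don't modify
    def patch(mod, inp, out): return cached["act"]           # returning replaces output
    with torch.no_grad():
        h = component.register_forward_hook(cache); model(x_source); h.remove()
        h = component.register_forward_hook(patch); out = model(x_run); h.remove()
    return metric(out)

with torch.no_grad():
    baseline = metric(model(x_clean))

tau = 0.05  # hypothetical pruning threshold
scores = {name: abs(patched_metric(comp, x_clean, x_corr) - baseline)
          for name, comp in [("comp_a", model.comp_a), ("comp_b", model.comp_b)]}
circuit = [name for name, s in scores.items() if s >= tau]  # keep high-impact components
print(scores, "-> circuit:", circuit)
```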

2. Edge Attribution and Gradient-Based Approximations

Activation patching as originally formulated is computationally expensive: cost scales linearly with the number of components, since each candidate intervention requires a separate forward pass. Edge Attribution Patching (EAP) introduces a first-order Taylor approximation that estimates the impact of corrupting $E$:

$$\Delta_e L \approx (e_{corr} - e_{clean})^\top \left. \frac{\partial L}{\partial e} \right|_{e = e_{clean}}$$

This enables computation of all edge importances with two forward passes and one backward pass (Syed et al., 2023). However, basic gradient-based methods can suffer from the zero-gradient problem and saturation effects, rendering component significance estimates unreliable. To address these limitations, EAP-GP (Edge Attribution Patching with GradPath) adaptively constructs integration paths in activation space, steering around saturation regions by following the maximal change in $\|G(\cdot) - G(\cdot, x_u')\|_2^2$, thus accumulating more informative gradients and further improving the faithfulness of circuit identification (Zhang et al., 7 Feb 2025).

Method   Faithfulness (NFS, IOI)   Compute Time (IOI, 97.5% sparsity)
EAP      56.9%                     12 s
EAP-IG   62.4%                     50 s
EAP-GP   80.1%                     230 s
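
The first-order EAP estimate above translates directly into code. The following sketch, again on a hypothetical toy model, caches corrupted activations in one forward pass, then obtains clean activations and their gradients from one forward and one backward pass, scoring every component at once:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.comp_a, self.comp_b = nn.Linear(d, d), nn.Linear(d, d)
        self.readout = nn.Linear(d, 2)

    def forward(self, x, cache=None):
        a, b = self.comp_a(x), self.comp_b(x)
        if cache is not None:
            if a.requires_grad:              # only retain grads on the clean pass
                a.retain_grad(); b.retain_grad()
            cache["comp_a"], cache["comp_b"] = a, b
        return self.readout(a + b)

model = ToyModel()
x_clean, x_corr = torch.randn(1, 8), torch.randn(1, 8)

# Forward pass 1: cache corrupted activations (no gradients needed).
corr_cache = {}
with torch.no_grad():
    model(x_corr, cache=corr_cache)

# Forward pass 2 plus a single backward pass: clean activations and dL/de.
clean_cache = {}
logits = model(x_clean, cache=clean_cache)
L = logits[0, 0] - logits[0, 1]              # logit-difference metric
L.backward()

# First-order estimate (e_corr - e_clean)^T dL/de for every component at once.
for name, e_clean in clean_cache.items():
    score = ((corr_cache[name] - e_clean) * e_clean.grad).sum().item()
    print(f"{name}: EAP estimate = {score:+.4f}")
```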

3. Algorithmic Extensions and Position Awareness

Recent work challenges the position-invariance assumption in prior circuit discovery: components do not generally contribute equally at all sequence positions. Position-aware Automatic Circuit Discovery (PEAP) refines edge attribution patching by differentiating edges not only by component type but also by token position, enabling the isolation of cross-positional mechanisms. Variable-length inputs are addressed via dataset schemas—abstract mappings that cluster token spans with equivalent semantic roles—which allow edge importances to be consistently aggregated and circuits to be discovered in non-templatic prompts. Empirically, position-aware methods achieve equivalent faithfulness with 10–20× fewer edges than position-agnostic baselines (Haklay et al., 7 Feb 2025).
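
The core distinction can be illustrated in a few lines. In this sketch the tensors are random placeholders with a [batch, seq, d_model] layout, not outputs of any particular model; the only point is where the sequence axis is, or is not, reduced:

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 5, 8
e_clean = torch.randn(1, seq_len, d_model)   # component activation, clean run
e_corr = torch.randn(1, seq_len, d_model)    # same component, corrupted run
grad = torch.randn(1, seq_len, d_model)      # stand-in for dL/de at e_clean

# Position-agnostic EAP collapses the sequence axis into a single scalar ...
agg_score = ((e_corr - e_clean) * grad).sum().item()

# ... whereas a position-aware variant keeps one score per token position,
# so edges can be kept or pruned independently at each position.
per_position = ((e_corr - e_clean) * grad).sum(dim=-1).squeeze(0)
print(f"aggregate: {agg_score:+.3f}, per position: {per_position.tolist()}")
```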

4. Efficient and Faithful Circuit Discovery: Alternatives and Accelerations

Multiple algorithmic innovations address the scalability and faithfulness bottlenecks inherent to classical ACDC:

  • Contextual Decomposition for Transformers (CD-T): Computes relevant/irrelevant decompositions of activations algebraically in a single forward pass, eliminating patching or gradients and doubling efficiency over vanilla ACDC, while maintaining comparable recovery fidelity (Hsu et al., 2024).
  • Dictionary Learning ACDC: Decomposes module outputs into sparse, monosemantic dictionary features, then recursively traces contributions to final outputs, circumventing out-of-distribution issues associated with patching and reducing asymptotic discovery cost (He et al., 2024).
Method            Asymptotic Time        OOD Shift
Causal Patching   O(n) model forwards    Yes
Direct Patching   O(n) module forwards   Yes
ACDC-Dict         O(n) module forwards   No
  • PAHQ (Mixed-Precision Inference): Accelerates ACDC by dynamically quantizing non-critical components to FP8/bfloat16 while keeping patched heads in full precision during activation patching passes, reducing runtime by ~80% and memory by ~30% at negligible loss in faithfulness (Wang et al., 27 Oct 2025).
  • DiscoGP: Formulates circuit discovery as a differentiable masking problem, jointly learning sparse parameter and edge masks under a functional faithfulness and completeness objective (see the sketch after this list). This produces circuits with near-perfect task fidelity and minimal overlap with the non-circuit complement (Yu et al., 2024).
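
The differentiable-masking idea behind DiscoGP can be sketched under strong simplifications: a sigmoid gate per component, an L1-style penalty standing in for the sparsity objective, and a frozen toy model. All names and hyperparameters here are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_components = 8, 4
components = nn.ModuleList([nn.Linear(d, d) for _ in range(n_components)])
readout = nn.Linear(d, 2)
for p in list(components.parameters()) + list(readout.parameters()):
    p.requires_grad_(False)            # model weights stay frozen; only the mask learns
mask_logits = nn.Parameter(torch.zeros(n_components))  # one learnable gate per component

def forward(x, gates):
    # Gate each component's contribution before the shared readout.
    return readout(sum(g * comp(x) for g, comp in zip(gates, components)))

x = torch.randn(32, d)
with torch.no_grad():
    full_logits = forward(x, torch.ones(n_components))  # reference behavior of full model

opt = torch.optim.Adam([mask_logits], lr=0.1)
lambda_sparse = 0.05                   # hypothetical sparsity weight
for _ in range(200):
    gates = torch.sigmoid(mask_logits)
    faithfulness = nn.functional.mse_loss(forward(x, gates), full_logits)
    loss = faithfulness + lambda_sparse * gates.sum()   # faithfulness vs. sparsity
    opt.zero_grad(); loss.backward(); opt.step()

kept = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten().tolist()
print("components kept in the circuit:", kept)
```

In the full method the masks operate at the granularity of individual weights and edges rather than whole components, and the objective additionally enforces completeness with respect to the non-circuit complement.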

5. Robustness, Provable Guarantees, and Subspace Limitations

Classical circuit discovery methods do not guarantee functional equivalence to the original model across broad input domains or under arbitrary interventions. Formal Mechanistic Interpretability extends ACDC with neural network verification, providing circuits that are provably input-robust (faithful on continuous input regions), patching-robust (stable under all admissible counterfactual activations), and minimal in several senses (quasi-, subset-, cardinally-minimal). Experiments demonstrate that provably robust circuits match or surpass the faithfulness of heuristic baselines, albeit at substantially higher computational cost (Hadad et al., 18 Feb 2026).

A parallel controversy concerns subspace activation patching. Empirical and formal results demonstrate that patching aligned with low-dimensional subspaces can produce illusory interpretability: effective interventions may exploit dormant or causally disconnected paths. Faithful feature localization requires diagnostic criteria, including null/row-space decomposition relative to readout weights, alignment with known writers and readers, and cross-distribution validation (Makelov et al., 2023).
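
One of these diagnostics, the null/row-space decomposition, is simple linear algebra: only the component of a patching direction lying in the row space of the readout weights can causally influence the readout, so a patch whose effect depends on the null-space component is a candidate interpretability illusion. A minimal sketch with random placeholder weights:

```python
import torch

torch.manual_seed(0)
d_model, d_out = 16, 4
W = torch.randn(d_out, d_model)       # readout weights (rows read from d_model space)
v = torch.randn(d_model)              # candidate patching direction

# Orthogonal projector onto the row space of W: P = W^T (W W^T)^{-1} W.
P = W.T @ torch.linalg.solve(W @ W.T, W)
v_row = P @ v                         # component the readout can see
v_null = v - v_row                    # dormant component, invisible to W

print(f"row-space fraction:  {v_row.norm()**2 / v.norm()**2:.3f}")
print(f"null-space fraction: {v_null.norm()**2 / v.norm()**2:.3f}")
# Sanity check: W maps the null-space component to (numerically) zero.
print("|W v_null| =", (W @ v_null).norm().item())
```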

6. Relevance-Based and Patch-Free Approaches

Recent advances leverage propagation-based attributions:

  • RelP (Relevance Patching): Replaces the local gradients in attribution patching with Layer-wise Relevance Propagation (LRP) coefficients, yielding a Pearson correlation of ~0.95 with gold-standard activation patching on MLP outputs, compared to 0.006 using gradients (Jafari et al., 28 Aug 2025). RelP inherits the efficiency of gradient-based attribution and matches Integrated Gradients in faithfulness at a fraction of the computational cost.
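
As a rough illustration of the substitution RelP makes (not the paper's implementation), the sketch below applies a standard epsilon-LRP rule to a single linear layer and uses the resulting propagation coefficient in place of the gradient in the attribution-patching formula; the layer, activations, and incoming relevance are all placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)
e_clean, e_corr = torch.randn(8), torch.randn(8)
relevance_out = torch.randn(8)   # relevance arriving from layers above (placeholder)
eps = 1e-6                       # LRP stabilizer

with torch.no_grad():
    # epsilon-LRP on one linear layer: redistribute output relevance in
    # proportion to each input's contribution to the pre-activation z.
    z = layer(e_clean)
    s = relevance_out / (z + eps * z.sign())
    coeff = layer.weight.T @ s   # propagation coefficient, used in place of dL/de

    # Attribution-patching estimate with the gradient swapped for the coefficient.
    rel_score = ((e_corr - e_clean) * coeff).sum().item()
print(f"relevance-patching estimate = {rel_score:+.4f}")
```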

Patch-free methods such as ACDC-Dict and CD-T avoid out-of-distribution effects and local approximation errors by operating directly on interpretable features or algebraic decompositions, further improving reliability and scalability in circuit extraction.

7. Practical Implications, Limitations, and Future Directions

Activation patching and ACDC frameworks now underpin much of mechanistic interpretability, driving the automated discovery of sparse, functionally critical subgraphs in transformer architectures used for language, vision, and beyond. Recent innovations address the major bottlenecks in scalability, faithfulness, positional specificity, robustness, and efficiency. Open challenges and future research include:

  • Extending frameworks to trillion-parameter models, chain-of-thought reasoning, and more granular circuit constructs (neurons, weights).
  • Integrating robust patching or regularizers at training time to enforce interpretable architectures.
  • Systematically combining fast first-pass approaches (gradients, relevance, dictionary) with selective patching for optimal faithfulness and compute efficiency.
  • Formal analysis of out-of-distribution effects, generalization properties, and circuit stability across varied data domains.
  • Developing standardized evaluation metrics and ground-truth benchmarks for large scale and real-world settings.

Together, these methodologies constitute a rapidly advancing toolkit for principled reverse engineering of neural mechanisms at unprecedented scale, with clear implications for transparency, safety, and model control in foundation models.
