Assess adequacy of gradient-based attribution patching for circuit discovery

Establish whether gradient-based attribution patching methods, including AtP*, provide adequate approximations to activation patching for identifying task-relevant components and circuits in large language models, and determine the conditions under which such approximations are faithful.

Background

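Activation patching is a gold-standard causal intervention for identifying task-relevant components, but it requires a separate forward pass for each component patched, which can be prohibitively expensive at scale. Gradient-based attribution patching has been proposed as a faster approximation: it estimates the effect of patching every component from a small number of forward and backward passes, so that candidates can be prioritized for exact intervention.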

Sharkey et al. (2025) point out that these methods are only first-order approximations, which raises doubts about their faithfulness when compared to exact causal interventions. Clarifying when the approximation is adequate and where it fails is crucial for scaling automated circuit discovery.
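
As a minimal sketch of what "first-order" means here, consider a toy scalar setting. Activation patching reruns the model with a component's activation replaced and measures the exact change in a metric. Attribution patching instead estimates that change as (a_corrupt - a_clean) multiplied by the gradient of the metric at the clean activation. The names below (metric, a_clean, a_corrupt) and the use of tanh as a stand-in "model" are illustrative assumptions, not taken from the cited paper.

```python
import math


def metric(a: float) -> float:
    """Toy 'model': a single scalar activation passed through tanh.

    Hypothetical stand-in for a network component followed by a
    saturating nonlinearity; chosen only for illustration.
    """
    return math.tanh(a)


def metric_grad(a: float) -> float:
    """Analytic derivative of tanh: d/da tanh(a) = 1 - tanh(a)^2."""
    return 1.0 - math.tanh(a) ** 2


def activation_patch_effect(a_clean: float, a_corrupt: float) -> float:
    """Exact effect: rerun the model with the activation replaced."""
    return metric(a_corrupt) - metric(a_clean)


def attribution_patch_estimate(a_clean: float, a_corrupt: float) -> float:
    """First-order estimate: activation difference times the gradient
    evaluated at the clean activation."""
    return (a_corrupt - a_clean) * metric_grad(a_clean)


def compare(a_clean: float, a_corrupt: float) -> tuple[float, float, float]:
    """Return (exact effect, first-order estimate, absolute error)."""
    exact = activation_patch_effect(a_clean, a_corrupt)
    approx = attribution_patch_estimate(a_clean, a_corrupt)
    return exact, approx, abs(exact - approx)


# Small perturbation around an unsaturated point.
small = compare(a_clean=0.0, a_corrupt=0.1)

# Large perturbation starting from a saturated point, where the
# gradient at the clean activation is close to zero.
large = compare(a_clean=4.0, a_corrupt=-4.0)

for label, (exact, approx, err) in (("small", small), ("large", large)):
    print(f"{label}: exact={exact:.4f} estimate={approx:.4f} abs_error={err:.4f}")
```

In this toy case the two quantities agree closely for the small perturbation, but when the clean activation sits in a saturated region the gradient is near zero and the linear estimate misses most of the true effect. How often such cases arise in real language models, and whether refinements such as AtP* handle them adequately, is part of the open question.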

References

However, attribution patching uses gradients, which only yield a first-order approximation of the effect of ablating components, leaving it unclear whether this method and any improvements on it produce adequate approximations.

Source: Open Problems in Mechanistic Interpretability (arXiv:2501.16496, Sharkey et al., 27 Jan 2025), in "Proceduralizing mechanistic interpretability into circuit discovery pipelines", Section 2.3, bullet “Scalable methods are only approximate.”
