Assess adequacy of gradient-based attribution patching for circuit discovery
Establish whether gradient-based attribution patching methods, including AtP*, provide adequate approximations to activation patching for identifying task-relevant components and circuits in large language models, and determine the conditions under which such approximations are faithful.
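To make the gap between the two quantities concrete, here is a minimal sketch in plain Python. It assumes a toy "model" whose metric is a tanh readout of three component activations. The functions `metric`, `activation_patching`, and `attribution_patching`, and all numeric values, are hypothetical and chosen only for illustration. This is not AtP* and not any published implementation. It compares the exact effect of patching one component at a time with the first-order estimate (corrupted minus clean activation, times the gradient of the metric on the clean run).

```python
import math

def metric(acts, weights):
    """Toy downstream computation: a saturating (tanh) readout of the activations."""
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))

def activation_patching(clean, corrupt, weights):
    """Exact effect of replacing one clean activation at a time with its corrupted value."""
    base = metric(clean, weights)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupt[i]  # patch a single component
        effects.append(metric(patched, weights) - base)
    return effects

def attribution_patching(clean, corrupt, weights):
    """First-order estimate: (corrupt - clean) * dM/da, gradient taken on the clean run."""
    pre = sum(w * a for w, a in zip(weights, clean))
    dtanh = 1.0 - math.tanh(pre) ** 2  # derivative of tanh at the clean pre-activation
    return [(c - a) * w * dtanh for a, c, w in zip(clean, corrupt, weights)]

# Hypothetical activations: component 0 changes slightly, component 1 changes a lot,
# component 2 does not change at all.
weights = [1.0, 1.0, 1.0]
clean = [0.1, 0.0, -0.1]
corrupt = [0.15, 3.0, -0.1]

exact = activation_patching(clean, corrupt, weights)
approx = attribution_patching(clean, corrupt, weights)

for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"component {i}: exact={e:.4f}  first-order={a:.4f}  error={abs(e - a):.4f}")
```

In this toy setting the estimate is close for the small change in component 0 and far off for the large change in component 1, where the tanh saturates and the linearization overshoots. Whether comparable errors arise in real language models, and under what conditions, is exactly what the problem statement above leaves open.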
References
However, attribution patching uses gradients, which only yield a first-order approximation of the effect of ablating components, leaving it unclear whether this method and any improvements on it produce adequate approximations.
— Open Problems in Mechanistic Interpretability
(2501.16496 - Sharkey et al., 27 Jan 2025), in "Proceduralizing mechanistic interpretability into circuit discovery pipelines", Section 2.3, bullet "Scalable methods are only approximate."