Explain why zero-activation patching outperforms corrupted-activation patching

Determine why activation patching with zero activations often yields lower loss and better subgraph recovery than activation patching with activations from corrupted inputs when Automatic Circuit DisCovery (ACDC), Subnetwork Probing (SP), and Head Importance Score for Pruning (HISP) are applied to transformer models. In particular, ascertain whether "negative" components that harm task performance are responsible for this effect, and under what conditions each patching scheme is preferable.

Background

The paper compares two activation patching schemes widely used in mechanistic interpretability: zero activations (setting targeted activations to zero) and corrupted activations (interchange interventions that substitute activations from matched inputs lacking the behavior). Across multiple tasks and methods (ACDC, SP, HISP), the authors observe that zero activation patching can yield lower loss than corrupted activations, even though zeroing is intuitively more destructive.
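The difference between the two schemes can be sketched on a toy model. This is a minimal illustration and not the paper's implementation: the "model" is assumed to be a plain sum of scalar component outputs, and the names `run_patched`, `keep`, `toy_components`, and the example inputs are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy "model": each named component maps an input vector to a
# scalar contribution, and the model output is the sum of all contributions.
Component = Callable[[List[float]], float]


def run_patched(
    components: Dict[str, Component],
    keep: Set[str],
    clean_input: List[float],
    corrupted_input: List[float],
    scheme: str,
) -> float:
    """Run the toy model on clean_input, patching every component not in `keep`.

    scheme="zero":      a patched component contributes 0.
    scheme="corrupted": a patched component contributes what it would output
                        on corrupted_input (an interchange intervention).
    """
    if scheme not in ("zero", "corrupted"):
        raise ValueError(f"unknown patching scheme: {scheme!r}")
    total = 0.0
    for name, fn in components.items():
        if name in keep:
            total += fn(clean_input)      # unpatched: clean activation
        elif scheme == "corrupted":
            total += fn(corrupted_input)  # patched with corrupted activation
        # scheme == "zero": patched component adds nothing
    return total


# Made-up components and inputs, for illustration only.
toy_components: Dict[str, Component] = {
    "a": lambda x: 2.0 * x[0],
    "b": lambda x: x[1] - x[0],
}
clean = [1.0, 3.0]
corrupted = [0.5, 1.0]

zero_out = run_patched(toy_components, {"a"}, clean, corrupted, "zero")
corr_out = run_patched(toy_components, {"a"}, clean, corrupted, "corrupted")
print(zero_out, corr_out)  # 2.0 2.5
```

In both calls component `a` runs on the clean input; only the treatment of the patched component `b` differs. In a real transformer the patched activations are tensors that feed downstream layers, so the effect is not a simple additive shift as it is here.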

The authors note a potential hypothesis that zero activations might better disrupt "negative" components—internal units whose contributions reduce performance on the target behavior—but emphasize that the underlying reason for the observed advantage of zero activations remains unclear. Clarifying this mechanism would guide principled choices of patching schemes in automated circuit discovery.
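The arithmetic behind the "negative component" hypothesis can be made concrete with made-up numbers. The sketch below assumes an additive task metric (for example, a logit difference, where higher is better) and one component whose contribution is negative on both clean and corrupted inputs. All names and values are hypothetical; the example only shows what the hypothesis would predict and does not establish that it explains the reported results.

```python
# Made-up per-component contributions to an additive task metric.
clean_contrib = {"helper_1": 2.0, "helper_2": 1.5, "negative": -1.0}
# Assumption: the negative component still hurts the metric on corrupted inputs.
corrupted_contrib = {"helper_1": 0.25, "helper_2": 0.0, "negative": -0.5}


def subgraph_metric(keep: set, scheme: str) -> float:
    """Metric of a subgraph that keeps `keep` and patches everything else."""
    if scheme not in ("zero", "corrupted"):
        raise ValueError(f"unknown patching scheme: {scheme!r}")
    total = 0.0
    for name, value in clean_contrib.items():
        if name in keep:
            total += value                    # kept: clean contribution
        elif scheme == "corrupted":
            total += corrupted_contrib[name]  # patched with corrupted value
        # scheme == "zero": patched component contributes nothing
    return total


subgraph = {"helper_1", "helper_2"}           # excludes the negative component
full_model = subgraph_metric(set(clean_contrib), "zero")  # nothing patched
zero_metric = subgraph_metric(subgraph, "zero")
corrupted_metric = subgraph_metric(subgraph, "corrupted")
print(full_model, zero_metric, corrupted_metric)  # 2.5 3.5 3.0
```

Under these assumed numbers, zeroing removes the negative contribution entirely, while corrupted patching leaves part of it in place, so the zero-patched subgraph scores higher. If the negative component's corrupted contribution were near zero instead, the two schemes would give nearly the same metric, which is one reason the hypothesis needs empirical testing.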

References

In the main text experiments that compared using corrupted activations and zero activations (Fig. ... ), all three methods recovered subgraphs with generally lower loss when doing activation patching with zeros, in both the experiments with the normal model and with permuted weights. It is unclear why the methods achieve better results with corruptions that are likely to be more destructive.

Source: Towards Automated Circuit Discovery for Mechanistic Interpretability (Conmy et al., 2023; arXiv:2304.14997), Appendix: Activation patching with zeros, instead of corrupted input
