Explain why zero-activation patching outperforms corrupted-activation patching
Determine why activation patching with zero activations often yields lower loss and better subgraph recovery than activation patching with activations from corrupted inputs when Automatic Circuit DisCovery (ACDC), Subnetwork Probing (SP), and Head Importance Score for Pruning (HISP) are applied to transformer models. Ascertain whether "negative" components, which harm task performance, are responsible for this effect, and identify the conditions under which each patching scheme is preferable.
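To make the "negative component" hypothesis concrete, here is a minimal toy sketch in Python. It is not ACDC, SP, or HISP, and it is not evidence for the hypothesis. It assumes a purely additive model in which each component adds a fixed contribution to the correct-answer logit, and it assumes the harmful component stays mostly active on corrupted inputs. All names and numbers (`CLEAN`, `CORRUPTED`, `task_loss`) are hypothetical.

```python
# Toy additive "model": the correct-answer logit is the sum of per-component
# contributions. All values are made up for illustration.
CLEAN = {"helper": 2.0, "negative": -1.0, "irrelevant": 0.0}

# Assumption: on corrupted inputs the helper loses its signal, while the
# negative component still pushes against the correct answer.
CORRUPTED = {"helper": 0.0, "negative": -0.75, "irrelevant": 0.0}


def task_loss(kept, ablation):
    """Loss when components in `kept` use clean activations and all other
    components are patched with zeros or with corrupted activations."""
    if ablation not in ("zero", "corrupted"):
        raise ValueError(f"unknown ablation: {ablation!r}")
    logit = 0.0
    for name, clean_value in CLEAN.items():
        if name in kept:
            logit += clean_value          # component is part of the subgraph
        elif ablation == "corrupted":
            logit += CORRUPTED[name]      # patched with corrupted activation
        # zero patching: the component contributes nothing
    return -logit  # lower loss means a higher correct-answer logit


subgraph = {"helper"}
full_model = set(CLEAN)

print("full model:        ", task_loss(full_model, "zero"))
print("zero patching:     ", task_loss(subgraph, "zero"))
print("corrupted patching:", task_loss(subgraph, "corrupted"))
```

In this toy setting, zero patching removes the negative component entirely, so the subgraph scores a lower loss than both the corrupted-patched subgraph and the full model. If the negative component were instead inactive on corrupted inputs, the two schemes would coincide here, which is one way to frame the question of when each scheme is preferable.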
References
In the main text experiments that compared using corrupted activations and zero activations (Fig. ... ), all three methods recovered subgraphs with generally lower loss when doing activation patching with zeros, in both the experiments with the normal model and with permuted weights. It is unclear why the methods achieve better results with corruptions that are likely to be more destructive.