
Understand Subnetwork Probing’s unexpected selections in the OR-gate toy model

Ascertain why Subnetwork Probing (SP) selects the key and value inputs of attention head a0.0 in a toy transformer designed to implement a simple OR gate, despite the ground-truth circuit requiring only the outputs of the two attention heads into the downstream MLP, and characterize the factors that lead SP to include these additional inputs.


Background

To analyze limitations of automated circuit discovery methods, the authors construct a toy transformer that implements an OR gate using two attention heads feeding into an MLP. In this setup, the ground-truth minimal circuit consists only of the two head outputs into the MLP.
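To make the setup concrete, here is a minimal numerical sketch of how an MLP can compute OR over two binary "head output" signals. The weights below are illustrative choices that make OR exact on {0, 1} inputs; they are not the paper's actual construction.

```python
# Hypothetical sketch: an OR gate over two head-output bits via a tiny ReLU MLP.
# OR(h0, h1) = relu(h0 + h1) - relu(h0 + h1 - 1) is exact on binary inputs.

def relu(x: float) -> float:
    return max(0.0, x)

def mlp_or(h0: float, h1: float) -> float:
    """OR of two binary head outputs using two ReLU units and a linear readout."""
    s = h0 + h1          # both heads feed additively into the MLP input
    return relu(s) - relu(s - 1.0)
```

Note that only the two head outputs enter this computation; the heads' own key and value inputs play no role in the OR itself, which is why the ground-truth circuit excludes them.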

When the methods are applied, ACDC typically retains only one of the two head inputs (an artifact of its iterative pruning), HISP misses both, and SP unexpectedly includes additional components, namely the key and value inputs of head a0.0. The authors state explicitly that they are unsure why SP selects these inputs, leaving an unresolved methodological behavior in SP that warrants explanation.

References

"We are unsure why SP finds the a0.0's key and value inputs."

Conmy et al., 2023, "Towards Automated Circuit Discovery for Mechanistic Interpretability" (arXiv:2304.14997), Appendix: "Automated Circuit Discovery and OR gates".