Composition of circuits within Transformer blocks

Determine how multiple circuits within a single Transformer block compose to produce the block’s additive update to the residual stream, in order to enable precise counterfactual interventions on individual intermediate variables.

Background

Circuit probing uncovers sparse subnetworks (circuits) that compute hypothesized intermediate variables inside Transformer attention and MLP blocks. The method supports causal ablations, but it cannot yet support counterfactual replacement of variables, because how multiple circuits jointly contribute to a block’s additive residual update has not been characterized.
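
The gap can be stated a little more formally. The notation below is assumed here for illustration and does not come from the source: $x_\ell$ is the residual stream entering an attention or MLP block $\ell$, $f_\ell$ is the function computed by that block, $c_1, \dots, c_k$ are circuits found inside it, and $g$ is the unknown rule that combines them.

```latex
x_{\ell+1} = x_\ell + f_\ell(x_\ell),
\qquad
f_\ell(x_\ell) \overset{?}{=} g\bigl(c_1(x_\ell), \dots, c_k(x_\ell)\bigr)
```

The first equality is the usual residual update. The open question is the form of $g$, for example whether it is a plain sum. Replacing a single $c_i(x_\ell)$ with a counterfactual value is well defined only once $g$ is known.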

Establishing a concrete compositional model for how these circuits combine within a block would close this gap and enable targeted counterfactual interventions, advancing mechanistic interpretability beyond ablations to controlled variable manipulation.
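
To make the difference between ablation and counterfactual replacement concrete, here is a minimal toy sketch in NumPy. It is not the circuit probing method itself. It assumes a single MLP-style block whose two hypothetical circuits, `circuit_a` and `circuit_b`, are disjoint sets of hidden neurons, so the block update is additive over circuits by construction. Whether anything like this holds for circuits found in real Transformer blocks is exactly the open question. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 6

# Toy MLP block: update(x) = W_out @ relu(W_in @ x)
W_in = rng.normal(size=(d_hidden, d_model))
W_out = rng.normal(size=(d_model, d_hidden))

# ASSUMPTION: circuits are disjoint neuron masks (hypothetical simplification).
circuit_a = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
circuit_b = ~circuit_a


def hidden(x):
    """Hidden activations of the toy block."""
    return np.maximum(W_in @ x, 0.0)


def block_update(x, ablate=None, replace=None, x_cf=None):
    """Additive residual update, optionally intervening on one circuit.

    ablate:  mask of neurons to zero out (causal ablation).
    replace: mask of neurons whose activations are taken from the
             counterfactual input x_cf (counterfactual replacement).
    """
    h = hidden(x)
    if ablate is not None:
        h = np.where(ablate, 0.0, h)
    if replace is not None:
        h = np.where(replace, hidden(x_cf), h)
    return W_out @ h


x = rng.normal(size=d_model)      # original residual stream input
x_cf = rng.normal(size=d_model)   # counterfactual input

full = block_update(x)
ablated = block_update(x, ablate=circuit_a)
swapped = block_update(x, replace=circuit_a, x_cf=x_cf)

# Per-circuit contributions under the assumed additive model.
contrib_a = W_out[:, circuit_a] @ hidden(x)[circuit_a]
contrib_b = W_out[:, circuit_b] @ hidden(x)[circuit_b]
contrib_a_cf = W_out[:, circuit_a] @ hidden(x_cf)[circuit_a]

# Residual stream after the block, with and without the intervention.
x_next = x + full
x_next_swapped = x + swapped
```

In this toy, `full` equals `contrib_a + contrib_b`, ablation leaves only `contrib_b`, and the swap yields `contrib_b + contrib_a_cf`. The swap is well defined only because the composition rule is known here. If circuits overlap or interact nonlinearly, no such per-circuit substitution is available without a compositional model.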

References

It is currently unknown how multiple circuits compose within a given block to create one additive update to the residual stream, so one cannot replace individual variables to elicit counterfactual behavior.

— Uncovering Intermediate Variables in Transformers using Circuit Probing (2311.04354 - Lepori et al., 2023) in Discussion, Limitations
