Mapping circuit-broken representations to semantic directions

Develop and evaluate a circuit-breaking rerouting strategy that maps the circuit-broken representation onto semantically meaningful directions in the model’s residual stream—specifically a refusal-direction vector or the end-of-sequence (EOS) token embedding—to assess its impact on robust alignment and capability retention.

Background

In describing the Representation Rerouting (RR) loss, Zou et al. (2024) suggest alternative targets for rerouting harmful representations: semantically meaningful directions such as a refusal direction or the embedding of the EOS token. They explicitly leave the evaluation of such mappings to future work.

This points to a concrete extension of RR in which the rerouting target is not a random or orthogonal direction but one tied to interpretable semantics, which could improve controllability and robustness.
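To make the contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes the orthogonality-style RR objective takes the form ReLU(cos(rep_orig, rep_c/b)), and it introduces a hypothetical semantic variant, 1 - cos(rep_c/b, target), where the target stands in for a refusal direction or the EOS embedding. The function names, the toy three-dimensional vectors, and the choice of cosine distance for the semantic variant are all assumptions made for illustration. In practice these quantities would be hidden states and vectors taken from a real model.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity with a small floor on the denominator."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / max(denom, eps)


def orthogonality_rr_loss(rep_orig, rep_cb):
    """Assumed orthogonality-style objective: ReLU(cos(rep_orig, rep_cb)).

    Reaches zero as soon as the circuit-broken representation is
    orthogonal (or opposed) to the original one, in ANY direction.
    """
    return max(0.0, cosine_similarity(rep_orig, rep_cb))


def semantic_rr_loss(rep_cb, target_direction):
    """Hypothetical semantic variant: 1 - cos(rep_cb, target_direction).

    Reaches zero only when the circuit-broken representation points
    along the chosen target (e.g. a refusal direction or EOS embedding).
    """
    return 1.0 - cosine_similarity(rep_cb, target_direction)


# Toy vectors, purely illustrative.
rep_orig = [1.0, 0.0, 0.0]           # original representation on a harmful input
refusal_direction = [0.0, 1.0, 0.0]  # hypothetical semantic target

rep_cb_arbitrary = [0.0, 0.0, 2.0]   # orthogonal to rep_orig, but not on target
rep_cb_on_target = [0.0, 3.0, 0.0]   # orthogonal to rep_orig and on target

for name, rep_cb in [("arbitrary", rep_cb_arbitrary), ("on_target", rep_cb_on_target)]:
    print(
        name,
        orthogonality_rr_loss(rep_orig, rep_cb),
        semantic_rr_loss(rep_cb, refusal_direction),
    )
```

In this toy setting both circuit-broken vectors satisfy the orthogonality objective equally well, while only the one aligned with the target minimizes the semantic variant. That is the sense in which a semantic target constrains where the rerouted representation ends up. Whether this helps robustness or capability retention is the open question.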

References

Additionally, one could map $\text{rep}_\text{c/b}$ onto more semantically meaningful directions, such as a refusal direction or the embedding of the EOS token. We leave this to future work.

— Improving Alignment and Robustness with Circuit Breakers (2406.04313 - Zou et al., 2024) in Section 3 (Circuit Breaking with Representation Engineering), Loss paragraph
