Mapping circuit-broken representations to semantic directions
Develop and evaluate a circuit-breaking rerouting strategy that maps the circuit-broken representation onto a semantically meaningful direction in the model’s residual stream, specifically a refusal-direction vector or the end-of-sequence (EOS) token embedding, and assess its impact on robust alignment and capability retention.
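One way such a rerouting objective might be written is as a cosine-based term that pulls the circuit-broken representation toward a fixed target vector. The sketch below is illustrative only and is not taken from the cited paper: the function name `target_direction_loss`, the `1 - cos` form, and the toy 3-dimensional vectors are all assumptions. In practice the target would be a refusal direction or EOS embedding extracted from the model, and the loss would be computed on residual-stream activations rather than hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def target_direction_loss(rep_cb, target):
    """Hypothetical rerouting term (an assumption, not the paper's loss).

    Returns 0 when rep_cb points along target, 1 when the two are
    orthogonal, and 2 when they point in opposite directions.
    """
    return 1.0 - cosine_similarity(rep_cb, target)


# Toy stand-ins; a real target would be a refusal direction or the
# EOS token embedding taken from the model (both assumed here).
refusal_direction = [1.0, 0.0, 0.0]
aligned_rep = [2.0, 0.0, 0.0]      # already points along the target
orthogonal_rep = [0.0, 3.0, 0.0]   # carries no component along the target

loss_aligned = target_direction_loss(aligned_rep, refusal_direction)
loss_orthogonal = target_direction_loss(orthogonal_rep, refusal_direction)
```

Minimizing a term of this shape would reward representations that align with the chosen direction. Whether that improves robustness or preserves capabilities is exactly what the problem statement asks to be evaluated.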
References
Additionally, one could map $\text{rep}_\text{c/b}$ onto more semantically meaningful directions, such as a refusal direction or the embedding of the EOS token. We leave this to future work.
— Improving Alignment and Robustness with Circuit Breakers
(2406.04313 - Zou et al., 6 Jun 2024) in Section 3 (Circuit Breaking with Representation Engineering), Loss paragraph