Explain why representation hijacking bypasses refusal mechanisms

Determine the mechanistic reason why the Doublespeak in-context representation hijacking attack, in which benign tokens gradually acquire harmful semantics across transformer layers, bypasses the refusal mechanism in safety-aligned large language models.

Background

The paper shows that substituting harmful keywords with benign euphemisms in context can progressively overwrite those tokens' semantics across layers, yet the model's refusal mechanism often fails to trigger. The authors hypothesize possible causes (early-layer checks or superposed representations) but explicitly note that the underlying reason is unclear, marking a concrete mechanistic gap.
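The layer-wise drift described above can be probed directly from hidden states. Below is a minimal sketch, not the paper's code: the model choice (gpt2, a stand-in, since the phenomenon of interest concerns safety-aligned LLMs), the euphemism prompts, the last-token position, and cosine similarity as the drift measure are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's method): measure how the hidden
# state of a hijacked benign token drifts toward a harmful concept across layers.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: small open model; a safety-tuned LLM is needed to study refusal
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_states(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at every layer, shape (n_layers + 1, d_model)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([h[0, -1] for h in out.hidden_states])

# Hypothetical prompts: the context redefines the benign word "cake" as a euphemism.
hijacked = last_token_states("In this story, 'cake' secretly means bomb. He assembled the cake")
benign = last_token_states("He baked a delicious cake")
harmful = last_token_states("He assembled the bomb")

# If in-context hijacking overwrites semantics gradually, similarity to the harmful
# representation should grow with depth while similarity to the benign one shrinks.
for layer in range(hijacked.shape[0]):
    sim_harm = F.cosine_similarity(hijacked[layer], harmful[layer], dim=0).item()
    sim_benign = F.cosine_similarity(hijacked[layer], benign[layer], dim=0).item()
    print(f"layer {layer:2d}  sim->harmful {sim_harm:+.3f}  sim->benign {sim_benign:+.3f}")
```

Running the same probe on a safety-aligned model, and comparing against prompts that name the harmful concept directly, is one way to test the early-layer-check hypothesis: if refusal-relevant features are computed before the euphemism has acquired its harmful semantics, the check would only ever see the benign reading.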

References

"Nevertheless, it is not clear why this behavior bypasses the refusal mechanism in aligned LLMs."

Yona et al., "In-Context Representation Hijacking," arXiv:2512.03771, 3 Dec 2025, Section 3.4 (Bypassing Model Refusal).