Explain why representation hijacking bypasses refusal mechanisms
Determine the mechanistic reason why Doublespeak in-context representation hijacking, in which benign tokens gradually acquire harmful semantics across transformer layers, bypasses the refusal mechanism in safety-aligned large language models.
References
Nevertheless, it is not clear why this behavior bypasses the refusal mechanism in aligned LLMs.
— In-Context Representation Hijacking
(2512.03771 - Yona et al., 3 Dec 2025) in Section 3.4, Bypassing Model Refusal