Characterize refusal mechanisms at the representation level
Determine how refusal in safety-aligned, transformer-based large language models is implemented in the internal token representations at each layer, going beyond the refusal directions already identified in activation space.
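For context, the activation-space view that this problem asks to go beyond can be sketched roughly as follows. This is a minimal sketch, assuming a per-layer direction estimated as the difference of mean activations between harmful and harmless prompts. The arrays are synthetic stand-ins for cached activations, and the names `refusal_direction` and `ablate_direction` are hypothetical, not taken from any cited codebase.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of class means for one layer.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic data (assumption): random activations with a planted
# shift along axis 0 for the "harmful" prompts.
rng = np.random.default_rng(0)
n_layers, n_prompts, d_model = 4, 32, 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(n_layers, n_prompts, d_model))
harmful = rng.normal(size=(n_layers, n_prompts, d_model)) + 3.0 * planted

# One candidate direction per layer.
directions = np.stack(
    [refusal_direction(harmful[l], harmless[l]) for l in range(n_layers)]
)

# After ablation, activations have (numerically) no component left
# along the estimated direction.
ablated = ablate_direction(harmful[0], directions[0])
residual = np.abs(ablated @ directions[0]).max()
```

A single vector per layer says where refusal-relevant activations differ on average, but not how individual token representations encode or change that signal, which is the gap the problem statement points to.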
References
While recent works identified refusal directions that emerge in the activation space \citep{arditi2024refusal}, it remains unclear how these refusal mechanisms operate in terms of the underlying representations. This is especially crucial since those representations can be changed in-context as the result of user prompts \citep{park2025iclr}.
— In-Context Representation Hijacking
(2512.03771 - Yona et al., 3 Dec 2025) in Section 1, Introduction