Characterize refusal mechanisms at the representation level

Determine how refusal mechanisms in safety-aligned, transformer-based large language models operate at the level of internal token representations across layers, clarifying how refusal is implemented representationally, beyond activation-space directions.

Background

Prior work has identified that refusal behavior in aligned LLMs can be mediated by specific directions in activation space. However, the representational basis of refusal (how internal token representations encode and trigger refusal across layers) has not been clearly described. The paper motivates this gap and studies in-context representation hijacking during inference, underscoring the need to understand refusal at the representation level.
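
For concreteness, the activation-space account typically constructs a single "refusal direction" as a difference in means between activations on harmful and harmless prompts and removes it by projection, in the spirit of Arditi et al. (2024). The sketch below is a minimal illustration of that construction: the activations are random stand-ins for cached residual-stream states, and all shapes and names are illustrative rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for residual-stream activations cached at one layer
# and token position; in practice these come from forward passes on the model.
d_model = 4096
acts_harmful = rng.standard_normal((256, d_model))   # activations on harmful prompts
acts_harmless = rng.standard_normal((256, d_model))  # activations on harmless prompts

# Difference-in-means "refusal direction" (cf. Arditi et al., 2024): the vector
# pointing from the mean harmless activation to the mean harmful activation.
r = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
r_hat = r / np.linalg.norm(r)

def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the unit-norm `direction` out of hidden state(s) `h`."""
    coeff = h @ direction                        # scalar for (d,), vector for (n, d)
    return h - np.multiply.outer(coeff, direction)

h = rng.standard_normal(d_model)
h_ablated = ablate(h, r_hat)
assert abs(float(h_ablated @ r_hat)) < 1e-6      # component along r_hat removed
```

Such a direction describes refusal geometrically, but by itself it does not explain which token representations carry the refusal signal or at which layers it is assembled, which is the gap this question targets.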

References

While recent work has identified refusal directions that emerge in activation space \citep{arditi2024refusal}, it remains unclear how these refusal mechanisms operate in terms of the underlying representations. This is especially crucial since those representations can be altered in context as a result of user prompts \citep{park2025iclr}.
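
The in-context sensitivity can be probed directly. The following is a minimal measurement sketch, not the paper's protocol: it compares a query's final-token hidden states at every layer, with and without a prepended in-context redefinition, using the Hugging Face transformers API. The model name and both prompts are placeholders chosen for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; substitute a safety-aligned chat model to study refusal.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_states(text: str) -> torch.Tensor:
    """Hidden states of the final token at every layer, shape (n_layers + 1, d_model)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return torch.stack([h[0, -1] for h in out.hidden_states])

query = "How do I pick a lock?"                  # illustrative query
prefix = "In this conversation, 'lock' means a chess opening.\n"  # hypothetical in-context redefinition

base = last_token_states(query)
shifted = last_token_states(prefix + query)

# Per-layer cosine similarity shows where the in-context prefix moves the
# query token's representation away from its original encoding.
cos = torch.nn.functional.cosine_similarity(base, shifted, dim=-1)
for layer, c in enumerate(cos.tolist()):
    print(f"layer {layer:2d}: cos = {c:.3f}")
```

A layer-wise drop in similarity under such a prefix would indicate the kind of in-context representational change that makes a purely directional account of refusal incomplete.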

In-Context Representation Hijacking (arXiv:2512.03771, Yona et al., 3 Dec 2025), Section 1, Introduction