Identify which activation steering vector reflects genuine deployment behavior

Determine, for Anthropic’s Claude Sonnet 4.5, which of the activation-steering vectors that suppress verbalized evaluation awareness, if any, actually elicits the model’s genuine deployment behavior, and verify that the intervention recovers the true deployment policy rather than merely suppressing mentions of evaluation status.

Background

Prior work on Claude Sonnet 4.5 found multiple activation steering vectors that suppress the model’s tendency to verbalize evaluation awareness, but these vectors had mixed effects on measured misalignment rates. This creates ambiguity about whether any given steering intervention is recovering the model’s true deployment behavior or just masking indications of evaluation awareness in its chain-of-thought.
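For readers unfamiliar with the mechanics, the following is a minimal sketch of what applying a steering vector generally means: a scaled direction is added to a layer's hidden activations. It is not the procedure used on Claude Sonnet 4.5 or in the cited paper. The function name `steer`, the toy dimensions, the random "evaluation awareness" direction, and the coefficient `alpha` are all hypothetical, chosen only for illustration.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to every token's hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations, shape (tokens, d_model); assumed sizes
direction = rng.normal(size=8)     # hypothetical "evaluation awareness" direction

# A negative alpha pushes activations away from the direction (suppression).
steered = steer(hidden, direction, alpha=-3.0)

# The projection of every token onto the direction moves by exactly alpha.
unit = direction / np.linalg.norm(direction)
shift = (steered - hidden) @ unit
```

The sketch shows why the ambiguity arises: many different directions can shift the activations enough to remove verbalized evaluation awareness, and nothing in the operation itself indicates which of them, if any, moves the model toward its deployment behavior.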

The paper uses this gap to motivate validation methods: without knowing which steering vector (if any) corresponds to genuine deployment behavior, safety evaluators cannot reliably interpret the results of alignment evaluations on evaluation-aware models.

References

We do not know which one, if any, represents the model's "true" deployment behavior. How can we be sure that our steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed?

— Steering Evaluation-Aware Language Models to Act Like They Are Deployed (2510.20487 - Hua et al., 23 Oct 2025) in Introduction
