Identify which activation-steering vector reflects genuine deployment behavior
Determine, for Anthropic’s Claude Sonnet 4.5, which of the activation-steering vectors that suppress verbalized evaluation awareness, if any, elicits the model’s genuine deployment behavior, and verify that the intervention recovers the true deployment policy rather than merely suppressing mentions of evaluation status.
References
We do not know which one, if any, represents the model's "true" deployment behavior. How can we be sure that our steering methods actually elicited the deployment behavior, as opposed to only suppressing verbalizations of being deployed?
— Steering Evaluation-Aware Language Models to Act Like They Are Deployed
(2510.20487 - Hua et al., 23 Oct 2025) in Introduction