Distinguish activation steering from deployment-like prompting in the simple SDF-only model
Determine whether the effects of activation steering on the simple model organism, the one trained only with synthetic document fine-tuning (SDF), are causally distinct from the effects of prompts that explicitly state the model is deployed, or whether steering in that setting is effectively equivalent to deployment prompting.
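One way to make the comparison concrete is to score the same prompts under each intervention and compare both the aggregate behavior rate and the per-prompt agreement. The sketch below is only an illustration under stated assumptions: the function names and the binary outcome lists are hypothetical placeholders, not data or code from the paper. Matching aggregate rates would not by themselves show that the two interventions are equivalent, so per-prompt disagreement is included as a second, still partial, signal.

```python
def behavior_rate(outcomes):
    """Fraction of prompts on which the deployment-like behavior appeared."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def paired_disagreement(a, b):
    """Fraction of prompts where two interventions give different outcomes."""
    if len(a) != len(b) or not a:
        raise ValueError("outcome lists must be non-empty and cover the same prompts")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical placeholder outcomes (1 = deployment-like behavior observed),
# one entry per prompt, in the same prompt order for both conditions.
steered = [1, 0, 1, 1]
deploy_prompted = [1, 1, 1, 0]

print(behavior_rate(steered), behavior_rate(deploy_prompted))
print(paired_disagreement(steered, deploy_prompted))
```

In this made-up example the two conditions have the same aggregate rate but disagree on half of the prompts, which is the kind of pattern an aggregate comparison alone would miss.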
References
Similar to the experiment in Figure \ref{fig:emojis}, we cannot rule out the alternative hypothesis that activation steering in this case is equivalent to telling the model in the prompt that it is deployed. This is why we developed the Wood Labs model.
— Steering Evaluation-Aware Language Models to Act Like They Are Deployed
(2510.20487 - Hua et al., 23 Oct 2025) in Appendix — Activation Steering Works on Simpler Model Organism