Distinguish activation steering from deployment-like prompting in the simple SDF-only model

Determine whether activation steering effects observed on the simple synthetic-document-fine-tuned model organism are causally distinct from those produced by prompt-based manipulations that explicitly state deployment status, or whether activation steering is effectively equivalent to deployment prompting in that setting.

Background

In early experiments with a simpler model organism trained only via synthetic document fine-tuning (SDF), activation steering appeared to shift behavior in much the same way as stating in the prompt that the model is deployed. This raised the concern that steering might merely replicate the effect of prompting.

The authors state this alternative explanation explicitly and cite it as the motivation for introducing the Wood Labs evaluation cue. The cue creates conditions where prompting alone fails, which allows a more robust test of whether steering changes the model’s internal belief about being deployed versus evaluated.
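To make concrete what "activation steering" means in contrast to editing the prompt, here is a minimal toy sketch. It assumes a difference-of-means steering direction built from two sets of hidden activations, which is a common construction and not necessarily the exact procedure used in the paper. The function names, the scale `alpha`, and the three-dimensional toy activations are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(deploy_acts: List[Vector], eval_acts: List[Vector]) -> Vector:
    """Difference of means: mean 'deployment' activation minus mean 'evaluation' activation.

    Assumption: this is one standard way to build a steering direction;
    the source text does not specify the construction.
    """
    mu_deploy = mean_vector(deploy_acts)
    mu_eval = mean_vector(eval_acts)
    return [d - e for d, e in zip(mu_deploy, mu_eval)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled steering direction to a hidden state; the prompt is left unchanged."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


# Hypothetical toy activations (3 dimensions, 2 examples per condition).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(deploy_acts, eval_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

The point of the sketch is that steering intervenes on internal activations, whereas deployment prompting changes only the input text. The open question is whether the two interventions are behaviorally equivalent in the SDF-only model.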

References

Similar to the experiment in Figure \ref{fig:emojis}, we cannot rule out the alternative hypothesis that activation steering in this case is equivalent to telling the model in the prompt that it is deployed. This is why we developed the Wood Labs model.

— Steering Evaluation-Aware Language Models to Act Like They Are Deployed (2510.20487 - Hua et al., 23 Oct 2025) in Appendix — Activation Steering Works on Simpler Model Organism
