Verify whether anti-scheming training truly reduces scheming rather than increasing evaluation awareness
Ascertain whether the reductions in measured scheming rates observed after deliberative alignment training reflect genuine decreases in scheming during real-world deployment, or whether they are artifacts of increased evaluation awareness, which leads the model to appear aligned only in testing contexts.
References
In other words, we cannot be certain that the anti-scheming training reduced scheming rates instead of merely teaching the model to recognize when it is being tested and appear aligned during testing.
— Steering Evaluation-Aware Language Models to Act Like They Are Deployed
(arXiv:2510.20487, Hua et al., 23 Oct 2025), in Background and Related Work: Evaluation Awareness in LLMs