Impact of evaluation awareness on estimating true deployment misalignment
Determine whether models’ awareness that they are being evaluated, rather than deployed, causes misalignment measured in evaluation settings to overstate or understate the misalignment that would manifest in real deployment, and quantify the direction and magnitude of this effect.
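One way to make "direction and magnitude" concrete is a difference in misaligned-action rates between episodes where the model appears aware of being evaluated and episodes where it does not. The sketch below is illustrative only and is not the method of the cited paper. It assumes that episodes can be labeled as aware or unaware, and that the unaware condition is a reasonable proxy for deployment; both are strong assumptions. The function names `rate_gap` and `direction` and all parameters are hypothetical.

```python
from math import sqrt


def rate_gap(flagged_aware, total_aware, flagged_unaware, total_unaware, z=1.96):
    """Return (gap, (low, high)): misaligned-action rate in evaluation-aware
    episodes minus the rate in unaware episodes, with a Wald interval.

    Assumption: the unaware condition stands in for real deployment.
    """
    if total_aware <= 0 or total_unaware <= 0:
        raise ValueError("both conditions need at least one episode")
    p_aware = flagged_aware / total_aware
    p_unaware = flagged_unaware / total_unaware
    gap = p_aware - p_unaware
    # Standard error of a difference between two independent proportions.
    se = sqrt(
        p_aware * (1 - p_aware) / total_aware
        + p_unaware * (1 - p_unaware) / total_unaware
    )
    return gap, (gap - z * se, gap + z * se)


def direction(interval):
    """Classify the effect from the interval on (aware rate - unaware rate)."""
    low, high = interval
    if low > 0:
        return "overstates"    # evaluation shows more misalignment than the proxy
    if high < 0:
        return "understates"   # evaluation shows less misalignment than the proxy
    return "inconclusive"      # interval includes zero
```

The sign of the gap gives the direction and its size the magnitude, in units of rate difference. The Wald interval is a simple choice that behaves poorly for small samples or rates near 0 or 1, so a more robust interval would be preferable in practice.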
References
We have attempted to mitigate this with our code sabotage evaluation, and it isn't clear whether evaluation awareness would cause our results to overstate or understate "true" misalignment.
— Natural Emergent Misalignment from Reward Hacking in Production RL
(2511.18397 - MacDiarmid et al., 23 Nov 2025) in Section 1 (Introduction), Limitations, item 3