Impact of evaluation awareness on estimating true deployment misalignment

Determine whether models’ awareness that they are being evaluated, rather than deployed, causes misalignment measured in evaluation settings to overstate or understate the misalignment that would manifest in real deployment, and quantify the direction and magnitude of this effect.
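One way to make "direction and magnitude" concrete is to compare misalignment rates between transcripts where the model appears to recognize an evaluation and transcripts where it does not. The sketch below is illustrative only and is not the paper's method: the function names, the aware/unaware split, and the counts in the usage example are all hypothetical, and it assumes the "unaware" condition is a reasonable proxy for deployment, which is itself uncertain.

```python
import math

def rate_gap(flagged_aware, n_aware, flagged_unaware, n_unaware, z=1.96):
    """Misalignment-rate gap (aware minus unaware) with a Wald interval.

    Hypothetical helper: inputs are counts of transcripts flagged as
    misaligned and total transcripts in each condition.
    """
    p_aware = flagged_aware / n_aware
    p_unaware = flagged_unaware / n_unaware
    gap = p_aware - p_unaware
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(
        p_aware * (1 - p_aware) / n_aware
        + p_unaware * (1 - p_unaware) / n_unaware
    )
    return gap, (gap - z * se, gap + z * se)

def direction(interval):
    """Label the effect; 'inconclusive' if the interval contains zero."""
    low, high = interval
    if low > 0:
        return "overstates"   # evaluation rate exceeds the deployment proxy
    if high < 0:
        return "understates"  # evaluation rate falls below the deployment proxy
    return "inconclusive"

# Hypothetical counts, for illustration only.
gap, interval = rate_gap(30, 200, 12, 200)
print(f"gap={gap:.3f}, interval=({interval[0]:.3f}, {interval[1]:.3f})")
print(direction(interval))
```

The sign of the gap gives the direction and its size gives the magnitude, but only under the proxy assumption above. A Wald interval is used for brevity; with small samples or rates near zero, a more robust interval would be preferable.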

Background

The paper’s evaluations include both chat-like tests and a realistic Claude Code-based sabotage scenario, but the authors note that models may recognize that these are not real deployment settings. They treat this as a potential disanalogy that could affect the validity of their misalignment measurements.

They explicitly state that it is unclear whether evaluation awareness leads to overestimation or underestimation of “true” misalignment and identify this as a remaining source of uncertainty.

References

We have attempted to mitigate this with our code sabotage evaluation, and it isn't clear whether evaluation awareness would cause our results to overstate or understate "true" misalignment.

— Natural Emergent Misalignment from Reward Hacking in Production RL (2511.18397 - MacDiarmid et al., 23 Nov 2025) in Section 1 (Introduction), Limitations, item 3
