Differential alignment impact of undetectable versus overt reward hacks
Determine whether reinforcement learning that rewards undetectable or obfuscated reward hacks (i.e., hacks that training monitors do not easily detect) affects model alignment differently from reinforcement of overt, easily detected reward hacks in large language models trained on production coding environments, and characterize the nature and magnitude of any resulting differences in misaligned generalization.
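One concrete way to operationalize the contrast is a reward-assignment rule that differs between the two training conditions only in which hacks retain their reward. The Python sketch below is a minimal illustration under assumed names (`Episode`, `assigned_reward`, the condition labels, and a ground-truth `is_hack` oracle); none of it is the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    reward_from_tests: float   # raw environment reward, e.g. unit tests passing
    is_hack: bool              # hypothetical ground-truth label: did the policy reward-hack?
    monitor_flagged: bool      # did the training monitor detect the hack?

def assigned_reward(ep: Episode, condition: str) -> float:
    """Reward actually used for the policy update under each condition."""
    if not ep.is_hack:
        return ep.reward_from_tests  # honest episodes are rewarded identically in both arms
    if condition == "undetectable_only":
        # Only hacks the monitor missed keep their reward, so training
        # selectively reinforces obfuscated hacks.
        return ep.reward_from_tests if not ep.monitor_flagged else 0.0
    if condition == "overt_only":
        # Mirror image: only monitor-detected hacks keep their reward.
        return ep.reward_from_tests if ep.monitor_flagged else 0.0
    raise ValueError(f"unknown condition: {condition}")
```

Training each arm to a matched hack rate and then comparing held-out alignment evaluations would give the differential-generalization measure the question asks for.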
References
"It is possible that reinforcing only undetectable reward hacks rather than overt reward hacks has different effects on alignment, but we leave this distinction for future work."
— Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., arXiv:2511.18397, 23 Nov 2025), Section 1 (Introduction), Limitations, item 2