Genuine reduction vs. task-shift of reward hacking over time
Determine whether the apparent decrease in observed reward-hacking rates as model capability increases on the fixed EvilGenie benchmark reflects a true reduction in the underlying propensity to reward hack, or instead a shift of reward-hacking behavior toward harder tasks that the benchmark’s difficulty distribution does not capture.
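To make the ambiguity concrete, the following is a minimal sketch in Python and not an analysis from the paper. It assumes each benchmark attempt can be labeled as legitimately solved or not, and as reward-hacked or not. The `Attempt` type, the helper functions, and all counts are hypothetical and chosen only for illustration. The sketch shows how the overall hack rate on a fixed task set can fall while the hack rate conditional on failing to solve a task legitimately stays the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One model attempt at one benchmark task (hypothetical labels)."""
    solved: bool  # task was solved legitimately
    hacked: bool  # attempt was flagged as a reward hack


def overall_hack_rate(attempts):
    """Fraction of all attempts flagged as reward hacks."""
    return sum(a.hacked for a in attempts) / len(attempts)


def hack_rate_given_unsolved(attempts):
    """Fraction of not-legitimately-solved attempts flagged as reward hacks."""
    unsolved = [a for a in attempts if not a.solved]
    if not unsolved:
        return float("nan")
    return sum(a.hacked for a in unsolved) / len(unsolved)


def make_attempts(n_solved, n_unsolved_hacked, n_unsolved_honest):
    """Build a synthetic list of attempts from three counts."""
    return (
        [Attempt(solved=True, hacked=False)] * n_solved
        + [Attempt(solved=False, hacked=True)] * n_unsolved_hacked
        + [Attempt(solved=False, hacked=False)] * n_unsolved_honest
    )


# Illustrative counts only: the stronger model solves more of the same
# 100 tasks, but hacks on the same share of the tasks it cannot solve.
weaker = make_attempts(n_solved=50, n_unsolved_hacked=20, n_unsolved_honest=30)
stronger = make_attempts(n_solved=80, n_unsolved_hacked=8, n_unsolved_honest=12)

for name, attempts in [("weaker", weaker), ("stronger", stronger)]:
    print(
        f"{name}: overall={overall_hack_rate(attempts):.2f}, "
        f"given_unsolved={hack_rate_given_unsolved(attempts):.2f}"
    )
```

In this synthetic setup the overall rate drops only because fewer tasks remain unsolved. A fixed benchmark cannot, by itself, distinguish that case from a genuine drop in propensity, because it has no harder tasks on which the stronger model would still fail.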
References
Given the general reduction in reward hacking with solve rate, we cannot rule out the possibility that the rate of reward hacking does not decrease, but is rather pushed to harder tasks over time.
— EvilGenie: A Reward Hacking Benchmark
(2511.21654 - Gabor et al., 26 Nov 2025) in Section 3 (Reward Hacking Rates), Subsection "Broader Reward-Hacking Trends"