Scope of Anthropic’s monitor definition of “reward hacking” (whether heuristic solutions are included)
Determine the criteria used by Anthropic’s monitor when flagging runs as “reward hacking” in the “reward hacking prone coding tasks v2” evaluation and whether heuristic solutions, as categorized in EvilGenie, are included in that definition.
References
Their monitor flags 14% of the runs as reward hacking, it is unclear how exactly they are defining reward hacking here and whether it includes what we call heuristic solutions.
— EvilGenie: A Reward Hacking Benchmark
(2511.21654 - Gabor et al., 26 Nov 2025) in Section 7 (Related Work and Comparisons), Subsection "Anthropic's system cards"