Definition scope of Anthropic’s monitor for “reward hacking” (heuristics included or not)

Determine the criteria Anthropic’s monitor uses to flag runs as “reward hacking” in the “reward hacking prone coding tasks v2” evaluation, and whether that definition includes heuristic solutions as categorized in EvilGenie.

Background

The EvilGenie authors note that Anthropic reports a 14% monitor-trigger rate, but that it is unclear whether this rate includes heuristic solutions, meaning solutions that pass the tests without fully solving the problem. EvilGenie classifies such solutions separately from explicit reward hacking.

Clarifying whether heuristic solutions are counted as reward hacking affects cross-benchmark comparisons and informs how model behaviors should be categorized for monitoring and mitigation.

References

Their monitor flags 14% of the runs as reward hacking, it is unclear how exactly they are defining reward hacking here and whether it includes what we call heuristic solutions.

— EvilGenie: A Reward Hacking Benchmark (2511.21654 - Gabor et al., 26 Nov 2025) in Section 7 (Related Work and Comparisons), Subsection "Anthropic's system cards"
