Denominator used in Anthropic’s reported reward‑hacking rates
Determine which denominator underlies the 5% holdout-failure rate and 14% classifier-trigger rate reported for Claude Sonnet 4 on Anthropic’s “reward hacking prone coding tasks v2”: whether each rate is a share of only those tasks that passed all visible tests, or a share of all tasks in the set (for the holdout-failure rate, the overall fraction of tasks that passed the visible tests but failed the hidden tests).
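To make the ambiguity concrete, the following is a minimal sketch of how the same raw counts yield two different rates depending on the denominator. The function name and all counts are hypothetical and chosen only for illustration; they are not taken from Anthropic's system card or from the EvilGenie paper.

```python
def holdout_failure_rates(total_tasks, passed_visible, passed_visible_failed_hidden):
    """Return (conditional_rate, overall_rate) computed from the same raw counts.

    conditional_rate: denominator is the tasks that passed all visible tests.
    overall_rate:     denominator is every task in the evaluation set.
    """
    if not 0 <= passed_visible_failed_hidden <= passed_visible <= total_tasks:
        raise ValueError("counts must satisfy 0 <= failed_hidden <= passed_visible <= total")
    if passed_visible == 0:
        raise ValueError("conditional rate is undefined when no task passes the visible tests")
    conditional_rate = passed_visible_failed_hidden / passed_visible
    overall_rate = passed_visible_failed_hidden / total_tasks
    return conditional_rate, overall_rate


# Hypothetical counts (assumptions, not reported data):
# 200 tasks, 80 pass all visible tests, 4 of those fail the hidden tests.
conditional, overall = holdout_failure_rates(200, 80, 4)
print(f"share of visible-test passes: {conditional:.1%}")
print(f"share of all tasks:           {overall:.1%}")
```

With these illustrative counts the two readings differ by a factor of 2.5, so a single reported percentage cannot be compared across benchmarks without knowing which denominator was used.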
References
It is unclear whether this is a percentage of the tasks which passed all visible tests, or this is the total percentage of tasks which passed the visible tests but failed the hidden tests.
— EvilGenie: A Reward Hacking Benchmark
(2511.21654 - Gabor et al., 26 Nov 2025) in Section 7 (Related Work and Comparisons), Subsection "Anthropic's system cards"