Denominator used in Anthropic’s reported reward‑hacking rates

Determine which denominator underlies the 5% holdout-failure rate and 14% classifier-trigger rate reported for Claude Sonnet 4 on Anthropic’s “reward hacking prone coding tasks v2”: whether each figure is a percentage of only those tasks that passed all visible tests, or a percentage of all tasks, counting those that passed the visible tests but failed the hidden tests.

Background

The EvilGenie paper compares its results to figures reported in Anthropic’s system cards but notes that it is ambiguous how Anthropic normalizes its reported percentages, which hinders direct comparison.

Resolving the denominator ambiguity is necessary to meaningfully align metrics across studies and to understand the relative prevalence of detected reward hacking.
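
To illustrate why the denominator matters, here is a minimal sketch of the two readings. All counts are hypothetical and chosen only for the arithmetic; they are not taken from Anthropic's system cards or from the EvilGenie paper, and the function names are illustrative assumptions.

```python
from fractions import Fraction


def rate_among_visible_passes(flagged: int, passed_visible: int) -> Fraction:
    """Reading A: flagged tasks as a share of tasks that passed all visible tests."""
    if passed_visible <= 0:
        raise ValueError("passed_visible must be positive")
    return Fraction(flagged, passed_visible)


def rate_among_all_tasks(flagged: int, total_tasks: int) -> Fraction:
    """Reading B: flagged tasks as a share of every task in the suite."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    return Fraction(flagged, total_tasks)


# Hypothetical suite: 200 tasks, 160 pass all visible tests,
# 8 of those then fail the hidden (holdout) tests.
total_tasks, passed_visible, flagged = 200, 160, 8

reading_a = rate_among_visible_passes(flagged, passed_visible)
reading_b = rate_among_all_tasks(flagged, total_tasks)

print(f"Reading A: {float(reading_a):.1%}")  # 5.0%
print(f"Reading B: {float(reading_b):.1%}")  # 4.0%

# The two readings differ by exactly the visible-test pass rate.
visible_pass_rate = Fraction(passed_visible, total_tasks)
assert reading_b == reading_a * visible_pass_rate
```

Under these assumed counts the same 8 flagged tasks yield 5% or 4% depending on the reading, so converting one study's figure to the other's convention requires knowing the visible-test pass rate.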

References

It is unclear whether this is a percentage of the tasks which passed all visible tests, or this is the total percentage of tasks which passed the visible tests but failed the hidden tests.

— EvilGenie: A Reward Hacking Benchmark (2511.21654 - Gabor et al., 26 Nov 2025) in Section 7 (Related Work and Comparisons), Subsection "Anthropic's system cards"
