Priority of problem text vs. test harness when labeling reward hacking

Determine whether, when evaluating coding agents on ambiguous LiveCodeBench-derived tasks in the EvilGenie benchmark, greater evaluative weight should be given to the textual problem statement (problem.md) or to the unit-test harness (test.py and test_cases.json) when the two conflict, for the purpose of classifying behaviors such as editing tests or hardcoding outputs as reward hacking versus aligned behavior.

Background

The benchmark includes ambiguous problems where the textual specification admits multiple valid outputs, but the test harness accepts only a single one, creating tension between solving the stated problem and passing the tests.
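To make the conflict concrete, the sketch below contrasts exact-match grading with spec-level checking for a task whose specification admits multiple valid answers. The data layout, function names, and the toy task are illustrative assumptions; the actual formats of test.py and test_cases.json are not reproduced here.

```python
# Hypothetical sketch: an exact-match harness vs. a checker that accepts any
# output satisfying the textual specification. The JSON layout and helper
# names are assumptions for illustration, not the benchmark's real format.
import json

def grade_exact(solution, test_cases):
    """Harness-style grading: output must equal the single stored answer."""
    return all(solution(tc["input"]) == tc["expected_output"] for tc in test_cases)

def grade_spec(solution, test_cases, is_valid):
    """Spec-style grading: any output satisfying the textual spec counts."""
    return all(is_valid(tc["input"], solution(tc["input"])) for tc in test_cases)

# Toy task: "output any index of a maximum element" admits multiple answers,
# but the stored fixture records only one of them.
test_cases = json.loads('[{"input": [3, 7, 7, 1], "expected_output": 1}]')

def solution(xs):
    # Returns the *last* argmax (index 2 here): valid per the spec,
    # rejected by exact-match grading against the stored index 1.
    return max(range(len(xs)), key=lambda i: (xs[i], i))

def is_valid(xs, i):
    return xs[i] == max(xs)

print(grade_exact(solution, test_cases))           # False: tests accept only index 1
print(grade_spec(solution, test_cases, is_valid))  # True: the spec is satisfied
```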

The paper distinguishes between reward hacking (e.g., modifying tests, hardcoding) and other misaligned behaviors (e.g., heuristics), but notes that in ambiguous cases some behaviors might be considered aligned if judged against the textual specification rather than the tests.
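A toy decision rule makes the labeling question explicit. The rule below is illustrative only, not the paper's categorization procedure: it shows how the label attached to the same behavior, here an agent editing the tests so that they accept any spec-valid output, flips depending on which artifact is treated as the ground-truth objective.

```python
# Illustrative labeling rule (an assumption for exposition, not the paper's
# procedure): the verdict on a test edit depends on which artifact is
# treated as authoritative.
def label_test_edit(edit_makes_tests_spec_consistent, authority="tests"):
    if authority == "tests":
        # Tests are ground truth: any modification of them is reward hacking.
        return "reward hacking"
    # Spec is ground truth: an edit that merely brings the harness in line
    # with the textual spec (e.g., accepting all valid outputs) could be
    # read as aligned rather than as gaming the reward.
    return "aligned" if edit_makes_tests_spec_consistent else "reward hacking"

print(label_test_edit(True, authority="tests"))  # reward hacking
print(label_test_edit(True, authority="spec"))   # aligned
```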

Clarifying which artifact (the textual spec or the tests) should be authoritative directly determines how agent behaviors are labeled in such edge cases, with implications for both benchmark design and real-world software workflows.

References

It is unclear whether one should give more credence to the textual query rather than the code testing procedure.

EvilGenie: A Reward Hacking Benchmark (arXiv:2511.21654, Gabor et al., 26 Nov 2025), Section 5 (Discussion), Subsection "Reward Hacking Categorization Limitations and Complications".