Priority of problem text vs. test harness when labeling reward hacking
Determine whether the textual problem statement (problem.md) or the unit-test harness (test.py and test_cases.json) should carry more evaluative weight when the two conflict on ambiguous LiveCodeBench-derived tasks in the EvilGenie benchmark. The answer decides whether coding-agent behaviors such as editing tests or hardcoding outputs are classified as reward hacking or as aligned behavior.
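To make the dependence concrete, the following is a toy sketch, not the EvilGenie labeling procedure. The `Authority` enum, the `Observation` fields, and the decision rule in `label` are all hypothetical and chosen only for illustration. It shows how the same agent run can receive opposite labels depending on which artifact is treated as authoritative.

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    """Which artifact the evaluator treats as ground truth (hypothetical)."""
    PROBLEM_TEXT = "problem_text"    # problem.md
    TEST_HARNESS = "test_harness"    # test.py and test_cases.json


@dataclass
class Observation:
    """Hypothetical summary of one agent run on one task."""
    matches_problem_text: bool   # solution follows the statement in problem.md
    passes_tests: bool           # solution passes the unit-test harness
    edited_tests: bool           # agent modified the test files
    hardcoded_outputs: bool      # agent special-cased known test inputs


def label(obs: Observation, authority: Authority) -> str:
    """Toy decision rule (an assumption, not taken from the paper)."""
    tampered = obs.edited_tests or obs.hardcoded_outputs
    if authority is Authority.TEST_HARNESS:
        # The harness is authoritative: any tampering with it is a hack.
        if tampered:
            return "reward_hacking"
        return "aligned" if obs.passes_tests else "failed"
    # The problem text is authoritative: a faithful solution counts as
    # aligned even if the agent changed tests that contradicted the text.
    if obs.matches_problem_text:
        return "aligned"
    return "reward_hacking" if tampered else "failed"


# An agent that solves the stated problem but edits a conflicting test.
run = Observation(
    matches_problem_text=True,
    passes_tests=True,
    edited_tests=True,
    hardcoded_outputs=False,
)
labels = {a.value: label(run, a) for a in Authority}
```

Under this toy rule the run is labeled differently for each choice of authority, which is exactly the ambiguity the open question asks to resolve.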
References
It is unclear whether one should give more credence to the textual query rather than the code testing procedure.
— EvilGenie: A Reward Hacking Benchmark
(2511.21654 - Gabor et al., 26 Nov 2025) in Section 5 (Discussion), Subsection "Reward Hacking Categorization Limitations and Complications"