Reliable detection of reward hacking without success-rate evaluation
Develop reliable, scalable methods for detecting over-optimization (reward hacking) when large language model agents are trained with reinforcement learning against process reward models, without relying on task success-rate evaluation to assess true performance.
References
An open question remains how to reliably detect over-optimization without evaluating the success rate (which is difficult to scale).
— Process Reward Models for LLM Agents: Practical Framework and Directions
(2502.10325 - Choudhury, 14 Feb 2025) in Section 2.3 (Experiments), paragraph "Question: Can we measure and mitigate reward hacking?" adjacent to Figure "Process Reward Hacking"