Verifier hacking under extended training in Trade-R1
Determine whether training the Trade-R1 policy model for longer enables the learned policy to discover strategies that bypass the Retrieval-Augmented Verification protocol with the Triangular Consistency metric (i.e., verifier hacking).
References
Whether longer training might enable the model to discover subtle strategies to bypass the verification protocol (i.e., "verifier hacking") remains an open question.
Source: Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification (arXiv:2601.03948, Sun et al., 7 Jan 2026), in Limitations, Section 5 (Conclusion)