Verifier hacking under extended training in Trade-R1

Determine whether extending the training duration of the Trade-R1 policy model enables the learned policy to discover strategies that bypass the Retrieval-Augmented Verification protocol and its Triangular Consistency metric (i.e., verifier hacking).

Background

Trade-R1 introduces a Retrieval-Augmented Verification protocol with a Triangular Consistency metric that gates stochastic market rewards by checking pairwise alignment among the retrieved evidence, the model's reasoning chain, and the final decision. Training in the reported experiments was stopped at a predefined step because of computational budget constraints, rather than being run to full convergence.
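
To make the gating idea concrete, the sketch below shows a minimal triangular-consistency reward gate. It is an illustrative assumption, not the paper's implementation: the embed() placeholder encoder, the cosine-similarity alignment score, and the 0.5 threshold are all hypothetical stand-ins for whatever alignment measure and cutoff Trade-R1 actually uses.

```python
import hashlib
from itertools import combinations
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical text encoder; any sentence encoder could be substituted."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=128)
    return v / np.linalg.norm(v)

def pairwise_consistency(a: str, b: str) -> float:
    """Cosine similarity as a stand-in for an alignment/entailment score."""
    return float(embed(a) @ embed(b))

def triangular_consistency(evidence: str, reasoning: str, decision: str) -> float:
    """Weakest pairwise alignment among evidence, reasoning chain, and decision."""
    parts = [evidence, reasoning, decision]
    return min(pairwise_consistency(x, y) for x, y in combinations(parts, 2))

def gated_reward(market_reward: float, evidence: str, reasoning: str,
                 decision: str, threshold: float = 0.5) -> float:
    """Pass the stochastic market reward through only when all three pairwise
    consistencies clear the threshold; otherwise withhold it."""
    ok = triangular_consistency(evidence, reasoning, decision) >= threshold
    return market_reward if ok else 0.0
```

The point of the gate is that the policy only collects the (noisy) market reward when its stated evidence, reasoning, and action agree with one another, which is exactly the constraint a hacking policy would need to subvert.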

The authors caution that longer training might allow the model to learn to circumvent the verification protocol—a potential failure mode they term "verifier hacking." Establishing whether this occurs under extended training is thus an explicit unresolved question.
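
One way to probe for this failure mode during extended training is to watch for divergence between how often decisions pass the verifier and how well they actually perform in the market. The sketch below is a hedged diagnostic under that assumption; the StepLog fields, window size, and divergence margin are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class StepLog:
    step: int
    verifier_pass_rate: float   # fraction of decisions passing triangular consistency
    realized_return: float      # average market outcome of gated decisions

def hacking_signal(history: list[StepLog], window: int = 100, margin: float = 0.1) -> bool:
    """Flag runs where the verifier pass rate keeps rising while realized
    market returns do not, a plausible symptom of verifier hacking."""
    if len(history) < 2 * window:
        return False
    early, late = history[:window], history[-window:]
    mean = lambda xs: sum(xs) / len(xs)
    pass_gain = mean([s.verifier_pass_rate for s in late]) - mean([s.verifier_pass_rate for s in early])
    return_gain = mean([s.realized_return for s in late]) - mean([s.realized_return for s in early])
    return pass_gain > margin and return_gain <= 0.0
```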

References

Whether longer training might enable the model to discover subtle strategies to bypass the verification protocol (i.e., "verifier hacking") remains an open question.

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification (2601.03948 - Sun et al., 7 Jan 2026) in Limitations, Section 5 (Conclusion)