Employing MCTS During RLVR Training

Determine effective methodologies for employing Monte Carlo Tree Search (MCTS) during Reinforcement Learning with Verifiable Rewards (RLVR) training, rather than restricting MCTS to inference-only use, so that systematic exploration can be integrated directly into the training process.

Background

The paper highlights a common limitation in current reasoning systems: structured search techniques such as Monte Carlo Tree Search are typically used only at inference time, while training relies on direct rollouts that provide sparse exploration. This separation contributes to observed training plateaus in RLVR.

In the Related Works section, the authors note that despite MCTS’s success in other domains, its effective use within RLVR training is not yet established. DeepSearch is proposed as a framework that embeds MCTS into RLVR training, motivated by the unresolved question of how to integrate search-based exploration into the learning process.
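To make the contrast with direct rollouts concrete, the sketch below shows one way search-based exploration could supply training data for RLVR: a UCT-style tree search over partial reasoning trajectories, where each verified leaf yields a (trajectory, reward) pair for a subsequent policy update. All names here (`Node`, `mcts_rollouts`, `expand`, `verify`) and the loop structure are illustrative assumptions, not DeepSearch's actual algorithm.

```python
import math
import random

class Node:
    """One partial reasoning trajectory in the search tree."""
    def __init__(self, state, parent=None):
        self.state = state        # e.g. a sequence of reasoning steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # sum of verifiable rewards observed below

    def uct_score(self, c=1.4):
        if self.visits == 0:
            return float("inf")   # explore unvisited children first
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_rollouts(root_state, expand, verify, n_iters=100, c=1.4):
    """Collect (trajectory, reward) pairs via UCT search instead of
    independent direct rollouts; the pairs can then feed an RLVR update."""
    root = Node(root_state)
    training_pairs = []
    for _ in range(n_iters):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.uct_score(c))
        # Expansion: propose candidate next steps (e.g. sampled from the policy).
        for next_state in expand(node.state):
            node.children.append(Node(next_state, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # Verification: score the trajectory with the verifiable reward.
        reward = verify(leaf.state)
        training_pairs.append((leaf.state, reward))
        # Backpropagation: update visit counts and values up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return training_pairs
```

In this framing, the UCT bonus steers rollouts toward under-explored branches, so the reward signal reaching the RLVR objective is denser than what independent direct rollouts would provide; the open question is how to do this effectively at training scale.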

References

Despite the demonstrated potential of MCTS for heuristic exploration, it remains unclear how to effectively employ it during RLVR training.

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search (2509.25454 - Wu et al., 29 Sep 2025) in Appendix, Related Works, Monte-Carlo Tree Search paragraph