- The paper introduces TreeRL, which integrates on-policy tree search into LLM reinforcement learning to provide process-based supervision over entire reasoning paths.
- Process rewards are derived directly from the search tree, removing the need to train a separate reward model, while an entropy-guided strategy (EPTree) branches from high-uncertainty tokens for efficient exploration.
- Experiments on math and code reasoning benchmarks show consistent gains over conventional chain-sampling RL baselines.
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
In this paper, the authors introduce TreeRL, a reinforcement learning (RL) framework that enhances LLMs using on-policy tree search as its central mechanism. Conventional RL approaches typically sample independent chains of reasoning and supervise them only with outcome-based rewards; while effective, this overlooks process-based supervision along the intermediate reasoning path, which can substantially benefit RL on reasoning tasks. TreeRL instead uses a novel on-policy tree search to derive fine-grained process rewards without training a separate reward model.
Technical Contributions
The framework rests on two main components:
- On-Policy Tree Search Integration: TreeRL runs on-policy tree search during RL training, exposing intermediate states for better exploration of reasoning paths. Because the tree is generated by the model's current policy, the process rewards read off it stay consistent with what the model actually produces, sidestepping the distribution mismatch and susceptibility to reward hacking that plague separately trained reward models. A sketch of how rewards can be derived from tree statistics appears after this list.
- Entropy-Guided Tree Search: To make search affordable under a given token budget, the paper introduces EPTree. Rather than branching at random, EPTree branches from high-uncertainty intermediate tokens, where the model's next-token entropy is largest. Branching where the policy is most unsure yields a higher PassRate, i.e., a better chance of producing a correct answer for the same fixed compute; a branching sketch also follows below.
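To make the first component concrete, here is a minimal Python sketch of deriving process rewards directly from tree statistics: an intermediate node's value is estimated as the fraction of correct final answers among the leaves below it, and each step is scored by how it improves that value. The `Node` structure and the 50/50 mix of global and local advantage are illustrative assumptions, not necessarily the paper's exact formulation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One reasoning step in the search tree."""
    children: list["Node"] = field(default_factory=list)
    is_correct: bool | None = None   # set on leaves by the answer verifier
    reward: float = 0.0              # filled in by assign_process_rewards

def leaf_outcomes(node: Node) -> list[bool]:
    """Outcome correctness of every leaf at or below this node."""
    if not node.children:
        return [bool(node.is_correct)]
    out: list[bool] = []
    for child in node.children:
        out += leaf_outcomes(child)
    return out

def value(node: Node) -> float:
    """Value estimate: fraction of correct final answers reachable from here.
    Recomputed for clarity; a real implementation would cache this."""
    outcomes = leaf_outcomes(node)
    return sum(outcomes) / len(outcomes)

def assign_process_rewards(root: Node) -> None:
    """Score each step by a mix of global advantage (vs. the root) and
    local advantage (vs. its parent) -- an assumed weighting, shown here
    only to illustrate reward-model-free process supervision."""
    v_root = value(root)

    def walk(node: Node) -> None:
        v_parent = value(node)
        for child in node.children:
            v_child = value(child)
            child.reward = 0.5 * ((v_child - v_root) + (v_child - v_parent))
            walk(child)

    walk(root)
```

Because every quantity comes from leaves sampled by the current policy, no separate reward model is needed: the verifier labels the leaves, and values propagate up the tree.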
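The entropy-guided branching of the second component can be sketched as follows: compute per-token entropy along a sampled trajectory, then re-decode from the highest-entropy prefixes. The `min_gap` spacing heuristic and the `sample` API in the usage note are assumptions for illustration; the paper's actual selection rule may differ.

```python
import math

def token_entropies(logprobs: list[dict[str, float]]) -> list[float]:
    """Per-position Shannon entropy from {token: logprob} maps
    (e.g. top-k logprobs from a sampling API); restricting to top-k
    makes this an approximation of the true entropy."""
    entropies = []
    for dist in logprobs:
        probs = [math.exp(lp) for lp in dist.values()]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return entropies

def pick_fork_points(entropies: list[float], num_forks: int,
                     min_gap: int = 16) -> list[int]:
    """Greedily pick the highest-entropy positions to branch from,
    keeping forks at least `min_gap` tokens apart (an assumed spacing
    heuristic so branches are not near-duplicates)."""
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    chosen: list[int] = []
    for i in ranked:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == num_forks:
            break
    return sorted(chosen)

# Usage with a hypothetical `sample` API returning generated tokens
# plus per-token top-k logprobs:
#   tokens, logprobs = sample(model, prompt)
#   for pos in pick_fork_points(token_entropies(logprobs), num_forks=3):
#       branch, _ = sample(model, prompt + tokens[:pos])  # re-decode at fork
```

Spending the fixed token budget at the points where the policy is least certain is what drives the PassRate gains reported for EPTree.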
Empirical Evaluation
Experiments on diverse math and code reasoning benchmarks show that TreeRL outperforms conventional ChainRL methods. In particular, the entropy-guided tree search yields a marked improvement, exploring more efficiently within the same token budget and thereby providing denser supervision for the model's reasoning.
Implications and Future Work
This research has both practical and theoretical implications. Practically, TreeRL can advance the development of LLMs in areas that demand complex reasoning, such as mathematical problem-solving and code synthesis, by showing how RL training can exploit more informed state exploration.
Theoretically, the paper offers insight into combining RL with tree search principles long associated with complex decision-making systems such as AlphaZero, here applied to LLM reasoning. This suggests further integrations of search and learning to improve model alignment and reasoning capabilities.
Future work could optimize tree search algorithms for scalability in LLM training, examine reinforcement learning configurations beyond the entropy-guided strategy presented here, and test the approach in real-world applications where reasoning paths are crucial for decision-making, including autonomous systems and AI-driven analytics.
In conclusion, TreeRL makes a compelling case for adopting tree search in LLM reinforcement learning, enriching models' reasoning capabilities through denser, on-policy process supervision.