TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling (2508.17445v1)

Published 24 Aug 2025 in cs.LG and cs.CL

Abstract: Recent advancements in aligning LLMs via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and the efficiency savings of the sampling design, from 22% up to 43% of GPU hours for the trained models, while showing up to a 40% reduction at trajectory-level and 35% at token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.


Summary

  • The paper presents a novel RL framework that reformulates LLM sequence generation as a tree-structured search, reducing redundant computation.
  • It leverages segment-level sampling and hierarchical advantage estimation to boost trajectory sampling speed by up to 40% and improve overall performance.
  • TreePO’s heuristic branching and dynamic pruning mechanisms balance exploration with computational efficiency, enabling scalable training for long-horizon reasoning.

TreePO: Heuristic Tree-based Policy Optimization for Efficient and Effective LLM Reasoning

Introduction

TreePO is a reinforcement learning (RL) framework for LLMs that addresses the dual challenges of computational inefficiency and limited exploration in complex reasoning tasks. The method reconceptualizes sequence generation as a tree-structured search, leveraging shared prefixes and dynamic branching to amortize computation and enhance exploration diversity. TreePO integrates segment-level sampling, early stopping, and a hierarchical advantage estimator, enabling more precise credit assignment and efficient training from base models without prior supervised fine-tuning.

Tree-based Sampling: Algorithmic Design and Empirical Motivation

Standard RL rollouts for LLMs generate multiple independent trajectories per query, leading to redundant computation and inefficient use of KV caches. Empirical analysis reveals that stochastic rollouts from the same prompt share extensive reasoning prefixes, motivating a tree-structured approach to sequence generation (Figure 1).

Figure 1: Multiple sampled trajectories from the same prompt, with shared reasoning segments highlighted; key problem-solving steps are consistently reproduced despite stochasticity.

TreePO formalizes the rollout process as a tree, where each node represents a segment of reasoning and branches correspond to divergent continuations. The algorithm maintains a prompt queue, dynamically forks branches based on heuristic policies, and prunes low-value paths via early stopping. Branching budgets and fallback mechanisms are designed to balance exploration and computational efficiency, with segment-level control enabling fine-grained management of the search space (Figure 2).

Figure 2: Validation performance curves and demonstration of TreePO sampling; tree-based sampling stabilizes training and amortizes computation across shared prefixes.
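
To make the rollout procedure concrete, the following sketch is one minimal way the segment-wise sampling loop could be organized, assuming a stubbed decoding interface. It is not the authors' implementation: `generate_segment`, `should_branch`, and parameters such as `segment_len`, `fanout`, and `branch_budget` are hypothetical placeholders for the policy decoder and the uncertainty-driven branching heuristic.

```python
import random
from dataclasses import dataclass

# Toy stand-ins for the policy's decoding interface; a real system would call a
# batched inference engine. The signatures and thresholds here are assumptions.
def generate_segment(prefix_tokens, segment_len):
    """Sample one fixed-length segment; return (tokens, mean_logprob, finished)."""
    tokens = [random.randrange(1000) for _ in range(segment_len)]
    mean_logprob = -3.0 * random.random()      # stand-in for per-token confidence
    finished = random.random() < 0.25          # stand-in for reaching EOS
    return tokens, mean_logprob, finished

def should_branch(mean_logprob, threshold=-1.0):
    """Uncertainty heuristic: fork extra branches when the segment's mean
    log-probability is low."""
    return mean_logprob < threshold

@dataclass
class Node:
    prefix: list      # full token prefix, including this node's segment
    depth: int
    finished: bool

def tree_rollout(prompt_tokens, segment_len=64, max_depth=8,
                 branch_budget=16, fanout=2):
    """Segment-wise tree sampling: expand a frontier of partial trajectories,
    reusing shared prefixes and forking where the policy is uncertain; paths
    end by early stop (EOS), depth cap, or budget exhaustion."""
    frontier = [Node(prefix=list(prompt_tokens), depth=0, finished=False)]
    completed = []
    while frontier and branch_budget > 0:
        node = frontier.pop(0)
        # Draw one continuation; optionally fork more on high local uncertainty.
        samples = [generate_segment(node.prefix, segment_len)]
        if should_branch(samples[0][1]) and branch_budget >= fanout:
            samples += [generate_segment(node.prefix, segment_len)
                        for _ in range(fanout - 1)]
        for tokens, _, finished in samples:
            branch_budget -= 1
            child = Node(node.prefix + tokens, node.depth + 1, finished)
            if finished or child.depth >= max_depth:
                completed.append(child.prefix)
            else:
                frontier.append(child)
    return completed

# Toy usage: roll out a small tree from a dummy prompt.
print(len(tree_rollout(prompt_tokens=[1, 2, 3], segment_len=8)))
```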

Hierarchical Advantage Estimation

TreePO introduces a segment-level, tree-based advantage estimator, extending beyond MCTS-like parent-child credit assignment. Each trajectory is decomposed into segments, and subgroups are defined by shared predecessor nodes at each tree depth. The advantage for a trajectory is computed as the mean-pooled, variance-normalized reward difference within each subgroup, aggregating hierarchical credit signals (Figure 3).

Figure 3: TreePO advantage estimation; sub-group advantages are calculated for each node, enabling robust credit assignment based on collective descendant outcomes.

Empirical studies demonstrate that simple averaging across subgroups yields higher accuracy and more stable entropy than subgroup-size weighting, which overemphasizes large, easy subgroups. Dynamic rejection sampling at the subgroup level degrades performance, indicating that extreme subgroups provide valuable calibration. Token-aligned segments are critical for stable optimization; misaligned fallback inflates response length and reduces accuracy (Figure 4).

Figure 4: Study on the terms in TreePO advantage; subgroup-size weighted aggregation is compared to simple averaging, revealing superior stability and accuracy for the latter.
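
Under the assumption that each completed trajectory carries a single scalar reward and that its path of ancestor nodes is recorded during rollout, the sketch below shows one way such subgroup-normalized advantages could be computed. The function name `treepo_advantages`, the toy path encoding, and the exact normalization are illustrative assumptions rather than the paper's verbatim formula; it follows the simple-averaging choice reported above.

```python
import numpy as np
from collections import defaultdict

def treepo_advantages(paths, rewards, eps=1e-6):
    """Illustrative tree-based advantage estimate (not the authors' exact formula).

    paths[i]   -- tuple of node ids along trajectory i, one entry per depth
                  (node ids are assumed unique across the tree)
    rewards[i] -- scalar outcome reward for trajectory i

    At each depth, trajectories sharing the same ancestor node form a
    subgroup; within each subgroup the reward is mean-centered and
    variance-normalized. Per-depth signals are then simply averaged,
    mirroring the finding that simple averaging beats subgroup-size weighting.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(paths)
    max_depth = max(len(p) for p in paths)
    per_depth = np.zeros((max_depth, n))

    for d in range(max_depth):
        groups = defaultdict(list)
        for i, p in enumerate(paths):
            if len(p) > d:
                groups[p[d]].append(i)      # subgroup = shared ancestor at depth d
        for idx in groups.values():
            r = rewards[idx]
            per_depth[d, idx] = (r - r.mean()) / (r.std() + eps)

    return per_depth.mean(axis=0)           # unweighted average across depths

# Toy example: four trajectories, paired by a shared first segment.
paths = [("a", "a1"), ("a", "a2"), ("b", "b1"), ("b", "b2")]
rewards = [1.0, 0.0, 0.0, 0.0]
print(treepo_advantages(paths, rewards))
```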

Sampling Efficiency and Scaling Analysis

TreePO achieves substantial efficiency gains by amortizing computation over shared prefixes and enabling parallelized segment-level decoding. Offline efficiency analyses across Qwen2.5 variants show average improvements of +40% in trajectories per second (TrajPS) and +30% in tokens per second (TokenPS) compared to conventional sampling (Figure 5).

Figure 5: Qwen2.5-7B-Instruct throughput comparison; tree-based sampling yields higher TrajPS and TokenPS across tree depths.

Efficiency peaks at intermediate depth-segment configurations, with optimal settings being model-specific. For instruction-tuned models, mid-depth trees balance batched prefilling and parallel decoding, while math-focused models benefit from longer segments and shallower trees. Rollout scaling is workload-dependent; shared-prefix reuse boosts throughput for structured tasks, but excessive divergence degrades batching efficiency (Figure 6).

Figure 6: Test-time compute scaling of TreePO sampling; larger divergence factors achieve higher peak performance at increased compute cost, enabling flexible compute-optimal inference.

Heuristic Branching and Exploration Control

TreePO enables heuristic control over branching assignment at each segment, leveraging log probabilities to allocate branching budgets. Experiments reveal that one-sided branching patterns, such as always favoring low-probability paths, harm performance, increasing entropy and response length without improving accuracy. Exploration must be meaningful; indiscriminately allocating budget to low-probability segments leads to irrelevant reasoning paths (Figure 7).

Figure 7: Probability-based heuristic tree branching budget assignment; static controls underperform, while balanced strategies maintain effective exploration-exploitation trade-offs.
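
As a rough illustration of probability-driven budget allocation, the sketch below splits a fixed branching budget across sibling nodes using a tempered softmax over their negated mean log-probabilities. The function `allocate_branches` and its parameters are assumptions for illustration, not the paper's exact rule: a large temperature approaches a uniform split, a very small one reproduces the always-branch-the-least-likely pattern reported above to be harmful, and intermediate settings give the balanced behavior the experiments favor.

```python
import numpy as np

def allocate_branches(mean_logps, total_budget, temperature=1.0):
    """Split a branching budget across sibling nodes at one tree depth.

    mean_logps   -- mean log-probability of each node's latest segment
    total_budget -- total number of child branches at this depth
                    (assumed >= number of nodes)
    temperature  -- large values approach a uniform split; values near zero
                    concentrate the budget on the least likely node
    """
    logps = np.asarray(mean_logps, dtype=float)
    w = np.exp(-logps / temperature)        # higher weight for uncertain nodes
    w /= w.sum()
    # Guarantee one continuation per node, then hand out the rest by weight.
    alloc = np.ones(len(logps), dtype=int)
    remainder = total_budget - len(logps)
    alloc += np.floor(w * remainder).astype(int)
    # Flooring may leave a few branches unassigned; give them to the
    # highest-weight nodes.
    for i in np.argsort(-w)[: total_budget - alloc.sum()]:
        alloc[i] += 1
    return alloc

# Example: three nodes, 8 branches, mildly favoring the uncertain ones.
print(allocate_branches([-0.2, -1.5, -3.0], total_budget=8, temperature=2.0))
```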

Main Results and Trade-offs

TreePO sampling and advantage estimation consistently improve training stability and computational efficiency. Across benchmarks, TreePO improves overall accuracy over GRPO (e.g., from 46.63% to 54.61%) and reduces GPU hours by 22–43%. While tree-based sampling may converge more slowly or yield slightly lower peak accuracy in some configurations, the trade-off is favorable for large-scale training.

Implications and Future Directions

TreePO's segment-based tree search and hierarchical advantage estimation provide a scalable framework for RL-based LLM post-training. The method is particularly suited for long-horizon reasoning, multi-turn dialogue, and multi-agent systems, where efficient exploration and precise credit assignment are critical. The flexible compute scaling and heuristic control mechanisms enable adaptive inference strategies tailored to resource constraints.

Theoretical implications include the potential for more robust credit assignment in sparse-reward settings and the integration of tree-based exploration with other RL paradigms. Practically, TreePO offers a path toward efficient, scalable RL training for LLMs, reducing the sample and compute requirements without sacrificing performance.

Conclusion

TreePO advances policy optimization for LLMs by reformulating rollouts as tree-structured searches and introducing hierarchical advantage estimation. The framework achieves significant efficiency gains, stable training, and strong performance across reasoning benchmarks. Its structural modeling and adaptive control mechanisms open new avenues for scaling RL to complex, long-horizon tasks, with implications for both theoretical research and practical deployment in AI systems.
