TreePS-RAG: Tree Supervision in RAG
- The paper introduces TreePS-RAG, a framework that models multi-step reasoning and retrieval as a rollout tree for enhanced credit assignment.
- It employs Monte Carlo process-level credit assignment to compute step-wise advantages, thereby improving RL training efficiency.
- TreePS-RAG achieves consistent improvements on multi-hop QA tasks across benchmarks, outperforming conventional outcome-only and process-supervised methods.
TreePS-RAG is a reinforcement learning (RL) framework for agentic retrieval-augmented generation (RAG) that introduces tree-based process supervision. It models the entire multi-step reasoning and retrieval process as a rollout tree, enabling step-wise credit assignment and process-level supervision without intermediate annotations. TreePS-RAG achieves consistent improvements on multi-hop and general question answering (QA) tasks compared to outcome-supervised and leading process-supervised RL methods (Zhang et al., 11 Jan 2026).
1. Formalization of Agentic RAG as a Rollout Tree
In TreePS-RAG, the agentic RAG paradigm is formulated using the ReAct framework, in which an LLM alternates between reasoning and issuing information retrieval (IR) actions, or emits an answer to terminate the episode. Each RL episode consists of a sequence of interleaved reasoning steps, actions, and observations:
- The state at step $t$: $s_t = (q, r_1, o_1, \ldots, r_{t-1}, o_{t-1})$, where $q$ is the user question, $r_i$ is the reasoning string at step $i$, and $o_i$ is either search results or null.
- The action space: $\mathcal{A} = \{\texttt{search}(\cdot), \texttt{answer}(\cdot)\}$.
- The policy samples: $a_t \sim \pi_\theta(\cdot \mid s_t)$.
TreePS-RAG represents the possible agentic trajectories as a tree $\mathcal{T}$:
- Each node corresponds to a state $s_t$.
- The root is the initial state $s_1 = (q)$.
- Edges represent transitions to children induced by the available actions.
- Answering nodes are leaves; nodes reached via search actions can be expanded further. A root-to-leaf path denotes a complete trajectory.
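The rollout-tree structure above can be sketched as a small data type. This is a minimal illustration, not the paper's implementation; the field names (`reasoning`, `action`, `observation`, `reward`) are assumptions chosen to mirror the formalization:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One rollout-tree node: the state reached after taking an action."""
    reasoning: str                      # reasoning string r_t emitted at this step
    action: str                         # "search" or "answer"
    observation: Optional[str] = None   # retrieved passages, or None for answers
    children: list = field(default_factory=list)
    reward: Optional[float] = None      # outcome reward, set only on leaves

    @property
    def is_leaf(self) -> bool:
        # Answering nodes terminate the episode; search nodes can be expanded.
        return self.action == "answer"

# A toy two-step trajectory: one search node expanded into one answer leaf.
root = Node(reasoning="Need supporting facts first.", action="search",
            observation="Paris is the capital of France.")
leaf = Node(reasoning="Answer directly.", action="answer", reward=1.0)
root.children.append(leaf)
```

A root-to-leaf walk over `children` then recovers one complete trajectory.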
2. Monte Carlo Process-Level Credit Assignment
Standard RL for agentic RAG applies only a sparse outcome reward to each complete trajectory, obtained by matching the model's final answer against the ground truth. TreePS-RAG propagates this outcome reward back to all process steps using Monte Carlo estimation over the tree:
- For a leaf trajectory $\tau$ with predicted answer $\hat{y}$ and gold answer $y^{*}$, the reward is $R(\tau) = \mathrm{EM}(\hat{y}, y^{*})$, with $\mathrm{EM}$ denoting exact match.
- For a non-leaf node $v$, let $\mathcal{L}(v)$ be its descendant leaves. Estimated value: $V(v) = \frac{1}{|\mathcal{L}(v)|} \sum_{\ell \in \mathcal{L}(v)} R(\ell)$.
- Step-wise process advantage: $A^{\mathrm{proc}}(v) = V(v) - V(\mathrm{pa}(v))$, where $\mathrm{pa}(v)$ denotes the parent of $v$.
$A^{\mathrm{proc}}(v)$ is used as a dense RL reward at each step, enabling more informative credit assignment than outcome-only RL.
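The Monte Carlo estimation above reduces to a bottom-up average over leaf rewards. A minimal sketch, using plain dict-based nodes rather than the paper's data structures:

```python
def leaf_rewards(node):
    """Collect the outcome rewards of all descendant leaves (a leaf yields itself)."""
    if not node["children"]:
        return [node["reward"]]
    rewards = []
    for child in node["children"]:
        rewards.extend(leaf_rewards(child))
    return rewards

def value(node):
    """Monte Carlo value estimate V(v): mean reward over descendant leaves."""
    rs = leaf_rewards(node)
    return sum(rs) / len(rs)

def process_advantage(child, parent):
    """Step-wise process advantage of the step that produced `child`."""
    return value(child) - value(parent)

# Toy tree: two search branches under the root; leaves carry exact-match rewards.
leaf_a = {"children": [], "reward": 1.0}
leaf_b = {"children": [], "reward": 0.0}
leaf_c = {"children": [], "reward": 0.0}
branch1 = {"children": [leaf_a, leaf_b], "reward": None}
branch2 = {"children": [leaf_c], "reward": None}
root = {"children": [branch1, branch2], "reward": None}

print(round(value(root), 3))                       # 0.333
print(round(process_advantage(branch1, root), 3))  # 0.167
```

Here `branch1` gets a positive advantage because its subtree reaches the correct answer more often than the tree as a whole.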
3. Efficient Online Tree Construction and Pruning
TreePS-RAG presents an online, generate-then-prune algorithm for tractable rollout tree construction:
- Given constraints: a rollout budget $N$ (number of root-to-leaf rollouts), a maximum depth $D$, and a per-node retention limit.
- At each depth $d$, for each node, sample candidate children from $\pi_\theta$.
- For each "search" action, retrieve candidate passages; compute Jaccard similarity over the passage sets of sibling child nodes; cluster them by hierarchical clustering with $1 - \mathrm{Jaccard}$ as the distance.
- Retain one representative per cluster. This preserves branch diversity and keeps the total node count within the budget.
Algorithm 1: Online Tree Construction (abridged)

```
for d = 1, ..., D:
    for parent in M(d-1):
        candidates = [sample child from pi_theta]
        C_search, C_ans = partition(candidates)
        # Add answer children as leaves
        # For search children: retrieve passages, cluster, prune
        clusters = hierarchical_cluster(C_search, distance = 1 - Jaccard)
        kept = [pick one per cluster]
        M(d).extend(kept)
```
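The similarity-based pruning step can be sketched in isolation. This is a simplified stand-in: a greedy threshold clustering replaces the paper's hierarchical clustering, and the `threshold` value is illustrative, not taken from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of retrieved passage IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def prune_by_similarity(children, threshold=0.5):
    """Keep one representative per group of sibling search-children whose
    passage sets are near-duplicates (distance 1 - Jaccard below `threshold`).
    A child survives only if it is far from every representative kept so far."""
    kept = []
    for passages in children:
        if all(1.0 - jaccard(passages, rep) >= threshold for rep in kept):
            kept.append(passages)
    return kept

# Two near-duplicate retrievals collapse into one branch; a distinct one survives.
children = [{"p1", "p2", "p3"}, {"p1", "p2"}, {"p7", "p8"}]
print(len(prune_by_similarity(children)))  # 2
```

Raising the threshold prunes more aggressively, trading branch diversity for a smaller tree.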
4. RL Objective and Training Integration
With the constructed tree, root-to-leaf trajectories are sampled for policy updates. The step-level process advantage is applied to each token of the corresponding generation, with observation tokens masked out of the loss. Training follows a PPO-style policy gradient of the form
$$\mathcal{J}(\theta) = \mathbb{E}\Big[\min\big(\rho_t A_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\big)\Big], \qquad \rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$
where the advantage $A_t$ combines the outcome advantage $A^{\mathrm{out}}$ with the process advantage $A^{\mathrm{proc}}_t$, e.g. $A_t = A^{\mathrm{out}} + A^{\mathrm{proc}}_t$.
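The token-level application of step advantages with observation masking can be sketched as follows; the function and argument names are illustrative, and the advantage values are made up for the example:

```python
def token_advantages(token_step_ids, step_advantages, observation_mask):
    """Broadcast each step's process advantage onto the tokens generated at
    that step, zeroing out observation (retrieved-passage) tokens so that
    retrieval results contribute no policy gradient."""
    out = []
    for step, is_obs in zip(token_step_ids, observation_mask):
        out.append(0.0 if is_obs else step_advantages[step])
    return out

# Three tokens from step 0 (one of them an observation token), two from step 1.
adv = token_advantages(
    token_step_ids=[0, 0, 0, 1, 1],
    step_advantages={0: 0.17, 1: -0.33},
    observation_mask=[False, False, True, False, False],
)
print(adv)  # [0.17, 0.17, 0.0, -0.33, -0.33]
```

The resulting per-token advantages multiply the clipped importance ratios in the PPO-style objective.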
5. Empirical Performance and Benchmarking
TreePS-RAG was evaluated across seven QA benchmarks: HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle, Natural Questions, TriviaQA, and PopQA. The models tested included Qwen2.5-3B, Qwen2.5-7B, Qwen3-4B-Instruct-2507, and Qwen3-8B. Metrics used: exact match (EM).
Results (EM, Qwen3-4B-Instruct-2507; fixed rollout budget)

| Task | Search-R1 | TreePS-RAG | Δ |
|---|---|---|---|
| HotpotQA | 0.474 | 0.480 | +0.006 |
| 2Wiki | 0.517 | 0.541 | +0.024 |
| MuSiQue | 0.225 | 0.233 | +0.008 |
| Bamboogle | 0.536 | 0.536 | — |
| Trivia | 0.675 | 0.680 | +0.005 |
| PopQA | 0.462 | 0.488 | +0.026 |
| NQ | 0.447 | 0.476 | +0.029 |
TreePS-RAG yields an average gain of approximately +0.020 EM across the tasks. Gains are consistent across all backbone scales and in/out-of-domain settings, outperforming outcome-only RL (Search-R1) as well as process-supervised methods such as ReasonRAG, StepSearch, and GiGPO (Zhang et al., 11 Jan 2026).
6. Ablation Studies and Analysis
Ablation experiments investigate the key contributions of process advantage and similarity-based pruning:
- Removing the process advantage while keeping tree rollout recovers performance comparable to outcome-only Search-R1; additionally removing similarity pruning causes multi-hop QA EM to drop sharply, highlighting the necessity of diverse exploration:

| Variant | Avg EM |
|---|---|
| Search-R1 (GRPO) | 0.490 |
| Ours w/o process advantage (PA) | 0.480 |
| Ours w/o PA & similarity pruning (SP) | 0.452 |
| Ours (full) | 0.490 |
- Increasing the tree scale (larger branching factors) yields higher EM (default 0.490, larger tree 0.495).
- In continuation experiments, process-supervised checkpoints recover from failed prefixes more successfully, indicating clearer intermediate reasoning and better error correction.
7. Position within the Landscape and Distinctions from Related Work
TreePS-RAG is distinct from methods such as the implicit, aggregated summary RAG approach in (Gupte et al., 12 Oct 2025) and Tree-RAG (Fatehkia et al., 2024):
- (Gupte et al., 12 Oct 2025) addresses tree-structured knowledge summarization and indexing for efficient classical RAG retrieval, not agentic, multi-step RL for reasoning/retrieval.
- (Fatehkia et al., 2024) uses tree-structured context augmentation for entity hierarchies, but does not model the multi-step action space as a rollout tree or introduce RL-based process supervision.
- TreePS-RAG focuses specifically on integrating process-level Monte Carlo credit assignment into PPO-style RL training on agentic RAG tasks, using tree-based sampling and similarity-pruned exploration. A plausible implication is that tree-based process supervision, as exemplified by TreePS-RAG, is orthogonal and potentially complementary to tree-structured knowledge aggregation for RAG indexing.
8. Limitations and Prospects for Extension
TreePS-RAG achieves step-wise credit assignment without needing human-provided intermediate labels or process-level annotation. However, the current framework relies on Monte Carlo estimates over finite sampled trees, and the accuracy of step advantages can be influenced by tree size and diversity. Future research may investigate extensions to denser reward structures, more scalable diversity mechanisms, or integration with structured implicit-knowledge summaries for enhanced retrieval-grounded reasoning (Zhang et al., 11 Jan 2026, Gupte et al., 12 Oct 2025).