TreePS-RAG: Tree Supervision in RAG
- The paper introduces TreePS-RAG, a framework that models multi-step reasoning and retrieval as a rollout tree for enhanced credit assignment.
- It employs Monte Carlo process-level credit assignment to compute step-wise advantages, thereby improving RL training efficiency.
- TreePS-RAG achieves consistent improvements on multi-hop QA tasks across benchmarks, outperforming conventional outcome-only and process-supervised methods.
TreePS-RAG is a reinforcement learning (RL) framework for agentic retrieval-augmented generation (RAG) that introduces tree-based process supervision. It models the entire multi-step reasoning and retrieval process as a rollout tree, enabling step-wise credit assignment and process-level supervision without intermediate annotations. TreePS-RAG achieves consistent improvements on multi-hop and general question answering (QA) tasks compared to outcome-supervised and leading process-supervised RL methods (Zhang et al., 11 Jan 2026).
1. Formalization of Agentic RAG as a Rollout Tree
In TreePS-RAG, the agentic RAG paradigm is formulated using the ReAct framework, in which an LLM alternates between reasoning and issuing information retrieval (IR) actions, or emits an answer to terminate the episode. Each RL episode consists of a sequence of interleaved reasoning steps, actions, and observations:
- The state at step $t$: $s_t = (q, r_1, o_1, \ldots, r_{t-1}, o_{t-1})$, where $q$ is the user question, $r_i$ is the reasoning string at step $i$, and $o_i$ is either search results or null.
- The action space: $\mathcal{A} = \{\texttt{search}(\cdot), \texttt{answer}(\cdot)\}$.
- The policy samples: $a_t \sim \pi_\theta(\cdot \mid s_t)$.
TreePS-RAG represents the possible agentic trajectories as a tree $\mathcal{T}$:
- Each node corresponds to a state $s_t$.
- The root is the initial state $s_1 = (q)$.
- Edges represent transitions to children induced by the available actions.
- Answering nodes are leaves; nodes reached via search actions can be expanded further. A root-to-leaf path denotes a complete trajectory.
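The rollout-tree structure above can be sketched as a small data type. This is a minimal illustration, not the paper's implementation; the field names (`reasoning`, `action`, `observation`, `reward`) are assumptions chosen to mirror the formalization:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One rollout-tree node: the state reached after taking an action."""
    reasoning: str                      # reasoning string r_t emitted at this step
    action: str                         # "search" or "answer"
    observation: Optional[str] = None   # retrieved passages, or None for answers
    children: list = field(default_factory=list)
    reward: Optional[float] = None      # outcome reward, set only on leaves

    @property
    def is_leaf(self) -> bool:
        # Answering nodes terminate the episode; search nodes can be expanded.
        return self.action == "answer"

# A toy two-step trajectory: one search node expanded into one answer leaf.
root = Node(reasoning="Need supporting facts first.", action="search",
            observation="Paris is the capital of France.")
leaf = Node(reasoning="Answer directly.", action="answer", reward=1.0)
root.children.append(leaf)
```

A root-to-leaf walk over `children` then recovers one complete trajectory.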
2. Monte Carlo Process-Level Credit Assignment
Standard RL for agentic RAG applies only a sparse outcome reward to each complete trajectory, obtained by matching the model's final answer against the ground truth. TreePS-RAG propagates this outcome reward back to all process steps using Monte Carlo estimation over the tree:
- For a leaf trajectory $\tau$ with predicted answer $\hat{y}$ and gold answer $y^{*}$, the reward is $R(\tau) = \mathrm{EM}(\hat{y}, y^{*})$, with $\mathrm{EM}$ denoting exact match.
- For a non-leaf node $v$, let $\mathcal{L}(v)$ be its descendant leaves. Estimated value: $V(v) = \frac{1}{|\mathcal{L}(v)|} \sum_{\ell \in \mathcal{L}(v)} R(\ell)$.
- Step-wise process advantage: $A^{\mathrm{proc}}(v) = V(v) - V(\mathrm{pa}(v))$, where $\mathrm{pa}(v)$ denotes the parent of $v$.
$A^{\mathrm{proc}}(v)$ is used as a dense RL reward at each step, enabling more informative credit assignment than outcome-only RL.
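The Monte Carlo estimation above reduces to a bottom-up average over leaf rewards. A minimal sketch, using plain dict-based nodes rather than the paper's data structures:

```python
def leaf_rewards(node):
    """Collect the outcome rewards of all descendant leaves (a leaf yields itself)."""
    if not node["children"]:
        return [node["reward"]]
    rewards = []
    for child in node["children"]:
        rewards.extend(leaf_rewards(child))
    return rewards

def value(node):
    """Monte Carlo value estimate V(v): mean reward over descendant leaves."""
    rs = leaf_rewards(node)
    return sum(rs) / len(rs)

def process_advantage(child, parent):
    """Step-wise process advantage of the step that produced `child`."""
    return value(child) - value(parent)

# Toy tree: two search branches under the root; leaves carry exact-match rewards.
leaf_a = {"children": [], "reward": 1.0}
leaf_b = {"children": [], "reward": 0.0}
leaf_c = {"children": [], "reward": 0.0}
branch1 = {"children": [leaf_a, leaf_b], "reward": None}
branch2 = {"children": [leaf_c], "reward": None}
root = {"children": [branch1, branch2], "reward": None}

print(round(value(root), 3))                       # 0.333
print(round(process_advantage(branch1, root), 3))  # 0.167
```

Here `branch1` gets a positive advantage because its subtree reaches the correct answer more often than the tree as a whole.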
3. Efficient Online Tree Construction and Pruning
TreePS-RAG presents an online, generate-then-prune algorithm for tractable rollout tree construction:
- Given constraints: a rollout budget $N$ (number of root-to-leaf rollouts), a maximum depth $D$, and a per-node retention limit.
- At each depth $d$, for each node, sample candidate children from $\pi_\theta$.
- For each "search" action, retrieve candidate passages; compute Jaccard similarity over the passage sets of sibling child nodes; cluster them by hierarchical clustering with $1 - \mathrm{Jaccard}$ as the distance.
- Retain one representative per cluster. This preserves branch diversity and keeps the total node count within the budget.
Algorithm 1: Online Tree Construction (abridged)

```
for d = 1, ..., D:
    for parent in M(d-1):
        candidates = [sample child from pi_theta]
        C_search, C_ans = partition(candidates)
        # Add answer children as leaves
        # For search children: retrieve passages, cluster, prune
        clusters = hierarchical_cluster(C_search, distance = 1 - Jaccard)
        kept = [pick one per cluster]
        M(d).extend(kept)
```
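The similarity-based pruning step can be sketched in isolation. This is a simplified stand-in: a greedy threshold clustering replaces the paper's hierarchical clustering, and the `threshold` value is illustrative, not taken from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of retrieved passage IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def prune_by_similarity(children, threshold=0.5):
    """Keep one representative per group of sibling search-children whose
    passage sets are near-duplicates (distance 1 - Jaccard below `threshold`).
    A child survives only if it is far from every representative kept so far."""
    kept = []
    for passages in children:
        if all(1.0 - jaccard(passages, rep) >= threshold for rep in kept):
            kept.append(passages)
    return kept

# Two near-duplicate retrievals collapse into one branch; a distinct one survives.
children = [{"p1", "p2", "p3"}, {"p1", "p2"}, {"p7", "p8"}]
print(len(prune_by_similarity(children)))  # 2
```

Raising the threshold prunes more aggressively, trading branch diversity for a smaller tree.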
4. RL Objective and Training Integration
With the constructed tree, root-to-leaf trajectories are sampled for policy updates. The step-level process advantage is applied to each token of the corresponding generation, with observation tokens masked out of the loss. Training follows a PPO-style policy gradient of the form
$$\mathcal{J}(\theta) = \mathbb{E}\Big[\min\big(\rho_t A_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\big)\Big], \qquad \rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$
where the advantage $A_t$ combines the outcome advantage $A^{\mathrm{out}}$ with the process advantage $A^{\mathrm{proc}}_t$, e.g. $A_t = A^{\mathrm{out}} + A^{\mathrm{proc}}_t$.
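The token-level application of step advantages with observation masking can be sketched as follows; the function and argument names are illustrative, and the advantage values are made up for the example:

```python
def token_advantages(token_step_ids, step_advantages, observation_mask):
    """Broadcast each step's process advantage onto the tokens generated at
    that step, zeroing out observation (retrieved-passage) tokens so that
    retrieval results contribute no policy gradient."""
    out = []
    for step, is_obs in zip(token_step_ids, observation_mask):
        out.append(0.0 if is_obs else step_advantages[step])
    return out

# Three tokens from step 0 (one of them an observation token), two from step 1.
adv = token_advantages(
    token_step_ids=[0, 0, 0, 1, 1],
    step_advantages={0: 0.17, 1: -0.33},
    observation_mask=[False, False, True, False, False],
)
print(adv)  # [0.17, 0.17, 0.0, -0.33, -0.33]
```

The resulting per-token advantages multiply the clipped importance ratios in the PPO-style objective.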
5. Empirical Performance and Benchmarking
TreePS-RAG was evaluated across seven QA benchmarks: HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle, Natural Questions, TriviaQA, and PopQA. The models tested included Qwen2.5-3B, Qwen2.5-7B, Qwen3-4B-Instruct-2507, and Qwen3-8B. Metrics used: exact match (EM).
Results (EM, Qwen3-4B-Instruct-2507; fixed rollout budget)

| Task | Search-R1 | TreePS-RAG | Δ |
|---|---|---|---|
| HotpotQA | 0.474 | 0.480 | +0.006 |
| 2Wiki | 0.517 | 0.541 | +0.024 |
| MuSiQue | 0.225 | 0.233 | +0.008 |
| Bamboogle | 0.536 | 0.536 | — |
| Trivia | 0.675 | 0.680 | +0.005 |
| PopQA | 0.462 | 0.488 | +0.026 |
| NQ | 0.447 | 0.476 | +0.029 |
TreePS-RAG yields an average gain of approximately +0.020 EM across the tasks. Gains are consistent across all backbone scales and in/out-of-domain settings, outperforming outcome-only RL (Search-R1) as well as process-supervised methods such as ReasonRAG, StepSearch, and GiGPO (Zhang et al., 11 Jan 2026).
6. Ablation Studies and Analysis
Ablation experiments investigate the key contributions of process advantage and similarity-based pruning:
- Removing the process advantage while keeping tree rollout recovers performance comparable to outcome-only Search-R1; additionally removing similarity pruning causes multi-hop QA EM to drop sharply, highlighting the necessity of diverse exploration:

| Variant | Avg EM |
|---|---|
| Search-R1 (GRPO) | 0.490 |
| Ours w/o process advantage (PA) | 0.480 |
| Ours w/o PA & similarity pruning (SP) | 0.452 |
| Ours (full) | 0.490 |
- Increasing the tree scale (larger branching factors) yields higher EM (default 0.490, larger tree 0.495).
- In continuation experiments, process-supervised checkpoints recover from failed prefixes more successfully, indicating clearer intermediate reasoning and better error correction.
7. Position within the Landscape and Distinctions from Related Work
TreePS-RAG is distinct from methods such as the implicit, aggregated summary RAG approach in (Gupte et al., 12 Oct 2025) and Tree-RAG (Fatehkia et al., 2024):
- (Gupte et al., 12 Oct 2025) addresses tree-structured knowledge summarization and indexing for efficient classical RAG retrieval, not agentic, multi-step RL for reasoning/retrieval.
- (Fatehkia et al., 2024) uses tree-structured context augmentation for entity hierarchies, but does not model the multi-step action space as a rollout tree or introduce RL-based process supervision.
- TreePS-RAG focuses specifically on integrating process-level Monte Carlo credit assignment into PPO-style RL training on agentic RAG tasks, using tree-based sampling and similarity-pruned exploration. A plausible implication is that tree-based process supervision, as exemplified by TreePS-RAG, is orthogonal and potentially complementary to tree-structured knowledge aggregation for RAG indexing.
8. Limitations and Prospects for Extension
TreePS-RAG achieves step-wise credit assignment without needing human-provided intermediate labels or process-level annotation. However, the current framework relies on Monte Carlo estimates over finite sampled trees, and the accuracy of step advantages can be influenced by tree size and diversity. Future research may investigate extensions to denser reward structures, more scalable diversity mechanisms, or integration with structured implicit-knowledge summaries for enhanced retrieval-grounded reasoning (Zhang et al., 11 Jan 2026, Gupte et al., 12 Oct 2025).