
TreePS-RAG: Tree Supervision in RAG

Updated 14 February 2026
  • The paper introduces TreePS-RAG, a framework that models multi-step reasoning and retrieval as a rollout tree for enhanced credit assignment.
  • It employs Monte Carlo process-level credit assignment to compute step-wise advantages, thereby improving RL training efficiency.
  • TreePS-RAG achieves consistent improvements on multi-hop QA tasks across benchmarks, outperforming conventional outcome-only and process-supervised methods.

TreePS-RAG is a reinforcement learning (RL) framework for agentic retrieval-augmented generation (RAG) that introduces tree-based process supervision. It models the entire multi-step reasoning and retrieval process as a rollout tree, enabling step-wise credit assignment and process-level supervision without intermediate annotations. TreePS-RAG achieves consistent improvements on multi-hop and general question answering (QA) tasks compared to outcome-supervised and leading process-supervised RL methods (Zhang et al., 11 Jan 2026).

1. Formalization of Agentic RAG as a Rollout Tree

In TreePS-RAG, the agentic RAG paradigm is formulated in the ReAct framework: an LLM alternates between reasoning and issuing information retrieval (IR) actions, or emits an answer to terminate the episode. Each RL episode consists of a sequence:

  • The state at step $i$: $s_i = [q,\ (r_1, a_1, o_1), \ldots, (r_{i-1}, a_{i-1}, o_{i-1})]$, where $q$ is the user question, $r_j$ is the reasoning string, $a_j \in \{\textrm{search},\ \textrm{answer}\}$, and $o_j$ is either search results or null.
  • The action space: $A = \{\textrm{search},\ \textrm{answer}\}$.
  • The policy samples $(r_i, a_i) \sim \pi_\theta(\cdot \mid s_i)$.

TreePS-RAG represents possible agentic trajectories as a tree $T = (N, E)$:

  • Each node $n_i$ corresponds to $(r_i, a_i, o_i)$.
  • The root $n_{\mathrm{root}}$ is $[q]$.
  • Edges represent transitions to children via the available actions.
  • Answer nodes are leaves; nodes with $a_i = \textrm{search}$ can be expanded. A root-to-leaf path denotes a complete trajectory.
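As a concrete illustration, the rollout tree can be held in a simple node structure. This is a hypothetical sketch, not the paper's implementation; the field names `reasoning`, `action`, and `observation` mirror the $(r_i, a_i, o_i)$ triple above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One step (r_i, a_i, o_i) of the rollout tree; the root holds only q."""
    reasoning: str = ""                # r_i: reasoning string
    action: str = "root"               # a_i in {"search", "answer"}; "root" marks n_root
    observation: Optional[str] = None  # o_i: retrieved passages, or None for answers
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Answer nodes terminate the episode, so they are always leaves.
        return self.action == "answer"

# The root carries only the question q; each child is one sampled step.
root = Node(observation="q: example question")
step = Node(reasoning="Look up the entity first.", action="search",
            observation="retrieved passages ...")
final = Node(reasoning="The answer follows from the passages.", action="answer")
step.children.append(final)
root.children.append(step)
```

A root-to-leaf path through such nodes (here `root -> step -> final`) is one complete trajectory.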

2. Monte Carlo Process-Level Credit Assignment

Standard RL for agentic RAG applies only a sparse outcome reward, obtained by matching the model's final answer to the ground truth, at the end of each trajectory. TreePS-RAG propagates this outcome reward back to all process steps using Monte Carlo estimation over the tree:

  • For a leaf trajectory $y$, the reward is $R(y) = \textrm{EM}(a_\textrm{pred}, a_\textrm{gold}) \in \{0, 1\}$, where $\textrm{EM}$ denotes exact match.
  • For a non-leaf node $n$, let $L(n)$ be the set of its descendant leaves. The estimated value is

$$V(n) = \frac{1}{|L(n)|} \sum_{\ell \in L(n)} R(\ell).$$

  • Step-wise process advantages:

$$A_{\mathrm{global}}(n) = V(n) - V(n_{\mathrm{root}}), \qquad A_{\mathrm{local}}(n) = V(n) - V(\mathrm{parent}(n)),$$

$$A(n) = \frac{1}{\sqrt{|L(n)|}} \left[ 2V(n) - V(n_{\mathrm{root}}) - V(\mathrm{parent}(n)) \right].$$

$A(n)$, which sums the global and local advantages with a $1/\sqrt{|L(n)|}$ scaling, is used as a dense RL reward for $(r_i, a_i)$ at each step, enabling more informative credit assignment than outcome-only RL.
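A minimal sketch of these Monte Carlo estimates, using plain dictionaries for tree nodes (an illustrative stand-in for the paper's data structures; leaves carry their own EM reward):

```python
import math

def descendant_leaf_rewards(node):
    """Collect R(l) over the descendant leaves L(n)."""
    if not node["children"]:
        return [node["reward"]]
    rewards = []
    for child in node["children"]:
        rewards.extend(descendant_leaf_rewards(child))
    return rewards

def value(node):
    """V(n): mean outcome reward over descendant leaves."""
    rs = descendant_leaf_rewards(node)
    return sum(rs) / len(rs)

def advantage(node, root, parent):
    """A(n) = [2 V(n) - V(n_root) - V(parent(n))] / sqrt(|L(n)|)."""
    n_leaves = len(descendant_leaf_rewards(node))
    return (2 * value(node) - value(root) - value(parent)) / math.sqrt(n_leaves)

# Toy tree: leaves carry EM rewards in {0, 1}.
leaf = lambda r: {"children": [], "reward": r}
left = {"children": [leaf(1), leaf(0)]}   # V(left)  = 0.5
right = {"children": [leaf(0)]}           # V(right) = 0.0
root = {"children": [left, right]}        # V(root)  = 1/3

adv = advantage(left, root, parent=root)  # (1.0 - 1/3 - 1/3) / sqrt(2)
```

Note how the search branch with a correct descendant (`left`) receives a positive advantage relative to both the root and its parent, while purely failing branches would receive a negative one.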

3. Efficient Online Tree Construction and Pruning

TreePS-RAG presents an online, generate-then-prune algorithm for tractable rollout tree construction:

  • Given constraints: budget $N$ (number of root-to-leaf rollouts), depth $D$, and per-node retention $N_\text{retain}$.
  • At each depth $d$, for each node, sample $B_d = \lceil N / n_{d-1} \rceil$ children from $\pi_\theta$, where $n_{d-1}$ is the number of nodes retained at depth $d-1$.
  • For each "search" action, retrieve candidate passages; compute Jaccard similarity over the children's passage sets; cluster via hierarchical clustering with $1 - \textrm{Jaccard}$ as the distance.
  • Retain one representative per cluster. This preserves branch diversity and keeps the total node count $O(ND)$.
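The pruning step can be sketched as follows. This is a simplified, greedy single-linkage stand-in for the paper's hierarchical clustering; the function names and the 0.5 threshold are illustrative assumptions:

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; identical passage sets have distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def prune_by_similarity(passage_sets, threshold=0.5):
    """Greedy single-linkage grouping: merge children whose retrieved passage
    sets are within `threshold` distance; keep one child index per cluster."""
    clusters = []  # each cluster is a list of child indices
    for i, s in enumerate(passage_sets):
        for cluster in clusters:
            if any(jaccard_distance(s, passage_sets[j]) <= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return [cluster[0] for cluster in clusters]  # one representative each

# Children 0 and 1 retrieve near-identical passages; child 2 is distinct.
sets = [{"p1", "p2", "p3"}, {"p1", "p2"}, {"p7", "p8"}]
kept = prune_by_similarity(sets, threshold=0.5)
```

Children 0 and 1 collapse into one cluster (distance 1/3), so only children 0 and 2 are expanded further, preserving the genuinely different retrieval branch.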

Algorithm 1: Online Tree Construction (abridged)

for d = 1, ..., D:
    for parent in M(d-1):                       # nodes retained at the previous depth
        candidates = sample B_d children from pi_theta given parent's state
        C_search, C_ans = partition(candidates)  # split by action type
        # Answer children become leaves of the tree.
        # For search children: retrieve passages, then cluster and prune.
        clusters = hierarchical_cluster(C_search, distance = 1 - Jaccard)
        kept = one representative per cluster
        M(d).extend(C_ans + kept)
This approach ensures that the computational cost remains comparable to $N$ flat rollouts, but with greater exploratory coverage.

4. RL Objective and Training Integration

With the constructed tree, $N$ root-to-leaf trajectories are sampled. The step-level process advantage $A(n)$ is applied to each token of the corresponding $(r_i, a_i)$ generation, with observation tokens masked out. Training follows a PPO-style policy gradient:

$$\mathcal{J}(\theta) = \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}} \sum_{t=1}^{|y|} A_t \log \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid y_{<t})} - \beta\, \mathrm{KL}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$

Alternatively, the RL objective combines the outcome reward $R(y)$ with the process advantages $\sum_i A(n_i)$:

$$\max_\theta\ \mathbb{E}_{y \sim \pi_\theta}\left[ R(y) + \lambda \sum_{i=1}^{L} A(n_i) \right]$$
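A schematic of the per-token surrogate on toy log-probabilities (plain Python; the per-token advantage broadcast, the observation mask, and the k1 KL estimator are illustrative assumptions, not the paper's exact implementation):

```python
def process_supervised_objective(logp_new, logp_old, logp_ref, advantages, obs_mask, beta):
    """sum_t A_t * log(pi_new/pi_old) - beta * KL(pi_new || pi_ref).
    obs_mask[t] = 0 excludes observation tokens (retrieved passages)."""
    surrogate = sum(
        m * a * (ln - lo)
        for m, a, ln, lo in zip(obs_mask, advantages, logp_new, logp_old)
    )
    # Per-token KL penalty against the reference policy (k1 estimator: E[log p - log q]).
    kl = sum(m * (ln - lr) for m, ln, lr in zip(obs_mask, logp_new, logp_ref))
    return surrogate - beta * kl

# Toy 4-token sequence; the third token is a retrieved observation and is masked.
logp_new = [-0.5, -1.0, -0.2, -0.8]
logp_old = [-0.6, -1.1, -0.2, -0.9]
logp_ref = [-0.5, -1.0, -0.2, -0.8]
adv      = [0.3, 0.3, 0.0, 0.3]  # step advantage A(n) broadcast over the step's tokens
mask     = [1, 1, 0, 1]
obj = process_supervised_objective(logp_new, logp_old, logp_ref, adv, mask, beta=0.01)
```

Each generated token of a step shares that step's advantage $A(n)$, so a whole reasoning-plus-action segment is pushed up or down together, while retrieved text contributes no gradient.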

5. Empirical Performance and Benchmarking

TreePS-RAG was evaluated across seven QA benchmarks: HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle, Natural Questions, TriviaQA, and PopQA. The models tested included Qwen2.5-3B, Qwen2.5-7B, Qwen3-4B-Instruct-2507, and Qwen3-8B. The evaluation metric is exact match (EM).

Results (EM, Qwen3-4B-Instruct-2507, rollout budget $N = 8$):

| Task      | Search-R1 | TreePS-RAG | $\Delta$ |
|-----------|-----------|------------|----------|
| HotpotQA  | 0.474     | 0.480      | +0.006   |
| 2Wiki     | 0.517     | 0.541      | +0.024   |
| MuSiQue   | 0.225     | 0.233      | +0.008   |
| Bamboogle | 0.536     | 0.536      | 0.000    |
| Trivia    | 0.675     | 0.680      | +0.005   |
| PopQA     | 0.462     | 0.488      | +0.026   |
| NQ        | 0.447     | 0.476      | +0.029   |

TreePS-RAG yields an average gain of approximately +0.020 EM across the tasks. Gains are consistent across all backbone scales and in/out-of-domain settings, outperforming outcome-only RL (Search-R1) as well as process-supervised methods such as ReasonRAG, StepSearch, and GiGPO (Zhang et al., 11 Jan 2026).

6. Ablation Studies and Analysis

Ablation experiments investigate the key contributions of process advantage and similarity-based pruning:

  • Removing the process advantage while keeping the tree rollout recovers performance comparable to outcome-only Search-R1; additionally removing similarity-based pruning drops multi-hop QA EM sharply, highlighting the necessity of exploration:

| Variant                                | Avg EM |
|----------------------------------------|--------|
| Search-R1 (GRPO)                       | 0.490  |
| Ours w/o process advantage (PA)        | 0.480  |
| Ours w/o PA & similarity-pruning (SP)  | 0.452  |
| Ours (full)                            | 0.490  |
  • Increasing tree scale (branching factors, $N_\text{retain}$) yields higher EM (default 0.490, larger tree 0.495).
  • In continuation experiments, process-supervised checkpoints recover from failed prefixes more successfully, indicating clearer intermediate reasoning and better error correction.

7. Comparison with Tree-Structured RAG Approaches

TreePS-RAG is distinct from methods such as the implicit, aggregated summary RAG approach of (Gupte et al., 12 Oct 2025) and Tree-RAG (Fatehkia et al., 2024):

  • (Gupte et al., 12 Oct 2025) addresses tree-structured knowledge summarization and indexing for efficient classical RAG retrieval, not agentic, multi-step RL for reasoning/retrieval.
  • (Fatehkia et al., 2024) uses tree-structured context augmentation for entity hierarchies, but does not model the multi-step action space as a rollout tree or introduce RL-based process supervision.
  • TreePS-RAG focuses specifically on integrating process-level Monte Carlo credit assignment into PPO-style RL training on agentic RAG tasks, using tree-based sampling and similarity-pruned exploration. A plausible implication is that tree-based process supervision, as exemplified by TreePS-RAG, is orthogonal and potentially complementary to tree-structured knowledge aggregation for RAG indexing.

8. Limitations and Prospects for Extension

TreePS-RAG achieves step-wise credit assignment without needing human-provided intermediate labels or process-level annotation. However, the current framework relies on Monte Carlo estimates over finite sampled trees, and the accuracy of step advantages can be influenced by tree size and diversity. Future research may investigate extensions to denser reward structures, more scalable diversity mechanisms, or integration with structured implicit-knowledge summaries for enhanced retrieval-grounded reasoning (Zhang et al., 11 Jan 2026, Gupte et al., 12 Oct 2025).
