Intermediate Retrieval Reward

Updated 4 July 2026

Intermediate retrieval reward is a mechanism that assigns credit to intermediate retrieval steps, queries, or evidence sets in multi-hop reasoning.
It separates process-level evaluation from outcome-only rewards to mitigate issues like reward hacking and redundant search.
Various formulations (e.g., distance-based signals, sufficiency judgments, and semantic information gain) offer practical strategies for optimizing retrieval in RAG.

Intermediate retrieval reward denotes a reward signal attached to retrieval-relevant intermediate states, reasoning steps, executable summaries, subqueries, or evidence sets in retrieval-augmented reasoning, rather than only to the final answer or final response. In recent work on retrieval-augmented generation, agentic retrieval, knowledge-graph retrieval, lexical query expansion, and retrieval-guided reasoning, the motivating diagnosis is consistent: outcome-only supervision leaves intermediate think-and-search behavior unobserved, weakens credit assignment, and can encourage reward hacking, degraded response quality, redundant search, or reasoning drift (Zhang et al., 23 May 2025, Wei et al., 12 Nov 2025, Li et al., 25 May 2026, He et al., 30 Jul 2025).

1. Scope, definition, and boundaries

In the contemporary literature, intermediate retrieval reward is defined operationally rather than by a single canonical formula. The rewarded object may be a reasoning step $T_i$, an intermediate query, a partial lexical expansion, a retrieval trajectory on a knowledge graph, a sufficiency judgment over the accumulated evidence, or an executable summary that becomes the next query. What unifies these formulations is that the reward is computed before, or independently of, the final task outcome, and is intended to localize credit for retrieval behavior that would otherwise be absorbed into a single terminal scalar.

This distinguishes intermediate retrieval reward from outcome-only reward modeling for RAG. "RAG-Reward" explicitly defines its reward model over final generated responses conditioned on a fixed retrieval context and states that it does not define rewards for retrieval actions, per-step generation, or intermediate retrieval stages (Zhang et al., 22 Jan 2025). The same boundary is important for "LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization": its abstract states that a process-level reward module is designed to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation, but the supplied text does not include the method’s precise definitions, reward functions, algorithms, or experiments, so its exact notion of intermediate retrieval reward is not recoverable here (Zhang et al., 23 May 2025).

A further boundary concerns process reward models that use retrieval to improve evaluation rather than to reward retrieval actions themselves. "Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning" augments step-level PRM scoring with retrieval over similar questions and steps, but the retrieved context is used to improve intermediate reward estimation for mathematical reasoning, not to define a retrieval-action reward in a search environment (Zhu et al., 20 Feb 2025).

2. Principal reward formulations

The literature contains several recurring formulations, each tied to a different unit of control in the retrieval process.

Framework	Rewarded unit	Signal form
Bi-RAR (Wei et al., 12 Nov 2025)	Reasoning step $T_i$	Bidirectional step rewards $r_i^{\text{T-A}}$, $r_i^{\text{T-Q}}$
DynaSearcher (Hao et al., 23 Jul 2025)	Search trajectory	$r_{\text{gain}}=\alpha \cdot (r_{\text{recall}}-r_{\text{penalty}})$
TIRESRAG-R1 (He et al., 30 Jul 2025)	Full reasoning+retrieval trace $RD$	Binary sufficiency reward $R^S$
SubSearch (Petcu et al., 8 Apr 2026)	Subquery / decomposition	Answerability and decomposition rewards
InfoReasoner (Hu et al., 31 Jan 2026)	Retrieval action at step $t$	Synthetic semantic information gain
PRA (Sohn et al., 10 Apr 2026)	Current reasoning step $s_t$ with evidence $D_t$	Search-time step score, accumulated to rank beam-search trajectories

A representative stepwise formulation appears in Bi-RAR. For each step $T_i$, it defines a step-to-answer distance and a step-to-question distance, then converts them into the step rewards $r_i^{\text{T-A}}$ and $r_i^{\text{T-Q}}$.

These step rewards are aggregated into two cascading trajectory rewards, one for each direction, each gated by final correctness, so that early high-quality steps dominate later ones and long padded trajectories are discounted (Wei et al., 12 Nov 2025).

A trajectory-level but retrieval-specific formulation appears in DynaSearcher. It defines a retrieval recall $r_{\text{recall}}$ and a retrieval penalty $r_{\text{penalty}}$, and combines them as

$r_{\text{gain}}=\alpha \cdot (r_{\text{recall}}-r_{\text{penalty}})$

The underlying definitions depend on the number of retrieval actions and the annotated number of hops, so the reward explicitly trades off evidence coverage against redundant search (Hao et al., 23 Jul 2025).
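The trade-off between coverage and redundant search can be illustrated with a minimal Python sketch. Only the combination $r_{\text{gain}}=\alpha \cdot (r_{\text{recall}}-r_{\text{penalty}})$ comes from the text above. The concrete recall and penalty definitions, the function name, and the default `alpha` are assumptions chosen for illustration, not DynaSearcher's own definitions.

```python
def retrieval_gain(retrieved_ids, gold_ids, n_actions, n_hops, alpha=0.5):
    """Compute r_gain = alpha * (r_recall - r_penalty).

    ASSUMPTION: recall and penalty below are illustrative stand-ins.
    """
    gold = set(gold_ids)
    # Assumed recall: fraction of gold evidence documents that were retrieved.
    r_recall = len(gold & set(retrieved_ids)) / len(gold) if gold else 0.0
    # Assumed penalty: share of retrieval actions beyond the annotated hop count.
    extra_actions = max(0, n_actions - n_hops)
    r_penalty = extra_actions / n_actions if n_actions > 0 else 0.0
    return alpha * (r_recall - r_penalty)


# Full coverage in exactly the annotated number of hops.
efficient = retrieval_gain(["a", "b"], ["a", "b"], n_actions=2, n_hops=2, alpha=1.0)
# Full coverage, but with twice as many searches as annotated hops.
redundant = retrieval_gain(["a", "b", "c", "d"], ["a", "b"], n_actions=4, n_hops=2, alpha=1.0)
```

Under these assumed definitions, the redundant trajectory scores lower than the efficient one even though both retrieve all gold evidence.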

TIRESRAG-R1 defines the core retrieval-related signal as a binary sufficiency reward $R^S$ over the entire reasoning-and-retrieval trace $RD$. The judgment is made by a locally deployed LLM that checks whether the gold answer can be inferred from the retrieved context alone, regardless of whether the model’s own final answer is correct. This makes sufficiency a retrieval-quality reward rather than an answer-only reward (He et al., 30 Jul 2025).

SubSearch replaces external process supervision with intrinsic rewards derived from the agent and retriever. It defines a subquery answerability reward and a decomposition reward that combines semantic coverage and in-group splitability. Together, these directly reward well-targeted subqueries and decompositions that are collectively exhaustive and minimally redundant (Petcu et al., 8 Apr 2026).

InfoReasoner formulates intermediate retrieval reward as uncertainty reduction. Its practical reward is a synthetic semantic information gain, defined with respect to the correct semantic class by comparing a fact-free baseline context with the context after retrieval. The rewarded event is not mere document relevance but epistemic progress toward the correct answer (Hu et al., 31 Jan 2026).

3. Estimation mechanisms and reward grounding

Intermediate retrieval reward is grounded by several distinct estimation strategies. Bi-RAR uses an information-theoretic construction based on conditional normalized information distance derived from Kolmogorov complexity and approximated by language-model log-likelihoods. This turns each step into an information-bearing object that can be evaluated both toward the answer and back to the question (Wei et al., 12 Nov 2025).
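To make the idea of an information distance between a step and the answer or question concrete, the sketch below uses normalized compression distance (NCD) with `zlib` as a swapped-in stand-in. This is not Bi-RAR's estimator, which the text describes as approximated by language-model log-likelihoods. The compressor only plays the role of a crude complexity estimate, and the example strings are hypothetical.

```python
import zlib


def compressed_len(text: str) -> int:
    """Length of the zlib-compressed text, a rough proxy for its complexity."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means x and y share more information."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical question, reasoning step, and answer.
question = "Which river flows through the city where the author of Faust was born?"
step = "The author of Faust is Goethe, who was born in Frankfurt, so search for the river in Frankfurt."
answer = "The Main flows through Frankfurt, where Goethe, the author of Faust, was born."

d_step_to_answer = ncd(step, answer)      # how far the step is from the answer
d_step_to_question = ncd(step, question)  # how far the step has drifted from the question
```

On short strings a general-purpose compressor is noisy, which is one reason a model-based likelihood estimate is a more natural choice in practice.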

InfoReasoner instead estimates belief change from model outputs. It samples answers under a baseline context and a retrieval-augmented context, clusters them into semantic equivalence classes using bidirectional textual entailment, defines semantic entropy over the resulting class distribution, and interprets retrieval as valuable when it concentrates probability mass onto the correct semantic class or lowers semantic entropy. The theoretical claims attached to this construction are non-negativity of expected information gain, telescoping additivity across steps, and channel monotonicity under Blackwell dominance (Hu et al., 31 Jan 2026).
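The sampling-and-clustering procedure can be sketched in a few lines of Python. Several pieces are assumptions made for illustration: the `canonical` function stands in for entailment-based clustering, semantic entropy is taken as the Shannon entropy of the empirical class distribution, and the gain is taken as the change in log-probability of the correct class. The paper's exact estimator may differ.

```python
import math
from collections import Counter


def class_distribution(answers, canonical):
    """Map sampled answers to semantic classes and return empirical class probabilities."""
    counts = Counter(canonical(a) for a in answers)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def semantic_entropy(dist):
    """Shannon entropy (in nats) over semantic classes."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def information_gain(base_answers, retrieved_answers, correct_class, canonical, eps=1e-6):
    """ASSUMED gain: log-probability of the correct class after retrieval minus before."""
    p_base = class_distribution(base_answers, canonical).get(correct_class, 0.0)
    p_ret = class_distribution(retrieved_answers, canonical).get(correct_class, 0.0)
    return math.log(p_ret + eps) - math.log(p_base + eps)


# Toy stand-in for bidirectional entailment: case- and whitespace-insensitive match.
canonical = lambda a: a.strip().lower()
base = ["Paris", "Lyon", "paris ", "Nice"]       # samples without retrieved facts
retrieved = ["Paris", "paris", "Paris", "Lyon"]  # samples after retrieval

gain = information_gain(base, retrieved, "paris", canonical)
entropy_drop = semantic_entropy(class_distribution(base, canonical)) - semantic_entropy(
    class_distribution(retrieved, canonical)
)
```

With this form, a retrieval that shifts probability mass away from the correct class yields a negative value, so misleading evidence is penalized rather than ignored.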

STORM grounds intermediate reward directly in the retriever. For a partial generation, it maps completed token spans into BM25 terms and scores the resulting partial query with a document-level retrieval metric.

The retriever is queried at every generation step, low-reward continuations are pruned, and the document-level retrieval metric becomes token-level supervision for lexical query expansion (Satouf et al., 9 Jun 2026).
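A minimal sketch of retriever-grounded pruning follows. It is not STORM's implementation: a term-overlap ranker stands in for BM25, reciprocal rank of a known relevant document stands in for the retrieval metric, and the corpus, function names, and `keep` parameter are hypothetical.

```python
def overlap_rank(query_terms, corpus, relevant_id):
    """Rank documents by term overlap (toy stand-in for BM25); return the relevant doc's rank."""
    query = set(query_terms)
    ranked = sorted(corpus.items(), key=lambda kv: (-len(query & set(kv[1].split())), kv[0]))
    return [doc_id for doc_id, _ in ranked].index(relevant_id) + 1


def partial_query_reward(query_terms, corpus, relevant_id):
    """Reciprocal rank of the relevant document for the partial query."""
    return 1.0 / overlap_rank(query_terms, corpus, relevant_id)


def prune(partial_terms, candidates, corpus, relevant_id, keep=1):
    """Score each candidate continuation with the retriever and keep the best ones."""
    scored = [
        (partial_query_reward(partial_terms + [cand], corpus, relevant_id), cand)
        for cand in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for _, cand in scored[:keep]]


corpus = {
    "d1": "solar panel efficiency silicon",
    "d2": "wind turbine blade design",
    "d3": "solar eclipse viewing safety",
}
survivors = prune(["solar"], ["eclipse", "silicon", "viewing"], corpus, relevant_id="d1")
```

Here the continuation that keeps the relevant document at rank one survives, while the continuations that promote a distractor are pruned.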

GraphFlow treats knowledge-graph retrieval as a sequential decision process with only a terminal reward, then factorizes that outcome reward into intermediate states through a flow estimator and a learned process reward. Its detailed-balance condition and the resulting DBLE loss transform terminal supervision into local transition constraints. This allows the retrieval policy to sample trajectories in proportion to their reward without direct process-level labels (Yu et al., 18 Oct 2025).
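For orientation, the sketch below shows the standard detailed-balance objective from the generative flow network literature, in which a transition is balanced when $F(s)\,P_F(s' \mid s) = F(s')\,P_B(s \mid s')$ and the flow of a terminal state is tied to its reward. This is the generic form, offered as an assumption about the family of losses involved. GraphFlow's DBLE loss may differ in its details, and the function names are illustrative.

```python
import math


def detailed_balance_residual(log_flow_s, log_pf, log_flow_next, log_pb):
    """log F(s) + log P_F(s'|s) - log F(s') - log P_B(s|s'); zero when the transition is balanced."""
    return log_flow_s + log_pf - log_flow_next - log_pb


def detailed_balance_loss(transitions):
    """Mean squared residual over (log F(s), log P_F, log F(s'), log P_B) tuples."""
    residuals = [detailed_balance_residual(*t) for t in transitions]
    return sum(r * r for r in residuals) / len(residuals)


# Balanced: F(s)=2, P_F=0.5, F(s')=1, P_B=1, so 2 * 0.5 == 1 * 1.
balanced = (math.log(2.0), math.log(0.5), math.log(1.0), math.log(1.0))
# Unbalanced: the forward flow exceeds what the next state accounts for.
unbalanced = (math.log(4.0), math.log(0.5), math.log(1.0), math.log(1.0))

loss_balanced = detailed_balance_loss([balanced])
loss_unbalanced = detailed_balance_loss([unbalanced])
```

Because each residual involves only one transition, a terminal reward can constrain every intermediate state along a trajectory without step-level labels.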

A different supervision path appears in Reward-RAG. There, CriticGPT assigns relevance scores to query-document pairs, a reward model is trained to approximate those scores, and the reward model then synthesizes positive and hard-negative retrieval pairs for contrastive fine-tuning of the encoder. This is intermediate retrieval supervision in the retriever-training sense, rather than an online step reward in a multi-step search trajectory (Nguyen et al., 2024).

4. Credit assignment and optimization schemes

Once an intermediate retrieval reward has been defined, the central technical problem is how to propagate it to the tokens or actions that produced the retrieval behavior. Much of the recent literature uses GRPO or PPO-style objectives, but the reward enters those objectives in structurally different ways.

Bi-RAR trains separate forward and backward policies with GRPO and then interpolates them in weight space. The cascading reward structure suppresses the marginal value of later steps after an early step has already achieved high alignment, which is intended to reduce over-searching and verbose trajectories (Wei et al., 12 Nov 2025).

SubSearch aggregates its outcome reward and intrinsic process rewards by an adaptive residual rule. The intermediate term is multiplied by a factor that vanishes when the answer is correct, a design introduced to avoid penalizing correct answers for imperfect intermediate traces (Petcu et al., 8 Apr 2026).
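The effect of a residual gate can be seen in a small Python sketch. The gate `(1 - r_outcome)`, the `weight` parameter, and both function names are assumptions used to illustrate the behavior described above, not SubSearch's published formula.

```python
def adaptive_residual_reward(r_outcome, r_process, weight=0.5):
    """ASSUMED form: process reward only fills the gap left by an imperfect outcome.

    r_outcome and r_process are taken to lie in [0, 1].
    """
    return r_outcome + (1.0 - r_outcome) * weight * r_process


def weighted_sum_reward(r_outcome, r_process, weight=0.5):
    """Plain convex combination, shown for contrast."""
    return (1.0 - weight) * r_outcome + weight * r_process


# A correct answer reached through a poorly scored intermediate trace.
residual = adaptive_residual_reward(r_outcome=1.0, r_process=0.0)
summed = weighted_sum_reward(r_outcome=1.0, r_process=0.0)
```

In this sketch the correct answer keeps its full reward under the residual rule, while the weighted sum cuts it in half because of the low process score.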

TIRESRAG-R1 combines answer, sufficiency, thinking, and reflection rewards into a single training reward under a decreasing schedule.

It further uses sufficiency-based difficulty-aware reweighting and a consistency penalty to prevent trajectories from receiving high process rewards while still yielding poor final answers (He et al., 30 Jul 2025).

RICE-PO addresses a different asymmetry: executable summaries or queries can be scored directly by the retriever, while latent reasoning cannot. It selects high-entropy summaries as anchors, creates local counterfactual branches, computes local summary rewards, estimates local sensitivity and residual future variation, and propagates summary-level credit to the preceding reasoning span only when a gating condition on these estimates is satisfied.

This yields a critic-free PPO-style update in which reasoning tokens receive localized credit only when reasoning-to-action influence is strong and future residual effects are stable (Li et al., 25 May 2026).
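The gating logic can be sketched as a simple predicate over already-computed estimates. The threshold form, the parameter names, and the default values are hypothetical. The sketch only illustrates the stated principle that credit flows back when reasoning-to-action influence is strong and residual future effects are stable.

```python
def gate_summary_credit(
    summary_credit,
    sensitivity,
    residual_variation,
    min_sensitivity=0.5,  # hypothetical threshold
    max_residual=0.1,     # hypothetical threshold
):
    """Return the credit to assign to the preceding reasoning span, or 0.0 if gated out."""
    influence_is_strong = sensitivity >= min_sensitivity
    future_is_stable = residual_variation <= max_residual
    return summary_credit if (influence_is_strong and future_is_stable) else 0.0


propagated = gate_summary_credit(1.0, sensitivity=0.8, residual_variation=0.05)
weak_influence = gate_summary_credit(1.0, sensitivity=0.2, residual_variation=0.05)
unstable_future = gate_summary_credit(1.0, sensitivity=0.8, residual_variation=0.5)
```

Only the first case passes both checks, so reasoning tokens receive credit there and are left untouched in the other two.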

Process Reward Agents move the reward into inference rather than policy optimization. PRA uses a frozen policy, retrieves evidence $D_t$ for the partial trace, scores the current step $s_t$ against that evidence, and ranks beam-search trajectories by cumulative reward.

The resulting search procedure prunes trajectories online instead of rewarding them only after completion (Sohn et al., 10 Apr 2026).
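A reward-guided beam search of this kind can be sketched as follows. The callables `propose`, `retrieve`, and `score_step` are hypothetical interfaces standing in for the frozen policy, the retriever, and the step scorer. The ordering of retrieval and proposal within a step is also an assumption.

```python
from typing import Callable, List, Tuple

Trace = Tuple[str, ...]


def reward_guided_beam_search(
    propose: Callable[[Trace], List[str]],
    retrieve: Callable[[Trace], List[str]],
    score_step: Callable[[Trace, str, List[str]], float],
    max_steps: int,
    beam_width: int,
) -> Trace:
    """Keep the beam_width partial traces with the highest cumulative step reward."""
    beams: List[Tuple[float, Trace]] = [(0.0, ())]
    for _ in range(max_steps):
        expanded: List[Tuple[float, Trace]] = []
        for total, trace in beams:
            evidence = retrieve(trace)    # evidence for the current partial trace
            candidates = propose(trace)   # next steps from the frozen policy
            if not candidates:            # finished trace: carry it forward unchanged
                expanded.append((total, trace))
                continue
            for step in candidates:
                reward = score_step(trace, step, evidence)
                expanded.append((total + reward, trace + (step,)))
        # Online pruning: low-reward trajectories are dropped before completion.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]


# Toy components: the "evidence" supports only the step named "good".
best = reward_guided_beam_search(
    propose=lambda trace: ["good", "bad"],
    retrieve=lambda trace: ["good"],
    score_step=lambda trace, step, evidence: 1.0 if step in evidence else 0.0,
    max_steps=3,
    beam_width=2,
)
```

The policy is never updated in this loop, which mirrors the point that the reward acts at search time rather than through training.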

5. Applications and empirical behavior

In multi-hop RAG, intermediate retrieval reward is associated with consistent gains over outcome-only RL baselines. Bi-RAR reports higher average EM for Bi-RAR-Instruct than for Search-R1-Instruct, and for Bi-RAR-Base than for Search-R1-Base. The reported multi-hop gains are particularly large on 2Wiki for the Instruct model, and the method uses only a fraction of the training data used by Search-R1 (Wei et al., 12 Nov 2025).

TIRESRAG-R1 reports that the full system outperforms Search-R1-Instruct on all four multi-hop datasets listed in the paper: HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. The paper also states that removing the sufficiency reward causes the largest decline and can push performance below naive GRPO, which it interprets as evidence that the model otherwise engages in reward hacking and neglects crucial external documents (He et al., 30 Jul 2025).

InfoReasoner reports average accuracy for its 3B model against Search-R1-3B-Instruct, and a 7B result above the cited 7B baselines. The training analysis further reports that the method initially learns more slowly than Search-R1, then surpasses it, while generating shorter and more stable responses. This is consistent with the paper’s interpretation that semantic information gain supplies dense exploration guidance before final-answer reward becomes reliable (Hu et al., 31 Jan 2026).

In lexical retrieval, STORM converts BM25 feedback into token-level supervision. It reports average out-of-domain BEIR nDCG@10 for its 8B model against MuGI, W2P, and QUESTER; MIRACL average nDCG@10 against BM25 and the best dense baseline, mColBERT; and 8B generation latency against MuGI and W2P (Satouf et al., 9 Jun 2026).

In retrieval agents that alternate reasoning and executable summaries, RICE-PO reports BRIGHT macro-average NDCG@10 with DeepSeek-R1-Distill-Qwen-1.5B compared against GRPO and Tree-GRPO, and a result with Qwen3-4B-Thinking that is above all cited GRPO-family baselines and above Diver’s reported average under the same retriever family. On BEIR, it reports macro-average results with the 1.5B backbone and with Qwen2.5-3B-Instruct that are again above the listed group-based RL baselines (Li et al., 25 May 2026).

In domain-grounded reasoning, PRA reports its MedQA accuracy with Qwen3-4B and states that it improves unseen frozen policies across a range of parameter scales without any policy updates. The same paper reports an average gain across seven medical benchmarks over its strongest baseline configuration (Sohn et al., 10 Apr 2026).

For knowledge-graph retrieval, GraphFlow reports that it outperforms strong KG-RAG baselines, including GPT-4o, on average in hit rate and recall on the STaRK benchmark, and the detailed tables in the paper show large gains in hit rate, MRR, recall, de-duplicated recall, and SePer-based retrieval utility over SFT and PRM baselines (Yu et al., 18 Oct 2025). DynaSearcher reports state-of-the-art answer accuracy on six multi-hop question answering datasets while using small-scale models and limited computational resources, and attributes this to dynamic knowledge graphs combined with multi-reward reinforcement learning over retrieval accuracy, efficiency, and response quality (Hao et al., 23 Jul 2025).

A common misconception is that any reward model used in RAG is an intermediate retrieval reward. The literature does not support that equivalence. "RAG-Reward" is explicitly generation-focused and outcome-based; its reward model judges final responses under fixed retrieval context and does not optimize retrieval actions, passages, or intermediate stages (Zhang et al., 22 Jan 2025). "Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling" introduces token-level temporal coherence and interpretable intermediate reward trajectories, but the training signal remains outcome-only and is not retrieval-specific (Nikulkov, 24 Apr 2026). "Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning" improves step-level reward estimation under step OOD and question OOD through retrieval over similar questions and steps, yet again this is retrieval-augmented evaluation, not an intermediate reward over retrieval decisions (Zhu et al., 20 Feb 2025).

A second misconception is that denser intermediate reward automatically guarantees better global behavior. The broader reinforcement-learning theory of intermediate rewards shows otherwise. In the one-way single-path setting, intermediate rewards reduce the number of synchronous value iterations needed to reach a successful policy to the maximum segment length between successive checkpoints, while preserving shortest-path behavior. In the one-way multi-path setting, however, the same reward can create a trade-off: learning becomes computationally cheaper, but the greedy policy may no longer follow the shortest path, and if intermediate states are not one-way the agent may prefer looping on the intermediate reward instead of reaching the goal (Zhai et al., 2021).
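The looping failure mode can be reproduced on a toy problem. The sketch below is an illustrative construction, not an example taken from Zhai et al.: a four-state chain with a terminal goal at state 3, an intermediate reward paid every time state 1 is entered, and synchronous value iteration with discount `gamma`. All names and constants are assumptions.

```python
def greedy_policy(mid_reward, gamma=0.9, sweeps=500):
    """Run synchronous value iteration on a 4-state chain and return the greedy policy."""
    n_states, goal, mid = 4, 3, 1
    actions = {"left": -1, "right": +1}

    def step(state, action):
        nxt = min(max(state + actions[action], 0), goal)
        if nxt == goal:
            return nxt, 1.0                      # terminal goal reward
        return nxt, (mid_reward if nxt == mid else 0.0)  # re-enterable intermediate reward

    values = [0.0] * n_states                    # the goal is terminal, so its value stays 0
    for _ in range(sweeps):
        new_values = values[:]
        for s in range(goal):
            new_values[s] = max(
                reward + gamma * values[nxt]
                for nxt, reward in (step(s, a) for a in actions)
            )
        values = new_values

    policy = {}
    for s in range(goal):
        policy[s] = max(actions, key=lambda a: step(s, a)[1] + gamma * values[step(s, a)[0]])
    return policy


without_intermediate = greedy_policy(mid_reward=0.0)
with_intermediate = greedy_policy(mid_reward=1.0)
```

Without the intermediate reward, every state heads right toward the goal. With a re-enterable intermediate reward of the same size, the greedy action at state 2, which sits next to the goal, becomes a move back toward state 1, because repeatedly collecting the intermediate reward is worth more than finishing.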

Recent retrieval work states similar concerns in retrieval-specific terms. Bi-RAR motivates bidirectional step rewards by pointing to reward hacking and degraded response quality under outcome-only supervision (Wei et al., 12 Nov 2025). SubSearch reports that a simple weighted sum of answer and intermediate rewards can penalize correct answers when intermediate rewards are low, and therefore favors residual or adaptive residual aggregation (Petcu et al., 8 Apr 2026). InfoReasoner notes that realized information gain can be negative when retrieval is misleading or confusing, and treats this as a desirable penalty signal rather than a violation of the framework (Hu et al., 31 Jan 2026).

The diversity of formulations also indicates that intermediate retrieval reward is not a singular object. It may mean a per-step distance-based score, a binary sufficiency judgment over retrieved evidence, a semantic information-gain increment, a BM25-based token-level pruning signal, a reward factorization over latent retrieval states, or a search-time step scorer over a frozen policy. This suggests that the term is best understood as a family of credit-assignment mechanisms for retrieval-centric reasoning, unified by the attempt to evaluate what the retrieval process is doing before the final answer is known.