
Intermediate Retrieval Reward

Updated 4 July 2026
  • Intermediate retrieval reward is a mechanism that assigns credit to intermediate retrieval steps, queries, or evidence sets in multi-hop reasoning.
  • It separates process-level evaluation from outcome-only rewards to mitigate issues like reward hacking and redundant search.
  • Various formulations (e.g., distance-based signals, sufficiency judgments, and semantic information gain) offer practical strategies for optimizing retrieval in RAG.

Intermediate retrieval reward denotes a reward signal attached to retrieval-relevant intermediate states, reasoning steps, executable summaries, subqueries, or evidence sets in retrieval-augmented reasoning, rather than only to the final answer or final response. In recent work on retrieval-augmented generation, agentic retrieval, knowledge-graph retrieval, lexical query expansion, and retrieval-guided reasoning, the motivating diagnosis is consistent: outcome-only supervision leaves intermediate think-and-search behavior unobserved, weakens credit assignment, and can encourage reward hacking, degraded response quality, redundant search, or reasoning drift (Zhang et al., 23 May 2025, Wei et al., 12 Nov 2025, Li et al., 25 May 2026, He et al., 30 Jul 2025).

1. Scope, definition, and boundaries

In the contemporary literature, intermediate retrieval reward is defined operationally rather than by a single canonical formula. The rewarded object may be a reasoning step $T_i$, an intermediate query, a partial lexical expansion, a retrieval trajectory on a knowledge graph, a sufficiency judgment over the accumulated evidence, or an executable summary that becomes the next query. What unifies these formulations is that the reward is computed before, or independently of, the final task outcome, and is intended to localize credit for retrieval behavior that would otherwise be absorbed into a single terminal scalar.

This distinguishes intermediate retrieval reward from outcome-only reward modeling for RAG. "RAG-Reward" explicitly defines its reward model over final generated responses conditioned on a fixed retrieval context and states that it does not define rewards for retrieval actions, per-step generation, or intermediate retrieval stages (Zhang et al., 22 Jan 2025). The same boundary is important for "LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization": its abstract states that a process-level reward module is designed to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation, but the supplied text does not include the method’s precise definitions, reward functions, algorithms, or experiments, so its exact notion of intermediate retrieval reward is not recoverable here (Zhang et al., 23 May 2025).

A further boundary concerns process reward models that use retrieval to improve evaluation rather than to reward retrieval actions themselves. "Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning" augments step-level PRM scoring with retrieval over similar questions and steps, but the retrieved context is used to improve intermediate reward estimation for mathematical reasoning, not to define a retrieval-action reward in a search environment (Zhu et al., 20 Feb 2025).

2. Principal reward formulations

The literature contains several recurring formulations, each tied to a different unit of control in the retrieval process.

| Framework | Rewarded unit | Signal form |
|---|---|---|
| Bi-RAR (Wei et al., 12 Nov 2025) | Reasoning step $T_i$ | Bidirectional step rewards $r_i^{\text{T-A}}$, $r_i^{\text{T-Q}}$ |
| DynaSearcher (Hao et al., 23 Jul 2025) | Search trajectory | $r_{\text{gain}}=\alpha \cdot (r_{\text{recall}}-r_{\text{penalty}})$ |
| TIRESRAG-R1 (He et al., 30 Jul 2025) | Full reasoning+retrieval trace $RD$ | Binary sufficiency reward $R^S$ |
| SubSearch (Petcu et al., 8 Apr 2026) | Subquery / decomposition | Answerability and decomposition rewards |
| InfoReasoner (Hu et al., 31 Jan 2026) | Retrieval action at step $t$ | Synthetic semantic information gain |
| PRA (Sohn et al., 10 Apr 2026) | Current reasoning step $s_t$ with evidence $D_t$ | Step score given retrieved evidence |

A representative stepwise formulation appears in Bi-RAR. For each step $T_i$, it defines a step-to-answer distance and a step-to-question distance, then converts them into the step rewards $r_i^{\text{T-A}}$ and $r_i^{\text{T-Q}}$.

These step rewards are aggregated into two cascading trajectory rewards, each gated by final correctness, so that early high-quality steps dominate later ones and long padded trajectories are discounted (Wei et al., 12 Nov 2025).

A trajectory-level but retrieval-specific formulation appears in DynaSearcher. It defines a retrieval recall $r_{\text{recall}}$ and a retrieval penalty $r_{\text{penalty}}$, whose explicit definitions are not recoverable here, and combines them as

$$r_{\text{gain}} = \alpha \cdot (r_{\text{recall}} - r_{\text{penalty}})$$

The definitions are stated in terms of the number of retrieval actions and the annotated number of hops, so the reward explicitly trades off evidence coverage against redundant search (Hao et al., 23 Jul 2025).
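The trade-off between coverage and redundant search can be illustrated with a minimal sketch. Only the outer form $r_{\text{gain}}=\alpha \cdot (r_{\text{recall}}-r_{\text{penalty}})$ comes from the text above. The recall and penalty definitions used here (fraction of gold evidence retrieved, and excess searches relative to the annotated hop count) are assumptions for illustration, as are the function and parameter names.

```python
def gain_reward(retrieved_ids, gold_ids, num_searches, num_hops, alpha=0.5):
    """Sketch of r_gain = alpha * (r_recall - r_penalty).

    The recall and penalty forms below are illustrative assumptions,
    not DynaSearcher's published definitions.
    """
    gold = set(gold_ids)
    # Assumption: recall is the fraction of gold evidence retrieved at least once.
    r_recall = len(gold & set(retrieved_ids)) / len(gold) if gold else 0.0
    # Assumption: the penalty grows with searches beyond the annotated hop
    # count and is capped at 1.
    excess = max(0, num_searches - num_hops)
    r_penalty = min(1.0, excess / max(1, num_hops))
    return alpha * (r_recall - r_penalty)


# Full coverage with no extra searches earns the maximum reward alpha.
print(gain_reward(["a", "b", "x"], ["a", "b"], num_searches=2, num_hops=2, alpha=1.0))
# Half coverage with twice the annotated number of searches is penalized.
print(gain_reward(["a"], ["a", "b"], num_searches=4, num_hops=2, alpha=0.5))
```

Under these assumptions, a trajectory that keeps searching after the evidence is already covered sees its reward fall even though recall stays constant.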

TIRESRAG-R1 defines the core retrieval-related signal as a binary sufficiency reward $R^S$ over the entire reasoning-and-retrieval trace $RD$. The judgment is made by a locally deployed LLM that checks whether the gold answer can be inferred from the retrieved context alone, regardless of whether the model’s own final answer is correct. This makes sufficiency a retrieval-quality reward rather than an answer-only reward (He et al., 30 Jul 2025).

SubSearch replaces external process supervision with intrinsic rewards derived from the agent and retriever. It defines a subquery answerability reward and a decomposition reward that combines semantic coverage and in-group splitability. Together these directly reward well-targeted subqueries and decompositions that are collectively exhaustive and minimally redundant (Petcu et al., 8 Apr 2026).

InfoReasoner formulates intermediate retrieval reward as uncertainty reduction. Its practical reward is a synthetic semantic information gain, defined with respect to the correct semantic class, a fact-free baseline context, and the context after retrieval. The rewarded event is not mere document relevance but epistemic progress toward the correct answer (Hu et al., 31 Jan 2026).

3. Estimation mechanisms and reward grounding

Intermediate retrieval reward is grounded by several distinct estimation strategies. Bi-RAR uses an information-theoretic construction based on conditional normalized information distance derived from Kolmogorov complexity and approximated by language-model log-likelihoods. This turns each step into an information-bearing object that can be evaluated both toward the answer and back to the question (Wei et al., 12 Nov 2025).

InfoReasoner instead estimates belief change from model outputs. It samples answers under a baseline context and a retrieval-augmented context, clusters them into semantic equivalence classes using bidirectional textual entailment, defines semantic entropy over the resulting class distribution, and interprets retrieval as valuable when it concentrates probability mass onto the correct semantic class or lowers semantic entropy. The theoretical claims attached to this construction are non-negativity of expected information gain, telescoping additivity across steps, and channel monotonicity under Blackwell dominance (Hu et al., 31 Jan 2026).
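The sample-cluster-compare procedure can be sketched as follows. This is an illustrative reading rather than InfoReasoner's implementation. The entailment-based clustering is replaced by a caller-supplied `equivalent` predicate, semantic entropy is taken to be Shannon entropy over cluster frequencies, and the gain is assumed to be the log-ratio of probability mass on the correct class. All function names and the smoothing constant `eps` are hypothetical.

```python
import math
from collections import Counter


def cluster_labels(answers, equivalent):
    """Greedy clustering: an answer joins the first cluster whose representative it matches."""
    representatives, labels = [], []
    for answer in answers:
        for index, rep in enumerate(representatives):
            if equivalent(answer, rep):
                labels.append(index)
                break
        else:
            representatives.append(answer)
            labels.append(len(representatives) - 1)
    return labels


def semantic_entropy(answers, equivalent):
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters."""
    labels = cluster_labels(answers, equivalent)
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def information_gain(baseline_answers, retrieved_answers, gold, equivalent, eps=1e-6):
    """Assumed gain: log-ratio of mass on the correct class, after versus before retrieval."""
    def p_correct(answers):
        return sum(equivalent(a, gold) for a in answers) / len(answers)
    return math.log(p_correct(retrieved_answers) + eps) - math.log(p_correct(baseline_answers) + eps)


# Stand-in for bidirectional entailment: case-insensitive string match.
same = lambda a, b: a.strip().lower() == b.strip().lower()
before = ["Lyon", "Paris", "Rome", "Nice"]
after = ["Paris", "paris", "Paris", "Lyon"]
print(semantic_entropy(before, same), semantic_entropy(after, same))
print(information_gain(before, after, "Paris", same))
```

In this toy case retrieval both lowers entropy and moves mass onto the correct class, so the gain is positive. Swapping `before` and `after` yields a negative gain, which matches the observation that misleading retrieval can be penalized.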

STORM grounds intermediate reward directly in the retriever. For a partial generation, it maps completed token spans into BM25 terms and scores the resulting partial query with the retriever. The retriever is queried at every generation step, low-reward continuations are pruned, and the document-level retrieval metric becomes token-level supervision for lexical query expansion (Satouf et al., 9 Jun 2026).
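A toy version of retriever-grounded pruning is sketched below. It is not STORM's scoring function. It uses a plain Okapi BM25 over pre-tokenized documents and assumes, for illustration only, that the reward of a partial query is the reciprocal rank of a known relevant document. The names `partial_query_reward` and `prune` are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each pre-tokenized document for the given query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def partial_query_reward(query_terms, docs, relevant_idx):
    """Assumed reward: reciprocal rank of the relevant document, counting ties against it."""
    scores = bm25_scores(query_terms, docs)
    if scores[relevant_idx] == 0.0:
        return 0.0  # the partial query does not match the relevant document at all
    rank = sum(s >= scores[relevant_idx] for s in scores)
    return 1.0 / rank


def prune(candidates, docs, relevant_idx, keep=2):
    """Keep the `keep` partial queries with the highest retriever-derived reward."""
    ranked = sorted(candidates, key=lambda q: partial_query_reward(q, docs, relevant_idx), reverse=True)
    return ranked[:keep]


docs = [["solar", "panel", "efficiency"], ["wind", "turbine", "noise"], ["solar", "eclipse", "date"]]
candidates = [["solar"], ["solar", "panel"], ["wind"]]
print(prune(candidates, docs, relevant_idx=0, keep=1))
```

The continuation that adds a discriminating term survives pruning, while the ambiguous and off-topic continuations are dropped, which is the sense in which a document-level metric can supervise individual expansion steps.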

GraphFlow treats knowledge-graph retrieval as a sequential decision process with only a terminal reward, then factorizes that outcome reward into intermediate states through a flow estimator and a learned process reward. Its detailed-balance condition and the resulting DBLE loss transform terminal supervision into local transition constraints. This allows the retrieval policy to sample trajectories in proportion to their reward without direct process-level labels (Yu et al., 18 Oct 2025).
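For orientation, the generic GFlowNet detailed-balance constraint can be written as $F(s)\,P_F(s' \mid s) = F(s')\,P_B(s \mid s')$, and a squared log-space residual turns it into a per-transition loss. The sketch below implements only this generic form as an assumption. It is not GraphFlow's DBLE loss, whose exact terms are not given here, and the function names are hypothetical.

```python
import math


def detailed_balance_residual(log_flow_s, log_p_forward, log_flow_next, log_p_backward):
    """Log-space residual of F(s) * P_F(s'|s) = F(s') * P_B(s|s')."""
    return (log_flow_s + log_p_forward) - (log_flow_next + log_p_backward)


def detailed_balance_loss(transitions):
    """Mean squared residual over (log F(s), log P_F, log F(s'), log P_B) tuples.

    For a terminal s', log F(s') would be replaced by the log of the terminal
    reward, which is how outcome supervision reaches intermediate transitions.
    """
    residuals = [detailed_balance_residual(*t) for t in transitions]
    return sum(r * r for r in residuals) / len(residuals)


# A balanced transition: F(s)=4, P_F=0.5, F(s')=2, P_B=1.
balanced = (math.log(4.0), math.log(0.5), math.log(2.0), math.log(1.0))
# An unbalanced one: too much flow is pushed forward.
unbalanced = (math.log(4.0), math.log(1.0), math.log(2.0), math.log(1.0))
print(detailed_balance_loss([balanced]), detailed_balance_loss([unbalanced]))
```

Minimizing such residuals on every transition is what lets a terminal-only reward act as a set of local constraints.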

A different supervision path appears in Reward-RAG. There, CriticGPT assigns relevance scores to query-document pairs, a reward model is trained to approximate those scores, and the reward model then synthesizes positive and hard-negative retrieval pairs for contrastive fine-tuning of the encoder. This is intermediate retrieval supervision in the retriever-training sense, rather than an online step reward in a multi-step search trajectory (Nguyen et al., 2024).

4. Credit assignment and optimization schemes

Once an intermediate retrieval reward has been defined, the central technical problem is how to propagate it to the tokens or actions that produced the retrieval behavior. Much of the recent literature uses GRPO or PPO-style objectives, but the reward enters those objectives in structurally different ways.

Bi-RAR trains separate forward and backward policies with GRPO and then interpolates them in weight space. The cascading reward structure suppresses the marginal value of later steps after an early step has already achieved high alignment, which is intended to reduce over-searching and verbose trajectories (Wei et al., 12 Nov 2025).
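Weight-space interpolation of two policies can be sketched in a few lines. The linear form $\theta = \lambda\,\theta_{\text{fwd}} + (1-\lambda)\,\theta_{\text{bwd}}$ and the coefficient name `lam` are assumptions, since the exact rule is not reproduced here. Parameters are represented as plain lists of floats keyed by name so the example stays dependency-free.

```python
def interpolate_weights(theta_fwd, theta_bwd, lam=0.5):
    """Assumed linear interpolation: lam * theta_fwd + (1 - lam) * theta_bwd, per parameter."""
    if theta_fwd.keys() != theta_bwd.keys():
        raise ValueError("both policies must share the same parameter names")
    merged = {}
    for name in theta_fwd:
        if len(theta_fwd[name]) != len(theta_bwd[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * a + (1 - lam) * b for a, b in zip(theta_fwd[name], theta_bwd[name])]
    return merged


forward_policy = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
backward_policy = {"layer.weight": [2.0, 4.0], "layer.bias": [3.0]}
print(interpolate_weights(forward_policy, backward_policy, lam=0.5))
```

Setting `lam` to 1 or 0 recovers the forward or backward policy unchanged, so the coefficient controls how much each reward direction shapes the merged model.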

SubSearch aggregates its outcome reward and intrinsic process rewards by an adaptive residual rule. Because the intermediate term is multiplied by a factor that is zero when the answer is correct, it vanishes in that case, a design introduced to avoid penalizing correct answers for imperfect intermediate traces (Petcu et al., 8 Apr 2026).
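The difference between residual aggregation and a plain weighted sum can be made concrete. The residual form used below, $R = r_{\text{ans}} + (1 - r_{\text{ans}})\,r_{\text{int}}$, is an assumed instance of a rule whose intermediate term vanishes for correct answers. It is not claimed to be SubSearch's exact formula, and the adaptive component is omitted.

```python
def weighted_sum_reward(r_answer, r_intermediate, weight=0.5):
    """Baseline: convex combination of answer and intermediate rewards."""
    return weight * r_answer + (1 - weight) * r_intermediate


def residual_reward(r_answer, r_intermediate):
    """Assumed residual rule: the intermediate term is scaled by (1 - r_answer)."""
    return r_answer + (1 - r_answer) * r_intermediate


# A correct answer (r_answer = 1) reached through a weak intermediate trace.
print(weighted_sum_reward(1.0, 0.2))  # pulled below 1 by the weak trace
print(residual_reward(1.0, 0.2))      # stays at 1
# A wrong answer (r_answer = 0) still receives graded process credit.
print(residual_reward(0.0, 0.2))
```

With rewards in $[0, 1]$, the residual rule keeps the total in $[0, 1]$ and uses intermediate signal only to differentiate among trajectories that did not fully succeed.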

TIRESRAG-R1 combines answer, sufficiency, thinking, and reflection rewards into a single training reward with a decreasing schedule. It further uses sufficiency-based difficulty-aware reweighting and a consistency penalty to prevent trajectories from receiving high process rewards while still yielding poor final answers (He et al., 30 Jul 2025).
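One way to read a decreasing schedule is as a weight on the process terms that decays over training, so that dense signals dominate early and the answer reward dominates late. The sketch below assumes a linear decay applied to the sum of the sufficiency, thinking, and reflection rewards. The functional form, the function names, and the default endpoints are illustrative assumptions rather than the paper's definitions.

```python
def process_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Assumed linear decay of the process-reward weight over training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * fraction


def combined_reward(r_answer, r_sufficiency, r_thinking, r_reflection, step, total_steps):
    """Answer reward plus schedule-weighted process rewards (assumed form)."""
    w = process_weight(step, total_steps)
    return r_answer + w * (r_sufficiency + r_thinking + r_reflection)


for step in (0, 50, 100):
    print(step, combined_reward(1.0, 1.0, 0.5, 0.5, step=step, total_steps=100))
```

Under this assumption the same trajectory is worth less process credit as training proceeds, which limits the scope for collecting process reward without improving final answers.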

RICE-PO addresses a different asymmetry: executable summaries or queries can be scored directly by the retriever, while latent reasoning cannot. It selects high-entropy summaries as anchors, creates local counterfactual branches, computes local summary rewards, estimates local sensitivity and residual future variation, and propagates summary-level credit to the preceding reasoning span only when a gating condition on these two estimates holds. This yields a critic-free PPO-style update in which reasoning tokens receive localized credit only when reasoning-to-action influence is strong and future residual effects are stable (Li et al., 25 May 2026).
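The gating logic can be sketched as a threshold test on the two estimates. The thresholds `min_sensitivity` and `max_residual`, the function name, and the choice to broadcast a single scalar credit over the reasoning span are all assumptions for illustration. The paper's actual condition and estimators are not reproduced here.

```python
def gated_reasoning_credit(summary_advantage, sensitivity, residual_variation,
                           num_reasoning_tokens, min_sensitivity=0.5, max_residual=0.2):
    """Pass summary-level credit back to reasoning tokens only when the gate is open.

    Assumed gate: reasoning-to-action influence is strong (high sensitivity)
    and downstream effects are stable (low residual variation).
    """
    gate_open = sensitivity >= min_sensitivity and residual_variation <= max_residual
    credit = summary_advantage if gate_open else 0.0
    return [credit] * num_reasoning_tokens


print(gated_reasoning_credit(1.5, sensitivity=0.8, residual_variation=0.1, num_reasoning_tokens=3))
print(gated_reasoning_credit(1.5, sensitivity=0.2, residual_variation=0.1, num_reasoning_tokens=3))
```

When the gate is closed the reasoning span receives no localized credit from the summary, so noisy or weakly coupled summary rewards do not leak into the reasoning tokens.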

Process Reward Agents move the reward into inference rather than policy optimization. PRA uses a frozen policy, retrieves evidence $D_t$ for a partial trace, scores the current step $s_t$ against that evidence, and ranks beam-search trajectories by cumulative reward. The resulting search procedure prunes trajectories online instead of rewarding them only after completion (Sohn et al., 10 Apr 2026).
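Reward-guided beam search over a frozen policy can be sketched generically. The callables `propose`, `retrieve`, `score_step`, and `is_done` are hypothetical stand-ins for the frozen policy, the retriever, the step scorer, and a termination check. The additive cumulative reward is an assumption, and the toy components at the bottom exist only to make the example runnable.

```python
def reward_guided_beam_search(propose, retrieve, score_step, is_done, beam_width=2, max_steps=4):
    """Keep the `beam_width` partial traces with the highest cumulative step reward."""
    beams = [((), 0.0)]  # (trace as a tuple of steps, cumulative reward)
    for _ in range(max_steps):
        candidates = []
        for trace, total in beams:
            if trace and is_done(trace):
                candidates.append((trace, total))  # finished traces are carried forward
                continue
            for step in propose(trace):  # frozen policy proposes next steps
                evidence = retrieve(trace, step)  # evidence for this partial trace
                candidates.append((trace + (step,), total + score_step(trace, step, evidence)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
        if all(is_done(trace) for trace, _ in beams):
            break
    return beams[0]


# Toy components: steps supported by the evidence score +1, unsupported steps score -1.
propose = lambda trace: ["good", "bad"] if len(trace) < 2 else ["answer"]
retrieve = lambda trace, step: {"good", "answer"}
score_step = lambda trace, step, evidence: 1.0 if step in evidence else -1.0
is_done = lambda trace: bool(trace) and trace[-1] == "answer"

print(reward_guided_beam_search(propose, retrieve, score_step, is_done))
```

Because scoring happens at every expansion, unsupported branches fall out of the beam before they are completed, and the policy itself is never updated.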

5. Applications and empirical behavior

In multi-hop RAG, intermediate retrieval reward is associated with consistent gains over outcome-only RL baselines. Bi-RAR reports higher average EM for Bi-RAR-Instruct than for Search-R1-Instruct, and for Bi-RAR-Base than for Search-R1-Base. The reported multi-hop gains are particularly large on 2Wiki, where Instruct EM rises by a margin the paper also states in relative terms, while the method uses only a fraction of the training data used by Search-R1 (Wei et al., 12 Nov 2025).

TIRESRAG-R1 reports that the full system outperforms Search-R1-Instruct on all four multi-hop datasets listed in the paper: HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. The paper also states that removing the sufficiency reward causes the largest decline and can push performance below naive GRPO, which it interprets as evidence that the model otherwise engages in reward hacking and neglects crucial external documents (He et al., 30 Jul 2025).

InfoReasoner reports average accuracy for its 3B model against Search-R1-3B-Instruct, and for its 7B model, which is above the cited 7B baselines. The training analysis further reports that the method initially learns more slowly than Search-R1, then surpasses it, while generating shorter and more stable responses. This is consistent with the paper’s interpretation that semantic information gain supplies dense exploration guidance before final-answer reward becomes reliable (Hu et al., 31 Jan 2026).

In lexical retrieval, STORM converts BM25 feedback into token-level supervision and reports average out-of-domain BEIR nDCG@10 for its 8B model against MuGI, W2P, and QUESTER. It also reports MIRACL average nDCG@10 against BM25 and the best dense baseline, mColBERT, and compares its 8B generation latency with that of MuGI and W2P (Satouf et al., 9 Jun 2026).

In retrieval agents that alternate reasoning and executable summaries, RICE-PO reports BRIGHT macro-average NDCG@10 with DeepSeek-R1-Distill-Qwen-1.5B against GRPO and Tree-GRPO, and with Qwen3-4B-Thinking, where it is above all cited GRPO-family baselines and above Diver’s reported average under the same retriever family. On BEIR, it reports macro-averages with the 1.5B backbone and with Qwen2.5-3B-Instruct that are again above the listed group-based RL baselines (Li et al., 25 May 2026).

In domain-grounded reasoning, PRA reports its accuracy on MedQA with Qwen3-4B and states that it improves unseen frozen policies across a range of parameter scales without any policy updates. The same paper reports an average gain across seven medical benchmarks over its strongest baseline configuration (Sohn et al., 10 Apr 2026).

For knowledge-graph retrieval, GraphFlow reports that it outperforms strong KG-RAG baselines, including GPT-4o, on average in hit rate and recall on the STaRK benchmark, and the detailed tables in the paper show large gains in hit rate, MRR, recall, de-duplicated recall, and SePer-based retrieval utility over SFT and PRM baselines (Yu et al., 18 Oct 2025). DynaSearcher reports state-of-the-art answer accuracy on six multi-hop question answering datasets while using small-scale models and limited computational resources, and attributes this to dynamic knowledge graphs combined with multi-reward reinforcement learning over retrieval accuracy, efficiency, and response quality (Hao et al., 23 Jul 2025).

A common misconception is that any reward model used in RAG is an intermediate retrieval reward. The literature does not support that equivalence. "RAG-Reward" is explicitly generation-focused and outcome-based; its reward model judges final responses under fixed retrieval context and does not optimize retrieval actions, passages, or intermediate stages (Zhang et al., 22 Jan 2025). "Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling" introduces token-level temporal coherence and interpretable intermediate reward trajectories, but the training signal remains outcome-only and is not retrieval-specific (Nikulkov, 24 Apr 2026). "Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning" improves step-level reward estimation under step OOD and question OOD through retrieval over similar questions and steps, yet again this is retrieval-augmented evaluation, not an intermediate reward over retrieval decisions (Zhu et al., 20 Feb 2025).

A second misconception is that denser intermediate reward automatically guarantees better global behavior. The broader reinforcement-learning theory of intermediate rewards shows otherwise. In the one-way single-path setting, intermediate rewards reduce the number of synchronous value iterations needed to reach a successful policy to the maximum segment length between successive checkpoints, while preserving shortest-path behavior. In the one-way multi-path setting, however, the same reward can create a trade-off: learning becomes computationally cheaper, but the greedy policy may no longer follow the shortest path. If intermediate states are not one-way, the agent may prefer looping on the intermediate reward instead of reaching the goal (Zhai et al., 2021).

Recent retrieval work states similar concerns in retrieval-specific terms. Bi-RAR motivates bidirectional step rewards by pointing to reward hacking and degraded response quality under outcome-only supervision (Wei et al., 12 Nov 2025). SubSearch reports that a simple weighted sum of answer and intermediate rewards can penalize correct answers when intermediate rewards are low, and therefore favors residual or adaptive residual aggregation (Petcu et al., 8 Apr 2026). InfoReasoner notes that realized information gain can be negative when retrieval is misleading or confusing, and treats this as a desirable penalty signal rather than a violation of the framework (Hu et al., 31 Jan 2026).

The diversity of formulations also indicates that intermediate retrieval reward is not a singular object. It may mean a per-step distance-based score, a binary sufficiency judgment over retrieved evidence, a semantic information-gain increment, a BM25-based token-level pruning signal, a reward factorization over latent retrieval states, or a search-time step scorer over a frozen policy. This suggests that the term is best understood as a family of credit-assignment mechanisms for retrieval-centric reasoning, unified by the attempt to evaluate what the retrieval process is doing before the final answer is known.
