Retrieval-Aware Policy Optimization

Updated 4 July 2026

Retrieval-aware policy optimization uses retrieval-conditioned feedback to optimize downstream metrics (e.g., macro-F1) rather than literal exact-match retrieval.
The surveyed methods include pre-action policy classification, sequential decision-making for adaptive stopping, and hybrid on/off-policy exploration to balance accuracy and cost.
Reported results indicate that adaptive retrieval strategies and embedding retrieval directly into the policy loop improve decision quality on the benchmarks studied.

Retrieval-aware policy optimization denotes a family of methods in which retrieval is optimized, evaluated, or embedded according to its contribution to a downstream policy objective rather than treated as a fixed preprocessing stage or judged only by proxy retrieval metrics. In recent work, this idea appears in several forms: pre-action policy classification that replaces gold policy clauses with retrieved clauses at test time (Ding et al., 22 Jun 2026); retrieval-augmented generation with an explicit retrieval-relevance term inside preference optimization (Yan et al., 23 Jan 2025); adaptive document or policy-chunk selection formulated as sequential decision-making (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025); sequence-level reinforcement learning for multi-turn retrieval agents (Pan et al., 15 Jan 2026); hybrid on-policy/off-policy exploration for agentic reasoning (Zhang et al., 3 Mar 2026); semi-parametric imitation policies that retrieve expert neighbors at inference (Pfeifer et al., 8 Jun 2026); and continuous generative retrieval policies aligned to an online intersection metric via HPPO (Liu et al., 25 Jun 2026). Across these settings, the common pattern is optimization against end-task reward, action quality, or decision accuracy, rather than exact-match retrieval alone.

1. Conceptual scope and canonical formulations

A concise statement of the paradigm is given in work on long-horizon tool-use agents: instead of maximizing exact-match recall of a “gold” policy clause and hoping that this correlates with downstream decision quality, retrieval-aware policy optimization trains or evaluates the retriever by how well the downstream classifier performs when conditioned on retrieved rather than gold clauses (Ding et al., 22 Jun 2026). The same shift appears in RAG, where the policy is optimized not only for answer preference but also for an implicit representation of retrieval relevance (Yan et al., 23 Jan 2025), and in adaptive retrieval systems, where the retrieval process itself is the policy and receives reward for balancing correctness against retrieval cost (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025).

The resulting design space is broad but structurally coherent.

Setting	Retrieval decision	Optimized quantity
Pre-action policy classification	Retrieved policy clause	Macro-F1
Prior authorization	Select chunk or STOP	Accuracy and retrieval cost
Academic paper search	Search/Expand tool calls	Discounted sequence return
Retrieval-augmented generation	Retrieval-conditioned response	Preference loss with retrieval term
Imitation learning	Retrieve expert neighbors	Behavior-cloning objective
Generative retrieval	Generate query embeddings	Intersection density / Joint@K

Two canonical objectives illustrate the contrast. A recall-based retriever is optimized as

$\theta_{\mathrm{recall}}=\arg\max_\theta \mathbb{E}_{(s,c_{\mathrm{gold}})}\big[1_{c_{\mathrm{gold}}\in \mathrm{Top}\text{-}k(p_\theta(\cdot|s))}\big],$

whereas retrieval-aware optimization for policy classification is written as

$\theta^*=\arg\min_\theta \mathbb{E}_{(s,a^*)\sim D}\big[L_{\mathrm{cls}}(f_\phi(s,c\sim p_\theta(\cdot|s)),a^*)\big].$

This formal distinction is explicit in (Ding et al., 22 Jun 2026). In related work, the same principle is realized through preference losses, value-based offline RL, sequence-level PPO variants, and semi-parametric architectures rather than through recall objectives alone (Yan et al., 23 Jan 2025, Sharifullin et al., 6 Apr 2026, Pan et al., 15 Jan 2026, Pfeifer et al., 8 Jun 2026, Liu et al., 25 Jun 2026).
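The contrast between the two objectives can be illustrated with a toy evaluation. The sketch below is not taken from any of the cited papers: the clause ids, the `CLAUSE_ACTION` table, and `toy_classifier` are hypothetical, and plain accuracy stands in for the classification loss $L_{\mathrm{cls}}$. It only shows how a retriever can score poorly on exact-match recall while the classifier conditioned on its top-1 clause still recovers the right action.

```python
from typing import Callable, Dict, List, Sequence

def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of states whose gold clause id appears among the top-k retrieved ids."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

def downstream_accuracy(
    classify: Callable[[str, str], str],
    states: Sequence[str],
    top1_clauses: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Accuracy of the classifier when conditioned on retrieved (not gold) clauses."""
    correct = sum(classify(s, c) == a for s, c, a in zip(states, top1_clauses, labels))
    return correct / len(labels)

# Hypothetical clause table: several distinct clauses imply the same action.
CLAUSE_ACTION: Dict[str, str] = {
    "c1": "allow", "c2": "allow", "c3": "refuse", "c4": "refuse", "c5": "verify",
}

def toy_classifier(state: str, clause_id: str) -> str:
    """Stand-in for f_phi: reads the action implied by the clause it is given."""
    return CLAUSE_ACTION[clause_id]

states: List[str] = ["s0", "s1", "s2", "s3"]
gold: List[str] = ["c1", "c3", "c5", "c2"]
labels: List[str] = ["allow", "refuse", "verify", "allow"]
# Retriever output: usually a non-gold clause that implies the same action.
ranked: List[List[str]] = [["c2", "c1"], ["c4", "c3"], ["c5", "c1"], ["c1", "c3"]]

r1 = recall_at_k(ranked, gold, k=1)
acc = downstream_accuracy(toy_classifier, states, [r[0] for r in ranked], labels)
print(f"recall@1={r1:.2f}  downstream accuracy={acc:.2f}")
```

In this constructed example the retriever misses the gold clause for three of four states, yet every retrieved clause implies the correct action, so the two metrics disagree about retriever quality.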

2. Proxy retrieval metrics versus downstream policy signal

The sharpest empirical critique of proxy retrieval metrics comes from tau-bench policy classification. In that setup, the domains are $\tau$-bench-airline, with 122 policy clauses and 85 test states across 15 tasks, and $\tau^2$-bench-retail, with 51 clauses and 40 test states. Retrievers include MiniLM, bge-large, e5-large, and a bge-reranker cross-encoder; classifiers are supervised fine-tuned Qwen2.5-3B and Qwen2.5-7B on a structured 4-field state, with a frozen-MiniLM logistic-regression probe for diagnostics (Ding et al., 22 Jun 2026).

The paper evaluates three-way macro-F1 over allow, verify, and refuse. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13–0.17 after tuning, and the reported representation ablation for Qwen-3B gives raw versus structured macro-F1 of 0.293 versus 0.601, with paired $\Delta=+0.308$ and 95% CI $=[+0.237,+0.380]$; structured versus raw+policy yields $\Delta=+0.171$ (Ding et al., 22 Jun 2026). At test time, the policy field is replaced by a top-1 retrieved clause, the gold clause, a mismatched clause, or no policy line.

The central result is that exact-match recall is low, but downstream policy signal remains high. MiniLM achieves recall@1 $\approx 0.07$ and recall@5 $\approx 0.16$ on airline, while even the cross-encoder reranker reaches recall@5 $\approx 0.18$. Yet in the direct retrieved-policy intervention with Qwen-3B, macro-F1 under MiniLM top-1 and bge-large top-1 clauses stays close to the gold-policy value, each reported with a paired difference and 95% CI, whereas the mismatched-policy and no-policy conditions fall below it (Ding et al., 22 Jun 2026).

These numbers support a narrow but important claim: in this benchmark configuration, exact-match clause recall can underestimate downstream utility. The paper does not establish non-inferiority, because the interval remains too wide, but it does not detect a macro-F1 difference between retrieved and gold clauses in the main configuration (Ding et al., 22 Jun 2026). This suggests that context-aligned non-gold clauses may carry substantial policy signal even when exact-match retrieval fails.
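The kind of comparison discussed here, a paired macro-F1 difference with a 95% interval, can be sketched as follows. This is a generic percentile bootstrap under stated assumptions, not the paper's evaluation code: the resampling scheme, `n_boot`, the seed, and the toy predictions are all illustrative, and a label absent from a resample is scored as F1 = 0 by convention.

```python
import random
from typing import List, Sequence, Tuple

LABELS = ("allow", "verify", "refuse")

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels=LABELS) -> float:
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    f1s: List[float] = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def paired_bootstrap_delta(
    y_true: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and 95% percentile interval for macro-F1(a) - macro-F1(b).

    The same resampled indices are used for both systems, which is what makes it paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    point = macro_f1(y_true, pred_a) - macro_f1(y_true, pred_b)
    deltas: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, a) - macro_f1(t, b))
    deltas.sort()
    low = deltas[int(0.025 * n_boot)]
    high = deltas[min(n_boot - 1, int(0.975 * n_boot))]
    return point, low, high

# Toy predictions under gold versus retrieved clauses (illustrative only).
y_true = ["allow", "verify", "refuse", "allow", "verify", "refuse", "allow", "verify", "refuse"]
pred_gold = ["allow", "verify", "refuse", "allow", "verify", "allow", "allow", "verify", "refuse"]
pred_retr = ["allow", "verify", "refuse", "allow", "refuse", "refuse", "allow", "verify", "refuse"]

delta, ci_low, ci_high = paired_bootstrap_delta(y_true, pred_retr, pred_gold)
print(f"delta={delta:+.3f}  95% CI=[{ci_low:+.3f}, {ci_high:+.3f}]")
```

An interval that contains zero only means no difference was detected. Establishing non-inferiority would additionally require the lower bound to sit above a pre-specified margin, which is why a wide interval cannot support that stronger claim.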

3. Retrieval as sequential control and adaptive stopping

A second line of work formulates retrieval itself as a decision process. In prior authorization, adaptive retrieval is modeled as an MDP in which the state is a 768-dimensional concatenation of a request embedding and the mean-pooled embedding of all retrieved chunks so far; each action either selects a chunk from the top-$K$ cosine-ranked candidates or issues STOP, under a fixed candidate count and episode horizon (Sharifullin et al., 6 Apr 2026). Each retrieval incurs a step cost $\lambda$, and STOP yields one of two terminal rewards depending on oracle correctness, so the total return is

$G = r_{\mathrm{STOP}} - \lambda\, n,$

where $n$ is the number of retrieval steps taken and $r_{\mathrm{STOP}}$ is the terminal reward received at STOP.

The offline dataset contains approximately 8,352 transitions from mixtures of Fixed-K, heuristic, and $\epsilon$-greedy logging policies over 2,000 train episodes. Under the reported configuration, CQL attains 92.0% accuracy with 20.0 steps; BC matches CQL; IQL yields 62.5% accuracy with 3.4 steps; and transition-level DPO attains 92.0% accuracy with 10.6 steps, occupying the reported “selective-accurate” region on the Pareto frontier (Sharifullin et al., 6 Apr 2026). A $\lambda$ ablation shows that CQL shifts from exhaustive to selective retrieval at only one of the tested step costs, reducing steps from 20.0 to 14.9 with a 0.5 percentage-point accuracy drop (Sharifullin et al., 6 Apr 2026).
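The select-or-STOP structure and the accuracy-versus-cost trade-off can be made concrete with a small sketch. Everything numeric here is an assumption for illustration: the step cost of 0.05, the terminal rewards of +1 and -1, and the `stop_after_three` policy are hypothetical and are not the paper's settings. Only the step counts 20.0 and 10.6 come from the text.

```python
from typing import Callable, List, Sequence

STOP = -1  # sentinel action; any non-negative action indexes a candidate chunk

def rollout(
    policy: Callable[[List[str], Sequence[str]], int],
    candidates: Sequence[str],
    horizon: int,
) -> List[str]:
    """Run one retrieval episode: select chunks until the policy stops or the horizon ends."""
    selected: List[str] = []
    for _ in range(horizon):
        action = policy(selected, candidates)
        if action == STOP:
            break
        selected.append(candidates[action])
    return selected

def episode_return(
    n_retrievals: int,
    correct: bool,
    step_cost: float,
    r_correct: float = 1.0,   # assumed terminal reward for a correct decision
    r_wrong: float = -1.0,    # assumed terminal reward for an incorrect decision
) -> float:
    """Terminal reward at STOP minus the accumulated per-step retrieval cost."""
    terminal = r_correct if correct else r_wrong
    return terminal - step_cost * n_retrievals

def relative_step_reduction(baseline_steps: float, policy_steps: float) -> float:
    """Fraction of retrieval steps saved relative to a baseline policy."""
    return (baseline_steps - policy_steps) / baseline_steps

def stop_after_three(selected: List[str], candidates: Sequence[str]) -> int:
    """Hypothetical policy: take candidates in rank order, then stop after three."""
    return STOP if len(selected) >= 3 else len(selected)

chunks = rollout(stop_after_three, ["k0", "k1", "k2", "k3", "k4"], horizon=5)
ret = episode_return(len(chunks), correct=True, step_cost=0.05)
reduction = relative_step_reduction(20.0, 10.6)
print(len(chunks), round(ret, 2), f"{reduction:.0%}")
```

With the step counts quoted above, a policy that averages 10.6 steps uses 47% fewer retrievals than one that averages 20.0, so at equal accuracy its return is higher by the saved step costs under any positive step cost.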

Cost-aware retrieval-augmented reasoning models extend the same principle to mixed reasoning-and-retrieval trajectories. The agent conditions on a running token history; its actions are token generation, <search>, or <more info> with an adaptive retrieval depth; retrieved documents are appended to the token history; and the agent terminates at </answer> or once a retrieval budget is exhausted (Hashemi et al., 17 Oct 2025). Two costs are defined: a memory-bound cost measured in total tokens, and a latency-bound cost parameterized by two constants expressed in milliseconds. The policy is trained with a cost-aware advantage that accounts for these costs alongside task reward.

On seven public QA datasets, the reported outcome is an average exact-match increase from 43.1% to 48.2% and latency reductions of approximately 16–20%, with NQ dropping from 88.8 ms to 70.4 ms and MuSiQue from 103.6 ms to 87.6 ms under the latency-bound setting (Hashemi et al., 17 Oct 2025).
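One simple way to fold cost into a policy-gradient signal is to subtract a weighted cost from each trajectory's reward before normalizing within a sampled group. The sketch below assumes exactly that and nothing more: the linear penalty, the `cost_weight` value, and the group normalization are illustrative choices and should not be read as the formulation used by Hashemi et al. The second helper simply recomputes the per-dataset latency reductions from the figures quoted above.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def cost_penalized_advantages(
    rewards: Sequence[float],
    costs: Sequence[float],
    cost_weight: float,
    eps: float = 1e-8,
) -> List[float]:
    """Group-normalized advantages over (reward - cost_weight * cost).

    Assumed form: a linear cost penalty followed by mean/std normalization
    across trajectories sampled for the same question.
    """
    penalized = [r - cost_weight * c for r, c in zip(rewards, costs)]
    mu = mean(penalized)
    sigma = pstdev(penalized)
    return [(p - mu) / (sigma + eps) for p in penalized]

def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# Three sampled trajectories: two correct answers with different token costs, one wrong.
adv = cost_penalized_advantages(
    rewards=[1.0, 1.0, 0.0],
    costs=[100.0, 300.0, 50.0],   # hypothetical cost units, e.g. total tokens
    cost_weight=0.001,
)

nq = relative_reduction(88.8, 70.4)          # NQ latency, ms
musique = relative_reduction(103.6, 87.6)    # MuSiQue latency, ms
print([round(a, 3) for a in adv], f"{nq:.1%}", f"{musique:.1%}")
```

Under this penalty the cheaper of two equally correct trajectories receives the larger advantage, which is the behavior a cost-aware objective is meant to encourage.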

These formulations convert retrieval depth, stopping, and evidence accumulation into explicit policy variables. A plausible implication is that “retrieval-aware” optimization often concerns control over when and how much to retrieve, not only ranking quality at a single step.

4. Preference optimization, hybrid exploration, and sequence-level policy gradients

Several papers treat retrieval-aware optimization as a problem of aligning policy updates with retrieval-conditioned feedback. In RAG, Retrieval Preference Optimization derives a reward model with two terms, the second of which is an implicit representation of retrieval relevance (Yan et al., 23 Jan 2025). The practical loss augments a DPO-style preference objective with a length-normalized retrieval term, using a “+” sign when the non-parametric answer is preferred and a “−” sign when the parametric answer is preferred. On PopQA, NQ, TriviaQA, and RGB, Retrieval Preference Optimization with LLaMA3-8B-instruct reports 65.4%, 51.9%, 74.4%, and 100.0% accuracy respectively, versus 59.0%, 41.3%, 65.8%, and 96.3% for RAG, with one LLM call at inference (Yan et al., 23 Jan 2025).

RAPO addresses a different failure mode: pure on-policy exploration in agentic RL. Its Hybrid-policy Agentic Rollout interleaves on-policy steps with retrieved off-policy step traces from a Step-Trace Buffer, using a 0.5/0.5 hybrid distribution (Zhang et al., 3 Mar 2026). Retrieval usefulness is quantified via an entropy-drop reward, and the policy update uses a combined advantage together with token-level importance shaping by the fraction of retrieved tokens. Across fourteen datasets, RAPO reports a +5.0% average gain and approximately 1.2× faster training, with rollout wall-time down 20%, policy-update time down 15%, total generated tokens down 18%, and tool calls per step down 25% (Zhang et al., 3 Mar 2026).

PaperScout’s PSPO attacks a granularity mismatch between token-level PPO and multi-turn retrieval agents. The full retrieval trajectory is a sequence of turns, where the action at each turn $t$ is the complete model response, including tool calls and reasoning trace (Pan et al., 15 Jan 2026). PSPO treats each whole response as one atomic action, defines an importance ratio at the sequence level rather than per token, and applies a clipped surrogate at that granularity. On RealScholarQuery, recall rises from 0.537 for PPO to 0.557 for GSPO and 0.574 for PSPO; LLM-score rises from 2.417 to 2.510 to 2.576; and training is reported to converge faster with smaller actor gradient norms and lower critic loss (Pan et al., 15 Jan 2026).
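The idea of a response-level importance ratio with a clipped surrogate can be sketched generically. The code below assumes the plain ratio of whole-response likelihoods under the new and old policies, computed from summed token log-probabilities, followed by the standard PPO clipping rule. Whether PSPO applies length normalization or other adjustments is not stated in this text, so the function names, the clip range of 0.2, and the toy log-probabilities are all illustrative.

```python
import math
from typing import Sequence

def sequence_ratio(new_token_logps: Sequence[float], old_token_logps: Sequence[float]) -> float:
    """Importance ratio for a whole response treated as one action.

    Summing token log-probs gives the log-likelihood of the full response,
    so the ratio is exp(log pi_new(response) - log pi_old(response)).
    """
    return math.exp(sum(new_token_logps) - sum(old_token_logps))

def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective evaluated once per response instead of once per token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# One turn of a multi-turn search agent: the response spans three tokens.
old_logps = [-1.2, -0.7, -2.0]
new_logps = [-1.0, -0.6, -1.9]

ratio = sequence_ratio(new_logps, old_logps)      # exp(0.4), above the clip range
objective = clipped_surrogate(ratio, advantage=1.0)
print(round(ratio, 3), round(objective, 3))
```

Because one ratio and one advantage are attached to the entire response, credit is assigned at the same granularity as the agent's tool-calling decisions, which is the mismatch the paragraph above describes.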

Taken together, these methods replace generic policy optimization with retrieval-conditioned feedback: retrieval relevance in RAG, retrieval-illuminating exploration in agentic RL, and sequence-level credit assignment in multi-turn search agents.

5. Semi-parametric retrieval policies and continuous generative retrievers

Retrieval-aware policy optimization also appears in architectures where retrieval is part of the policy representation. DARP reparameterizes imitation learning around local neighborhood structure rather than a global state-to-action map (Pfeifer et al., 8 Jun 2026). For a query state, the policy retrieves the nearest expert demonstrations, forms the offsets between the query and each retrieved neighbor, computes one difference-aware action proposal per neighbor, and aggregates the proposals into a single action.

The model is trained with the standard BC objective, without an additional smoothness hyperparameter. Across MuJoCo locomotion, Robosuite, RoboCasa, real FurnitureBench, vision-based manipulation with R3M embeddings, and Push-T, DARP improves over standard BC by 15–46% in return or success rate; performance rises sharply up to a certain number of retrieved neighbors, difference vectors are crucial, and the reported component ablation drops success by 20–30% (Pfeifer et al., 8 Jun 2026).
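The retrieve, offset, propose, and aggregate steps can be sketched in a few lines. This is a minimal illustration under strong assumptions, not DARP itself: Euclidean distance stands in for the state metric, `linear_proposal` is a hypothetical replacement for the learned proposal network, and uniform averaging is an assumed aggregation rule.

```python
import math
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[Sequence[float], Sequence[float]]  # (expert state, expert action)

def k_nearest(query: Sequence[float], demos: Sequence[Demo], k: int) -> List[Demo]:
    """Return the k demonstrations whose states are closest to the query (Euclidean)."""
    return sorted(demos, key=lambda d: math.dist(query, d[0]))[:k]

def difference_aware_action(
    query: Sequence[float],
    demos: Sequence[Demo],
    k: int,
    proposal_fn: Callable[[Sequence[float], Sequence[float], Sequence[float]], List[float]],
) -> List[float]:
    """Propose one action per retrieved neighbor from its offset, then average the proposals."""
    proposals: List[List[float]] = []
    for state_i, action_i in k_nearest(query, demos, k):
        offset = [q - s for q, s in zip(query, state_i)]   # difference vector
        proposals.append(proposal_fn(state_i, action_i, offset))
    dim = len(proposals[0])
    return [sum(p[j] for p in proposals) / len(proposals) for j in range(dim)]

def linear_proposal(
    state: Sequence[float], action: Sequence[float], offset: Sequence[float], gain: float = 0.5
) -> List[float]:
    """Hypothetical proposal rule: shift the neighbor's action by a gain on the summed offset."""
    shift = gain * sum(offset)
    return [a + shift for a in action]

demos: List[Demo] = [
    ([0.0, 0.0], [1.0]),
    ([1.0, 0.0], [2.0]),
    ([5.0, 5.0], [9.0]),   # far from the query, should not be retrieved for k=2
]
action = difference_aware_action([0.5, 0.0], demos, k=2, proposal_fn=linear_proposal)
print(action)
```

Passing the offset to the proposal function is what lets each neighbor's action be corrected for how far the query sits from that neighbor. The sketch also makes the stated limitations visible: every call pays for a nearest-neighbor search, and the result depends on the distance metric being meaningful.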

A more explicitly retrieval-generative formulation is MO-DiT+HPPO. Here the policy is a continuous distribution over query embeddings in a frozen item-embedding space, generated by integrating a learned velocity field via flow matching from Gaussian noise to a query embedding (Liu et al., 25 Jun 2026). The true online objective is the intersection density, which measures how many top-$K$ retrieved items simultaneously satisfy the target attribute and remain in the same pattern. HPPO constructs a hybrid candidate pool from static tail-centroid constructions and policy samples under several classifier-free guidance scales; labels winner/loser pairs by the online intersection metric; enforces a Pareto filter so winners do not lower same-pattern share; and applies a reference-anchored DPO-style loss plus an anchor loss to remain near the tail-centroid SFT solution (Liu et al., 25 Jun 2026).
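The pair-labeling step with a Pareto filter can be illustrated on toy retrieval lists. The sketch assumes each retrieved item is an `(attribute, pattern)` tuple, that the joint metric is the share of top-K items meeting both conditions, and that the filter compares the winner's same-pattern share against the loser's. These are illustrative readings of the description above, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Item = Tuple[str, str]  # (attribute, pattern) of one retrieved item

def pattern_share_at_k(items: Sequence[Item], pattern: str, k: int) -> float:
    """Share of the top-k items that stay in the given pattern."""
    return sum(p == pattern for _, p in items[:k]) / k

def joint_at_k(items: Sequence[Item], target_attr: str, pattern: str, k: int) -> float:
    """Share of the top-k items that satisfy the target attribute AND the pattern."""
    return sum(a == target_attr and p == pattern for a, p in items[:k]) / k

def label_pair(
    cand_a: Sequence[Item],
    cand_b: Sequence[Item],
    target_attr: str,
    pattern: str,
    k: int,
) -> Optional[Tuple[Sequence[Item], Sequence[Item]]]:
    """Return (winner, loser) by the joint metric, or None if tied or rejected by the filter.

    Assumed Pareto filter: a winner must not have a lower same-pattern share than the loser.
    """
    ja = joint_at_k(cand_a, target_attr, pattern, k)
    jb = joint_at_k(cand_b, target_attr, pattern, k)
    if ja == jb:
        return None
    winner, loser = (cand_a, cand_b) if ja > jb else (cand_b, cand_a)
    if pattern_share_at_k(winner, pattern, k) < pattern_share_at_k(loser, pattern, k):
        return None
    return winner, loser

# Toy top-4 lists for three candidate query embeddings.
a: List[Item] = [("red", "p0"), ("red", "p0"), ("red", "p1"), ("blue", "p1")]
b: List[Item] = [("red", "p0"), ("blue", "p0"), ("blue", "p0"), ("blue", "p0")]
c: List[Item] = [("red", "p0"), ("red", "p0"), ("red", "p0"), ("blue", "p0")]

rejected = label_pair(a, b, "red", "p0", k=4)   # a wins on joint but drifts out of pattern
accepted = label_pair(c, b, "red", "p0", k=4)   # c wins on joint and keeps pattern share
print(rejected, accepted is not None)
```

Candidate `a` beats `b` on the joint metric only by trading away same-pattern share, so the pair is discarded rather than used as a preference label. Without such a filter, repeated preference updates could keep rewarding that trade.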

The stage-wise empirical picture is explicit. Raw-sequence pretraining moves Attr@K and Joint@K from near-zero to approximately 6%; multi-domain metric-ordered CPT adds approximately 1–2 percentage points in Joint@K; tail-centroid SFT adds approximately 3–7 percentage points; and HPPO adds approximately another 6–12 percentage points on D1–D3, with paired-bootstrap significance reported on 7/8 cells (Liu et al., 25 Jun 2026). Ordering sequences by ascending predicted density outperforms random or descending order by 2–4 percentage points in Joint@K, and iterating DPO without the Pareto filter collapses same-pattern purity (Liu et al., 25 Jun 2026).

These two lines differ sharply in representation—retrieved expert exemplars versus generated query embeddings—but both embed retrieval structure directly into policy computation rather than treating retrieval as a detached front end.

6. Empirical regularities, misconceptions, and unresolved issues

Several recurrent empirical regularities emerge across the literature. First, exact-match retrieval metrics can be weak surrogates for downstream control or classification quality: tau-bench policy classification shows near-oracle macro-F1 under retrieved clauses despite MiniLM recall@1 of approximately 0.07 (Ding et al., 22 Jun 2026). Second, fixed retrieval depth can be dominated by adaptive policies: in prior authorization, transition-level DPO matches 92.0% accuracy while using 47% fewer retrieval steps than exhaustive CQL or BC; in reasoning models, adaptive retrieval depth yields higher exact match together with lower latency (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025). Third, pure on-policy exploration is not always sufficient in agentic settings, motivating hybrid rollouts over retrieved traces (Zhang et al., 3 Mar 2026). Fourth, token-level RL can misalign with sequence-level retrieval interaction, motivating PSPO’s response-level action abstraction (Pan et al., 15 Jan 2026).

Several controversies or cautions are equally explicit. The tau-bench study states that it does not detect a macro-F1 difference between retrieved and gold clauses, but the confidence interval is too wide to establish non-inferiority (Ding et al., 22 Jun 2026). HPPO analyses show that off-policy one-round DPO can improve over SFT, but repeated iteration without the Pareto filter collapses same-pattern purity, which is described as reward hacking (Liu et al., 25 Jun 2026). DARP identifies extra retrieval overhead and the need for a meaningful distance metric over the state space as limitations, even while showing strong gains (Pfeifer et al., 8 Jun 2026). In healthcare-style prior authorization, the same study emphasizes three operating regimes—“Exhaustive,” “Efficient,” and “Selective-Accurate”—rather than one universally optimal retrieval strategy (Sharifullin et al., 6 Apr 2026).

A plausible synthesis is that retrieval-aware policy optimization is best understood not as a single algorithm but as a design doctrine: place retrieval inside the policy loop, expose it to downstream reward, and optimize the combined system at the granularity on which decisions are actually made. The published instantiations differ—classification F1, exact-match accuracy, Joint@K, sequence return, or imitation loss—but they converge on the same methodological claim: retrieval should be judged by what it enables the policy to do, not only by whether it reproduced a designated retrieval target (Ding et al., 22 Jun 2026, Yan et al., 23 Jan 2025, Sharifullin et al., 6 Apr 2026, Pan et al., 15 Jan 2026, Zhang et al., 3 Mar 2026, Pfeifer et al., 8 Jun 2026, Liu et al., 25 Jun 2026).