Retrieval-Aware Policy Optimization

Updated 4 July 2026
  • The surveyed work shows that retrieval-aware policy optimization uses retrieval-conditioned feedback to improve downstream metrics (e.g., macro-F1) instead of optimizing exact-match retrieval alone.
  • It details methods like pre-action policy classification, sequential decision-making for adaptive stopping, and hybrid on/off-policy exploration to balance accuracy and cost.
  • Reported results indicate that adaptive retrieval strategies and placing retrieval directly inside the policy loop improve decision quality on the benchmarks studied.

Retrieval-aware policy optimization denotes a family of methods in which retrieval is optimized, evaluated, or embedded according to its contribution to a downstream policy objective rather than treated as a fixed preprocessing stage or judged only by proxy retrieval metrics. In recent work, this idea appears in several forms: pre-action policy classification that replaces gold policy clauses with retrieved clauses at test time (Ding et al., 22 Jun 2026); retrieval-augmented generation with an explicit retrieval-relevance term inside preference optimization (Yan et al., 23 Jan 2025); adaptive document or policy-chunk selection formulated as sequential decision-making (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025); sequence-level reinforcement learning for multi-turn retrieval agents (Pan et al., 15 Jan 2026); hybrid on-policy/off-policy exploration for agentic reasoning (Zhang et al., 3 Mar 2026); semi-parametric imitation policies that retrieve expert neighbors at inference (Pfeifer et al., 8 Jun 2026); and continuous generative retrieval policies aligned to an online intersection metric via HPPO (Liu et al., 25 Jun 2026). Across these settings, the common pattern is optimization against end-task reward, action quality, or decision accuracy, rather than exact-match retrieval alone.

1. Conceptual scope and canonical formulations

A concise statement of the paradigm is given in work on long-horizon tool-use agents: instead of maximizing exact-match recall of a “gold” policy clause and hoping that this correlates with downstream decision quality, retrieval-aware policy optimization trains or evaluates the retriever by how well the downstream classifier performs when conditioned on retrieved rather than gold clauses (Ding et al., 22 Jun 2026). The same shift appears in RAG, where the policy is optimized not only for answer preference but also for an implicit representation of retrieval relevance (Yan et al., 23 Jan 2025), and in adaptive retrieval systems, where the retrieval process itself is the policy and receives reward for balancing correctness against retrieval cost (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025).

The resulting design space is broad but structurally coherent.

| Setting | Retrieval decision | Optimized quantity |
|---|---|---|
| Pre-action policy classification | Retrieved policy clause | Macro-F1 |
| Prior authorization | Select chunk or STOP | Accuracy and retrieval cost |
| Academic paper search | Search/Expand tool calls | Discounted sequence return |
| Retrieval-augmented generation | Retrieval-conditioned response | Preference loss with retrieval term |
| Imitation learning | Retrieve expert neighbors | Behavior-cloning objective |
| Generative retrieval | Generate query embeddings | Intersection density / Joint@K |

Two canonical objectives illustrate the contrast. A recall-based retriever is optimized as

$$\theta_{\mathrm{recall}}=\arg\max_\theta \mathbb{E}_{(s,c_{\mathrm{gold}})}\big[\mathbf{1}\{c_{\mathrm{gold}}\in \mathrm{Top}\text{-}k(p_\theta(\cdot\mid s))\}\big],$$

whereas retrieval-aware optimization for policy classification is written as

$$\theta^*=\arg\min_\theta \mathbb{E}_{(s,a^*)\sim D}\big[L_{\mathrm{cls}}(f_\phi(s,c\sim p_\theta(\cdot\mid s)),a^*)\big].$$

This formal distinction is explicit in (Ding et al., 22 Jun 2026). In related work, the same principle is realized through preference losses, value-based offline RL, sequence-level PPO variants, and semi-parametric architectures rather than through recall objectives alone (Yan et al., 23 Jan 2025, Sharifullin et al., 6 Apr 2026, Pan et al., 15 Jan 2026, Pfeifer et al., 8 Jun 2026, Liu et al., 25 Jun 2026).
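The contrast between the two objectives can be made concrete with a minimal sketch that scores the same retriever output in both ways. Everything here is illustrative: the toy clauses, the `classify` stand-in for the classifier $f_\phi$, and the helper names are assumptions, not artifacts of the cited work.

```python
from typing import Callable, Sequence

def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of states whose gold clause appears among the top-k retrieved clauses."""
    hits = sum(1 for r, g in zip(ranked, gold) if g in r[:k])
    return hits / len(gold)

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def downstream_macro_f1(states, actions, ranked,
                        classify: Callable[[str, str], str], labels) -> float:
    """Score a retriever by the classifier's macro-F1 when conditioned on its top-1 clause."""
    preds = [classify(s, r[0]) for s, r in zip(states, ranked)]
    return macro_f1(actions, preds, labels)

# Toy data (hypothetical): two of three top-1 clauses are non-gold paraphrases
# that still carry the correct policy signal.
LABELS = ["allow", "verify", "refuse"]
states = ["s1", "s2", "s3"]
gold = ["c1", "c2", "c3"]
actions = ["allow", "verify", "refuse"]
ranked = [["c1b", "c9"], ["c2b", "c8"], ["c3", "c7"]]
clause_label = {"c1": "allow", "c1b": "allow", "c2": "verify", "c2b": "verify", "c3": "refuse"}
classify = lambda state, clause: clause_label.get(clause, "refuse")

r1 = recall_at_k(ranked, gold, k=1)
f1 = downstream_macro_f1(states, actions, ranked, classify, LABELS)
```

In this toy case exact-match recall@1 is one third while the downstream macro-F1 is perfect, which is the kind of gap the retrieval-aware objective is meant to expose.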

2. Proxy retrieval metrics versus downstream policy signal

The sharpest empirical critique of proxy retrieval metrics comes from tau-bench policy classification. In that setup, the domains are $\tau$-bench-airline, with 122 policy clauses and 85 test states across 15 tasks, and $\tau^2$-bench-retail, with 51 clauses and 40 test states. Retrievers include MiniLM, bge-large, e5-large, and a bge-reranker cross-encoder; classifiers are supervised fine-tuned Qwen2.5-3B and Qwen2.5-7B on a structured 4-field state, with a frozen-MiniLM logistic-regression probe for diagnostics (Ding et al., 22 Jun 2026).

The paper evaluates three-way macro-F1 over allow, verify, and refuse. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13–0.17 after tuning, and the reported representation ablation for Qwen-3B gives raw versus structured macro-F1 of 0.293 versus 0.601, with paired $\Delta=+0.308$ and 95% CI $[+0.237,+0.380]$; structured versus raw+policy yields $\Delta=+0.171$ (Ding et al., 22 Jun 2026). At test time, the policy field is replaced by a top-1 retrieved clause, the gold clause, a mismatched clause, or no policy line.

The central result is that exact-match recall is low, but downstream policy signal remains high. MiniLM achieves recall@1 $\approx 0.07$ and recall@5 $\approx 0.16$ on airline, while even the cross-encoder reranker reaches only recall@5 $\approx 0.18$. Yet in the direct retrieved-policy intervention with Qwen-3B, macro-F1 under MiniLM top-1 and bge-large top-1 clauses is reported with paired differences and 95% confidence intervals against the gold-policy condition, while the mismatched-policy and no-policy conditions fall to lower scores (Ding et al., 22 Jun 2026).

These numbers support a narrow but important claim: in this benchmark configuration, exact-match clause recall can underestimate downstream utility. The paper does not establish non-inferiority, because the interval remains too wide, but it does not detect a macro-F1 difference between retrieved and gold clauses in the main configuration (Ding et al., 22 Jun 2026). This suggests that context-aligned non-gold clauses may carry substantial policy signal even when exact-match retrieval fails.
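The distinction between "no detected difference" and "non-inferiority" can be illustrated with a paired bootstrap over test states. This is a minimal sketch that assumes a percentile bootstrap; the function names, the toy predictions, and the idea of a pre-specified margin are illustrative assumptions and are not taken from the cited paper.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (0.0 for a class with no support or predictions)."""
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def paired_bootstrap_delta(y_true, pred_a, pred_b, labels, n_boot=1000, seed=0):
    """Point estimate and 95% percentile interval for macro-F1(a) - macro-F1(b).

    The same resampled indices are used for both systems, which is what
    makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, a, labels) - macro_f1(t, b, labels))
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    point = macro_f1(y_true, pred_a, labels) - macro_f1(y_true, pred_b, labels)
    return point, lo, hi

# Toy predictions (hypothetical): the retrieved-clause system makes a few extra errors.
LABELS = ["allow", "verify", "refuse"]
y_true = LABELS * 10
pred_gold = list(y_true)
pred_retrieved = list(y_true)
pred_retrieved[0] = "refuse"
pred_retrieved[4] = "allow"

point, lo, hi = paired_bootstrap_delta(y_true, pred_retrieved, pred_gold, LABELS)
same = paired_bootstrap_delta(y_true, pred_gold, pred_gold, LABELS)
```

An interval that contains zero means no difference was detected. Claiming non-inferiority would additionally require the lower bound `lo` to sit above a pre-specified negative margin, which a wide interval on a small test set typically cannot show.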

3. Retrieval as sequential control and adaptive stopping

A second line of work formulates retrieval itself as a decision process. In prior authorization, adaptive retrieval is modeled as an MDP in which the state is a 768-dimensional concatenation of a request embedding and the mean-pooled embedding of all retrieved chunks so far. The action space consists of chunk-selection actions, each of which picks one chunk from the top-ranked cosine candidates, plus a STOP action, under a fixed candidate count and horizon (Sharifullin et al., 6 Apr 2026). Each retrieval incurs a step cost, and STOP yields one of two terminal rewards depending on oracle correctness, so the total return is the terminal reward minus the accumulated step costs.

The offline dataset contains approximately 8,352 transitions from mixtures of Fixed-K, heuristic, and $\epsilon$-greedy logging policies over 2,000 train episodes. Under the main step-cost setting, CQL attains 92.0% accuracy with 20.0 steps; BC matches CQL; IQL yields 62.5% accuracy with 3.4 steps; and transition-level DPO attains 92.0% accuracy with 10.6 steps, occupying the reported “selective-accurate” region on the Pareto frontier, with a return also reported for each method (Sharifullin et al., 6 Apr 2026). A $\lambda$ ablation shows that CQL shifts from exhaustive to selective retrieval only at a specific step-cost setting, reducing steps from 20.0 to 14.9 with a 0.5 percentage-point accuracy drop (Sharifullin et al., 6 Apr 2026).
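The return structure described above (a per-retrieval step cost plus a terminal STOP reward) can be sketched in a few lines. The reward magnitudes and the step cost used below are placeholder assumptions, since the specific values are not given here; only the step counts come from the text.

```python
def episode_return(n_retrievals: int, correct: bool, step_cost: float,
                   reward_correct: float = 1.0, reward_wrong: float = -1.0) -> float:
    """Terminal STOP reward minus accumulated step costs.

    reward_correct, reward_wrong, and step_cost are hypothetical values
    chosen for illustration, not the settings of the cited study.
    """
    terminal = reward_correct if correct else reward_wrong
    return terminal - step_cost * n_retrievals

# With equal accuracy, fewer retrieval steps give a strictly higher return.
exhaustive = episode_return(n_retrievals=20, correct=True, step_cost=0.01)
selective = episode_return(n_retrievals=11, correct=True, step_cost=0.01)

# Relative step saving implied by the reported averages (10.6 versus 20.0 steps).
step_saving = 1 - 10.6 / 20.0
```

The last line reproduces the roughly 47% reduction in retrieval steps at matched accuracy, which is why a cost-sensitive return separates policies that accuracy alone cannot.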

Cost-aware retrieval-augmented reasoning models extend the same principle to mixed reasoning-and-retrieval trajectories. Actions are token generation, <search>, or <more info> with adaptive retrieval depth; retrieved documents are appended to the token history; and the agent terminates at </answer> or once a retrieval budget is reached (Hashemi et al., 17 Oct 2025). Two costs are defined: a memory-bound cost measured in total tokens, and a latency-bound cost built from two per-unit time constants expressed in milliseconds. These costs enter the policy update through a cost-aware advantage.

On seven public QA datasets, the reported outcome is an average exact-match increase from 43.1% to 48.2% and latency reductions of approximately 16–20%, with NQ dropping from 88.8 ms to 70.4 ms and MuSiQue from 103.6 ms to 87.6 ms under the latency-bound setting (Hashemi et al., 17 Oct 2025).
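As a quick check on the quoted figures, the relative latency reductions for the two named datasets and the exact-match gain can be recomputed directly. The helper name is an illustrative assumption; the numbers are the ones quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# Latency in milliseconds under the latency-bound setting, as quoted in the text.
latency_ms = {"NQ": (88.8, 70.4), "MuSiQue": (103.6, 87.6)}
reductions = {name: relative_reduction(b, a) for name, (b, a) in latency_ms.items()}

# Average exact-match gain in percentage points.
em_gain_points = 48.2 - 43.1
```

The two per-dataset reductions come out at about 20.7% for NQ and 15.4% for MuSiQue, in line with the approximate 16–20% range, alongside a 5.1-point gain in average exact match.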

These formulations convert retrieval depth, stopping, and evidence accumulation into explicit policy variables. A plausible implication is that “retrieval-aware” optimization often concerns control over when and how much to retrieve, not only ranking quality at a single step.

4. Preference optimization, hybrid exploration, and sequence-level policy gradients

Several papers treat retrieval-aware optimization as a problem of aligning policy updates with retrieval-conditioned feedback. In RAG, Retrieval Preference Optimization derives a reward model whose second term is the implicit retrieval-relevance representation (Yan et al., 23 Jan 2025). The practical loss augments a DPO-style preference objective with a length-normalized retrieval term, using a “+” sign when the non-parametric answer is preferred and a “−” sign when the parametric answer is preferred. On PopQA, NQ, TriviaQA, and RGB, Retrieval Preference Optimization with LLaMA3-8B-instruct reports 65.4%, 51.9%, 74.4%, and 100.0% accuracy respectively, versus 59.0%, 41.3%, 65.8%, and 96.3% for RAG, with one LLM call at inference (Yan et al., 23 Jan 2025).
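The shape of such a loss can be sketched as a standard DPO margin plus a signed, length-normalized retrieval term. This is only an illustration of the description above: the exact functional form of the retrieval term, the weight `gamma`, and all argument names are assumptions, not the published objective.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss_with_retrieval_term(lp_w: float, lp_l: float, ref_w: float, ref_l: float,
                                 retr_logratio: float, retr_len: int,
                                 nonparametric_preferred: bool,
                                 beta: float = 0.1, gamma: float = 1.0) -> float:
    """DPO-style loss with a hypothetical retrieval-relevance term.

    lp_* / ref_* are sequence log-probabilities of the preferred (w) and
    dispreferred (l) answers under the policy and the reference model.
    retr_logratio is a policy-vs-reference log-ratio on the retrieved
    passage, divided by its token count retr_len (length normalization).
    """
    pref_margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    sign = 1.0 if nonparametric_preferred else -1.0
    retr_term = sign * gamma * retr_logratio / max(retr_len, 1)
    return -log_sigmoid(pref_margin + retr_term)

# With a zero preference margin, the sign of the retrieval term decides
# whether a positive retrieval log-ratio lowers or raises the loss.
neutral = dpo_loss_with_retrieval_term(0, 0, 0, 0, 0.0, 10, True)
helps = dpo_loss_with_retrieval_term(0, 0, 0, 0, 5.0, 10, True)
hurts = dpo_loss_with_retrieval_term(0, 0, 0, 0, 5.0, 10, False)
```

The sign flip is the point of the sketch: the same retrieval evidence is rewarded when the retrieved (non-parametric) answer should win and penalized when the model's parametric answer should win.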

RAPO addresses a different failure mode: pure on-policy exploration in agentic RL. Its Hybrid-policy Agentic Rollout interleaves on-policy steps with retrieved off-policy step traces from a Step-Trace Buffer, using a 0.5/0.5 hybrid distribution (Zhang et al., 3 Mar 2026). Retrieval usefulness is quantified via an entropy-drop reward, and the policy update uses a combined advantage together with token-level importance shaping by the fraction of retrieved tokens. Across fourteen datasets, RAPO reports a +5.0% average gain and approximately 1.2× faster training efficiency, with rollout wall-time down 20%, policy-update time down 15%, total generated tokens down 18%, and tool-calls per step down 25% (Zhang et al., 3 Mar 2026).

PaperScout’s PSPO attacks a granularity mismatch between token-level PPO and multi-turn retrieval agents. The full retrieval trajectory is a sequence of turns, and the action at each turn is the complete model response for that turn, including tool calls and reasoning trace (Pan et al., 15 Jan 2026). PSPO treats each whole response as one atomic action, defines an importance ratio at the sequence level, and applies a clipped surrogate at that granularity. On RealScholarQuery, recall rises from 0.537 for PPO to 0.557 for GSPO and 0.574 for PSPO; LLM-score rises from 2.417 to 2.510 to 2.576; and training is reported to converge faster with smaller actor gradient norms and lower critic loss (Pan et al., 15 Jan 2026).
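A response-level clipped surrogate can be sketched as follows. The sketch assumes the sequence ratio is the plain ratio of whole-response probabilities, with no length normalization; whether PSPO normalizes by length, and how it computes advantages, is not stated here, so treat the function names and defaults as illustrative.

```python
import math
from typing import List, Sequence, Tuple

def sequence_ratio(new_token_logps: Sequence[float], old_token_logps: Sequence[float]) -> float:
    """Importance ratio of a whole response: exp(sum log pi_new - sum log pi_old)."""
    return math.exp(sum(new_token_logps) - sum(old_token_logps))

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective applied to a single (response-level) action."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

Turn = Tuple[Sequence[float], Sequence[float], float]  # (new logps, old logps, advantage)

def trajectory_objective(turns: List[Turn], eps: float = 0.2) -> float:
    """Mean clipped surrogate over turns; each turn's response is one atomic action."""
    total = sum(clipped_surrogate(sequence_ratio(n, o), a, eps) for n, o, a in turns)
    return total / len(turns)

# Two-turn toy trajectory (hypothetical log-probabilities and advantages).
turns = [
    ([-1.0, -2.0], [-1.0, -2.0], 0.5),                  # ratio 1, unclipped
    ([-1.0, -1.0], [-1.0, -1.0 - math.log(2)], 1.0),    # ratio 2, clipped to 1.2
]
objective = trajectory_objective(turns)
```

Compared with token-level PPO, a single ratio and a single clip decision are made per turn, so every token in a response shares the same credit.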

Taken together, these methods replace generic policy optimization with retrieval-conditioned feedback: retrieval relevance in RAG, retrieval-illuminating exploration in agentic RL, and sequence-level credit assignment in multi-turn search agents.

5. Semi-parametric retrieval policies and continuous generative retrievers

Retrieval-aware policy optimization also appears in architectures where retrieval is part of the policy representation. DARP reparameterizes imitation learning around local neighborhood structure rather than a global state-to-action map (Pfeifer et al., 8 Jun 2026). For a query state, the policy retrieves the nearest expert demonstrations, forms offsets between the query and each retrieved neighbor, computes a difference-aware proposal for each neighbor, and aggregates the proposals into the output action.

The model is trained with the standard BC objective, without an additional smoothness hyperparameter. Across MuJoCo locomotion, Robosuite, RoboCasa, real FurnitureBench, vision-based manipulation with R3M embeddings, and Push-T, DARP improves over standard BC by 15–46% in return or success rate; performance rises sharply up to […], difference vectors are crucial, and removing […] drops success by 20–30% (Pfeifer et al., 8 Jun 2026).
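The retrieve, offset, propose, aggregate pattern can be sketched with a fixed stand-in for the learned proposal function. In the sketch below, `propose` is a hand-written first-order correction and the aggregation is a plain mean; both are assumptions made for illustration, whereas the cited method learns its proposals with a BC objective.

```python
import math
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[Sequence[float], Sequence[float]]  # (expert state, expert action)

def knn(query: Sequence[float], demos: List[Demo], k: int) -> List[Demo]:
    """The k demonstrations whose states are closest to the query (Euclidean)."""
    return sorted(demos, key=lambda d: math.dist(query, d[0]))[:k]

def retrieval_policy(query: Sequence[float], demos: List[Demo], k: int,
                     propose: Callable[[Sequence[float], Sequence[float], List[float]], List[float]]
                     ) -> List[float]:
    """Average of per-neighbor proposals, each conditioned on the query-neighbor offset."""
    proposals = []
    for state, action in knn(query, demos, k):
        offset = [q - s for q, s in zip(query, state)]
        proposals.append(propose(state, action, offset))
    dim = len(proposals[0])
    return [sum(p[j] for p in proposals) / len(proposals) for j in range(dim)]

# Hypothetical stand-ins: one proposal uses the offset, the other ignores it.
with_offset = lambda state, action, offset: [a + d for a, d in zip(action, offset)]
without_offset = lambda state, action, offset: list(action)

# Toy 1-D expert whose action equals its state.
demos = [([0.0], [0.0]), ([1.0], [1.0]), ([3.0], [3.0])]
query = [0.25]
aware = retrieval_policy(query, demos, k=2, propose=with_offset)
naive = retrieval_policy(query, demos, k=2, propose=without_offset)
```

On this toy expert the offset-aware version recovers the correct action 0.25, while plain neighbor averaging returns 0.5, a small illustration of why the difference vectors matter.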

A more explicitly retrieval-generative formulation is MO-DiT+HPPO. Here the policy is a continuous distribution over query embeddings in a frozen item-embedding space, generated by integrating a learned velocity field via flow-matching from Gaussian noise to a query embedding (Liu et al., 25 Jun 2026). The true online objective is the intersection density, which is defined through a count of how many top-$K$ retrieved items simultaneously satisfy the target attribute and remain in the same pattern. HPPO constructs a hybrid candidate pool from static tail-centroid constructions and policy samples under several classifier-free guidance scales; labels winner/loser pairs by the online intersection metric; enforces a Pareto filter so winners do not lower same-pattern share; and applies a reference-anchored DPO-style loss plus an anchor loss to remain near the tail-centroid SFT solution (Liu et al., 25 Jun 2026).

The stage-wise empirical picture is explicit. Raw-sequence pretraining moves Attr@K and Joint@K from near-zero to approximately 6%; multi-domain metric-ordered CPT adds approximately 1–2 percentage points in Joint@K; tail-centroid SFT adds approximately 3–7 percentage points; and HPPO adds another approximately 6–12 percentage points on D1–D3, with the reported paired-bootstrap criterion satisfied on 7/8 cells (Liu et al., 25 Jun 2026). Ordering sequences by ascending predicted density outperforms random or descending order by 2–4 percentage points in Joint@K, and iterating DPO without the Pareto filter collapses same-pattern purity (Liu et al., 25 Jun 2026).

These two lines differ sharply in representation—retrieved expert exemplars versus generated query embeddings—but both embed retrieval structure directly into policy computation rather than treating retrieval as a detached front end.

6. Empirical regularities, misconceptions, and unresolved issues

Several recurrent empirical regularities emerge across the literature. First, exact-match retrieval metrics can be weak surrogates for downstream control or classification quality: tau-bench policy classification shows near-oracle macro-F1 under retrieved clauses despite MiniLM recall@1 of approximately 0.07 (Ding et al., 22 Jun 2026). Second, fixed retrieval depth can be dominated by adaptive policies: in prior authorization, transition-level DPO matches 92.0% accuracy while using 47% fewer retrieval steps than exhaustive CQL or BC; in reasoning models, adaptive retrieval depth yields higher exact match together with lower latency (Sharifullin et al., 6 Apr 2026, Hashemi et al., 17 Oct 2025). Third, pure on-policy exploration is not always sufficient in agentic settings, motivating hybrid rollouts over retrieved traces (Zhang et al., 3 Mar 2026). Fourth, token-level RL can misalign with sequence-level retrieval interaction, motivating PSPO’s response-level action abstraction (Pan et al., 15 Jan 2026).

Several controversies or cautions are equally explicit. The tau-bench study states that it does not detect a macro-F1 difference between retrieved and gold clauses, but the confidence interval is too wide to establish non-inferiority (Ding et al., 22 Jun 2026). HPPO analyses show that off-policy one-round DPO can improve over SFT, but repeated iteration without the Pareto filter collapses same-pattern purity, which is described as reward hacking (Liu et al., 25 Jun 2026). DARP identifies extra retrieval overhead and the need for a meaningful distance metric over the state space as limitations, even while showing strong gains (Pfeifer et al., 8 Jun 2026). In healthcare-style prior authorization, the same study emphasizes three operating regimes—“Exhaustive,” “Efficient,” and “Selective-Accurate”—rather than one universally optimal retrieval strategy (Sharifullin et al., 6 Apr 2026).

A plausible synthesis is that retrieval-aware policy optimization is best understood not as a single algorithm but as a design doctrine: place retrieval inside the policy loop, expose it to downstream reward, and optimize the combined system at the granularity on which decisions are actually made. The published instantiations differ—classification F1, exact-match accuracy, Joint@K, sequence return, or imitation loss—but they converge on the same methodological claim: retrieval should be judged by what it enables the policy to do, not only by whether it reproduced a designated retrieval target (Ding et al., 22 Jun 2026, Yan et al., 23 Jan 2025, Sharifullin et al., 6 Apr 2026, Pan et al., 15 Jan 2026, Zhang et al., 3 Mar 2026, Pfeifer et al., 8 Jun 2026, Liu et al., 25 Jun 2026).
