
On-policy Pseudo-document Query Expansion

Updated 21 October 2025
  • The paper introduces OPQE, which generates synthetic passages using on-policy RL to enrich the original query for better retrieval outcomes.
  • It integrates prompt-based query expansion with RL-driven optimization, yielding improvements on metrics such as NDCG@10 and Hit@20 over traditional methods.
  • The study demonstrates reproducibility across benchmarks like MS MARCO and Natural Questions while offering a flexible framework for future IR advancements.

On-policy Pseudo-document Query Expansion (OPQE) is a modern query augmentation paradigm in information retrieval (IR) that leverages LLMs and reinforcement learning to generate synthetic passages (pseudo-documents) that encapsulate or expand the query intent for improved retrieval. Rather than simply rewriting the query, OPQE trains a policy, often in an RL framework, to output a pseudo-document whose concatenation with the original query forms a richer retrieval key, directly optimized for downstream ranking performance. This approach merges the generative flexibility of prompt-based models with the targeted optimization of RL, yielding robust gains over standalone prompting or RL-based rewriting across diverse benchmarks (Xu et al., 20 Oct 2025).

1. Foundations and Motivation

OPQE builds on two foundational strands in retrieval augmentation:

  1. Prompt-based Query Expansion: This technique instructs an LLM to produce either answer snippets or synthetic passages from its parametric knowledge given a query. The resulting pseudo-document is appended to or merged with the original query to mitigate vocabulary mismatch. While highly effective with modern LLMs, its generation is typically fixed at inference time and unresponsive to retrieval-centric objectives.
  2. RL-based Query Rewriting: Here, the LLM is fine-tuned via reinforcement learning, with the objective of directly rewriting the query for maximum retrieval reward—metrics such as recall or NDCG form the RL reward. This approach adapts generation to retrieval context, but rewriting alone often fails to exploit the full expressive power of LLMs to synthesize context-rich evidence.

OPQE generalizes the RL approach by tasking the policy with generating an entire pseudo-document (not just a rewritten query), thereby encoding more nuanced or multi-faceted aspects of the user's information need. The augmented input structure, the original query concatenated with the RL-optimized pseudo-document, supports retrieval via both sparse and dense methods (Xu et al., 20 Oct 2025).
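
As an illustration of this augmented input structure, the following sketch shows how a generated pseudo-document might be concatenated with the original query before retrieval. The `build_retrieval_input` helper and the toy generator are hypothetical stand-ins for the LLM policy, not the paper's implementation.

```python
from typing import Callable

def build_retrieval_input(query: str,
                          generate_pseudo_document: Callable[[str], str]) -> str:
    """Form the composite retrieval key (q, d^H).

    `generate_pseudo_document` stands in for the LLM policy: a fixed
    prompted model in prompt-only expansion, or an RL-tuned policy in OPQE.
    """
    pseudo_doc = generate_pseudo_document(query)
    # The concatenation, not the rewritten query alone, is what the
    # retriever (BM25 or a dense model) scores against the corpus.
    return f"{query} {pseudo_doc}"

# Usage with a trivial stand-in generator:
toy_generator = lambda q: f"A short passage discussing {q} and related background."
print(build_retrieval_input("who discovered penicillin", toy_generator))
```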

2. Methodological Framework

The OPQE framework operates in three steps:

  • Pseudo-document Generation (Policy): The LLM policy $\pi_\theta$ is trained to generate a pseudo-document $d^H$ given the input query $q$, using on-policy RL optimization. The goal is to maximize an expected retrieval reward, such as $r(q, d^H)$, accounting for both the quality of the passage and its effect on retrieval.
  • Query Augmentation: The pseudo-document $d^H$ is concatenated with the query $q$ to form a composite retrieval input $(q, d^H)$. This structure is passed to retrieval systems (e.g., BM25 or a dense retriever), thus leveraging additional context for improved matching.
  • Reward Function and RL Objective: Training employs an RL algorithm, most commonly PPO, with a reward defined as $R(\text{output}) = \text{FormatReward} \cdot \text{RetrievalReward}$, where FormatReward enforces pseudo-document style constraints and RetrievalReward evaluates ranking performance, e.g., via NDCG@10, Recall@20, or Hit@20 as appropriate for the benchmark. A minimal sketch of this reward structure follows this list.
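
The sketch below assumes a simple length-based format check and binary-relevance NDCG@10 as the retrieval metric; the paper's exact format constraints and metric choices vary by benchmark, so these helper functions are illustrative rather than the authors' code.

```python
import math

def format_reward(pseudo_doc: str, max_words: int = 200) -> float:
    """Indicator-style check that the output looks like a concise passage (assumed constraint)."""
    words = pseudo_doc.split()
    return 1.0 if 0 < len(words) <= max_words else 0.0

def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k: int = 10) -> float:
    """Binary-relevance NDCG@k computed over the retrieved ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_doc_ids[:k])
              if doc_id in relevant_doc_ids)
    ideal_hits = min(len(relevant_doc_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def opqe_reward(pseudo_doc, ranked_doc_ids, relevant_doc_ids) -> float:
    """R(output) = FormatReward * RetrievalReward."""
    return format_reward(pseudo_doc) * ndcg_at_k(ranked_doc_ids, relevant_doc_ids)
```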

A prompt template controls output format, typically instructing the LLM to "write a concise Wikipedia-style passage" relevant to the query before concatenation. The generated pseudo-document must adhere to style and content requirements for maximal retrieval reward (Xu et al., 20 Oct 2025).
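
For concreteness, an illustrative prompt template of this kind is sketched below; the exact wording used in the paper is given in its appendix, so this particular phrasing is an assumption.

```python
# Illustrative prompt template (not the paper's exact wording); the
# policy's completion becomes the pseudo-document d^H.
PSEUDO_DOC_PROMPT = (
    "Write a concise, Wikipedia-style passage that provides background "
    "and likely answer content for the following query. Write the "
    "passage only.\n\n"
    "Query: {query}\n"
    "Passage:"
)

prompt = PSEUDO_DOC_PROMPT.format(query="what causes auroras")
```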

3. Benchmarking and Comparative Performance

OPQE is empirically evaluated against:

  • SPQE (Prompt-only Pseudo-document Expansion): Generates pseudo-documents using a fixed prompt without RL fine-tuning.
  • RL Query Rewriting: Directly optimizes a policy to rewrite the query with RL, without expanded synthetic passages.

Findings suggest that OPQE consistently outperforms both alternatives across evidence-seeking (Natural Questions, TriviaQA, SQuAD), ad hoc (MS MARCO, DL19/DL20, FEVER, etc.), and tool retrieval tasks. For ad hoc retrieval, OPQE achieves the highest average NDCG@10; for evidence-seeking, it secures robust improvements in Hit@20. In tool retrieval, OPQE maintains high Completeness@10, demonstrating effective coverage (Xu et al., 20 Oct 2025).

A notable result is that zero-shot prompting with powerful LLMs rivals or even surpasses RL rewriting (especially for answer-style expansion), but OPQE's hybrid method yields further gains by targeting RL signals toward pseudo-document generation rather than rewriting, combining generative structure with optimized retrieval cues (Xu et al., 20 Oct 2025).

4. Implementation Specifics

The OPQE training and inference pipeline involves:

  • Policy Training: RL via PPO, with a learning rate of $1 \times 10^{-6}$, and mini-batch sizes and rollout temperatures specified in the experimental details. The policy can use backbone models such as Qwen2.5-3B/7B or GPT-4o-mini, instantiated with explicit prompt templates.
  • Retrieval Environment: Pyserini (for BM25) and Faiss (for dense retrieval), benchmarking on standard datasets. The retrieval reward is computed from the ranking produced for the augmented query input; a minimal BM25 sketch follows this list.
  • Prompt Design: Prompts clarify the expectation of a concise, Wikipedia-style passage. Prompt templates and rollout hyperparameters are presented fully in the paper's appendix for reproducibility.
  • RL Objective: The reward is a product of an indicator for format adherence and retrieval metric feedback from the IR environment.
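
As a rough sketch of the BM25 side of the retrieval environment referenced above, the snippet below uses Pyserini to retrieve a ranking for the augmented input; the prebuilt index name and cutoff are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a BM25 retrieval step with Pyserini; the prebuilt index
# name and cutoff k are assumptions for illustration only.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

def retrieve(augmented_query: str, k: int = 20):
    """Return the top-k document ids for the concatenated (q, d^H) input."""
    hits = searcher.search(augmented_query, k=k)
    return [hit.docid for hit in hits]

# The resulting ranking is then scored (e.g., NDCG@10 or Hit@20 against
# the benchmark's relevance judgments) to produce the RetrievalReward term.
```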

All code and experimental setups are made available for replication and verification, with benchmarks covering Natural Questions, MS MARCO, FEVER, DL19/DL20, and ToolRet (Xu et al., 20 Oct 2025).

5. Reproducibility and Experimental Evidence

The OPQE methodology is rigorously evaluated:

  • Experimental Setup: Strict replication of prior RL-based and prompt-based query expansion methods (e.g., DeepRetrieval) under consistent conditions.
  • Model Variants: Multiple backbone models and parameter choices are examined, confirming robustness across architectures.
  • Metrics: Performance is reported using NDCG@10, Recall@10, Completeness@10, Hit@20, and related retrieval metrics as appropriate.
  • Transparency: All prompt templates, reward formulations, codebases, and training sequences are made publicly accessible to the research community.

This level of reproducibility supports the reported claims of OPQE’s performance advantages and provides a foundation for further benchmarking and extension (Xu et al., 20 Oct 2025).

6. Impact, Limitations, and Future Research Directions

OPQE sets a precedent for hybrid query augmentation strategies in IR. Its strength lies in merging LLM-driven generative passage synthesis with reward-focused RL optimization, producing query expansions that encode more semantically relevant and context-rich signals than either method alone.

Limitations include potential computational and latency overhead due to RL fine-tuning, reward sparsity (especially in multi-hop or deep retrieval settings), and the need for precise prompt and format design to avoid output drift. Future research directions include:

  • Efficient Lightweight Expansion: Designing prompt-based expansions or RL policies that reduce dependency on large LLMs.
  • Refined RL Objective Functions: Incorporating more precise or composite reward signals, possibly integrating group-based or adaptive RL algorithms.
  • Multi-hop and Complex Retrieval Tasks: Extending augmentation to scenarios demanding layered reasoning or tool-based interaction.
  • Integrating Domain-adaptive Knowledge: Fusing retrieval-adaptive, domain-tailored expansion with OPQE for specialized applications.

The OPQE approach provides a reproducible, demonstrably effective blueprint for query expansion in modern LLM-augmented IR systems, setting the stage for future advances in on-policy, reward-optimized query augmentation paradigms (Xu et al., 20 Oct 2025).
