LLM-Based Query Reformulation

Updated 21 November 2025
  • LLM-based query reformulation is a technique that uses pre-trained language models to convert user queries into semantically enriched variants for improved retrieval.
  • It employs methods such as ensemble prompting, context-steered expansion, and reward-based evolution, achieving improvements of up to 30% in metrics like nDCG and recall.
  • Systems are designed with modular prompt management, backend agnosticism, and rigorous versioning, ensuring reproducibility and scalability in real-world applications.

LLM-based query reformulation denotes a family of techniques in information retrieval (IR), conversational AI, and database systems that employ large pre-trained language models (LLMs) to transform an original user query into a variant that improves retrieval, resolves ambiguity, bridges lexical and semantic mismatches, or optimizes execution. LLM-powered methods now underpin state-of-the-art advances in open-domain IR, e-commerce search, conversational QA, SQL optimization, and knowledge graph exploration, thanks to their ability to synthesize semantically aligned, context-aware, and task-optimized query representations.

1. Core Principles and Taxonomy of LLM-Based Query Reformulation

LLM-based query reformulation is operationalized as the mapping of an initial query $q_0$ to a reformulated query $q_r = R(q_0)$, where $R$ is an operator (usually implemented via prompt engineering, context injection, or policy-guided decoding) (Dhole et al., 27 May 2024, Bigdeli et al., 20 Nov 2025). The goal is to achieve improved task-specific metrics (retrieval accuracy, efficiency, relevance, diversity). Reformulation can be unsupervised (zero-shot), supervised, or utilize hybrid pipelines with retrieval, ranking, or reward-model feedback.

The main subcategories, detailed in Section 3, are ensemble prompting and clustering, query expansion via pseudo-document generation, and reward-based or evolutionary reformulation.

2. Algorithmic Frameworks and Prompt Management

LLM-based query reformulation systems are characterized by modular, prompt-centric architectures. For example, the QueryGym toolkit (Bigdeli et al., 20 Nov 2025) formalizes this as four loosely coupled modules: data adapters, reformulation, retriever adapters, and prompt management. The workflow is:

  1. Load queries and configuration (YAML or dict; supports CLI/Python API)
  2. Optionally, wrap a retrieval backend (e.g., Pyserini, PyTerrier) in a BaseSearcher abstraction for retrieval-agnostic interaction
  3. Choose a reformulation method (via BaseReformulator and registered subclasses—e.g., Query2Doc, GenQREnsemble, CSQE)
  4. For each query in batch:
    • Render a prompt from a centralized, versioned Prompt Bank (YAML with metadata)
    • Optionally retrieve supporting context passages
    • Invoke the LLM with specified parameters ({temperature, max_tokens,...})
    • Parse LLM responses to produce the reformulated query
  5. Store outputs and optionally evaluate using built-in IR metrics

Prompt management is a critical subsystem, supporting versioning, metadata (author, description, associated method), and custom templating with runtime Jinja-style variable insertion. This design ensures reproducibility and traceable lineage across experiments (Bigdeli et al., 20 Nov 2025).
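
As a concrete illustration, below is a minimal sketch of how a versioned prompt-bank entry could be stored and rendered at runtime; the YAML field names and the render_prompt helper are illustrative assumptions rather than QueryGym's exact schema:

import yaml
from jinja2 import Template

# Illustrative prompt-bank entry; field names are assumptions, not QueryGym's exact schema
PROMPT_BANK_YAML = """
genqr_basic:
  version: "1.2"
  author: "example"
  description: "Zero-shot keyword expansion prompt"
  method: "GenQREnsemble"
  template: |
    Suggest expansion terms that improve retrieval for the query: {{ query }}
"""

def render_prompt(bank: dict, name: str, **variables) -> str:
    """Render a named, versioned prompt template with runtime variables."""
    return Template(bank[name]["template"]).render(**variables)

bank = yaml.safe_load(PROMPT_BANK_YAML)
prompt = render_prompt(bank, "genqr_basic", query="treatment for sore throat")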

3. Major Methodological Paradigms

a. Ensemble Prompting and Clustering

GenQREnsemble and GenQRFusion (Dhole et al., 4 Apr 2024, Dhole et al., 27 May 2024) instantiate ensemble prompting:

  • Generate $N$ keyword sets for each query, each under a paraphrased zero-shot instruction (e.g., $I_j$ for $j = 1, \ldots, N$)
  • Aggregate all keywords or retrieve separately and fuse results, either by sum of BM25-style scores or reciprocal rank fusion:

$$s_{\mathrm{RRF}}(d) = \sum_{j=1}^{N} \frac{1}{k + \mathrm{rank}_j(d)}$$

  • Relevance feedback can be incorporated by prepending retrieved passages to the prompt.
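
A minimal sketch of reciprocal rank fusion over the per-instruction rankings, assuming each ranking is a list of document IDs in decreasing score order and using the conventional smoothing constant k = 60:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse N rankings of doc IDs: s_RRF(d) = sum_j 1 / (k + rank_j(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:                      # one ranking per paraphrased instruction I_j
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: fuse three rankings obtained from three reformulations of the same query
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]])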

GenCRF (Seo et al., 17 Sep 2024) advances clustering-based intent diversification:

  • Generate $N$ queries, each under a distinct intent-exploring instruction (contextual, detail-specific, aspect-specific).
  • Cluster generated queries (K=1–3, LLM-driven or embedding-based) and synthesize representative queries per cluster.
  • Aggregate (via cosine similarity or LLM-based scoring) into the final reformulation for retrieval.
  • QERM, a reward classifier, triggers regeneration if retrieval effectiveness is subpar.
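
The clustering-and-aggregation step can be approximated with a short embedding-based sketch; the embed function below is a placeholder for any sentence encoder, and picking the cluster member closest to the original query is a simplification of the paper's LLM- or similarity-based synthesis:

import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    """Placeholder sentence encoder; substitute a real embedding model here."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def cluster_and_aggregate(original_query, generated_queries, n_clusters=3):
    """Cluster generated queries and append one representative per cluster to the original."""
    vecs = embed(generated_queries)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vecs)
    q_vec = embed([original_query])[0]
    representatives = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # cosine similarity of each cluster member to the original query
        sims = vecs[idx] @ q_vec / (np.linalg.norm(vecs[idx], axis=1) * np.linalg.norm(q_vec))
        representatives.append(generated_queries[idx[sims.argmax()]])
    return " ".join([original_query] + representatives)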

b. Query Expansion via Pseudo-Document Generation

Zero-shot generation of pseudo-documents or passages aligning with query intent is implemented in Query2Doc and Q2E (Bigdeli et al., 20 Nov 2025):

  • The prompt instructs the LLM to create a concise, informative passage reflecting the query's topic scope.
  • Expansion terms or longer “pseudo-docs” can be used either to enrich sparse term matching or as input for dense retrievers.
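
A minimal sketch of this expansion style, using the OpenAI Python client as one possible backend; the prompt wording and the query repetition before concatenation are illustrative choices rather than the exact published recipe:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pseudo_doc_expand(query: str, model: str = "gpt-4") -> str:
    """Generate a concise pseudo-passage for the query and append it to the query text."""
    prompt = f"Write a concise, informative passage that addresses the query: {query}"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=128,
    )
    pseudo_doc = resp.choices[0].message.content.strip()
    # Repeat the original query so its terms stay dominant in BM25-style matching
    return f"{query} {query} {query} {pseudo_doc}"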

c. Reward-Based and Evolutionary Reformulation

OptAgent (Handa et al., 4 Oct 2025) applies multi-agent simulation to define an adaptive, human-aligned reward landscape for e-commerce QR:

  • Each candidate rewrite is scored by a population of LLM-based “shopping agents”, aggregating relevance and simulated purchasing engagement across agents with diverse decoding temperatures.
  • Evolutionary operators (crossover, mutation) refine rewrites, guided by aggregate reward:

$$F(q) = w_{10}\, s_{10} + w_a\, s_a + w_p\, n$$

  • Empirical improvements of up to +21.98% over original queries, with crossover dominating the optimization benefit.
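
A schematic sketch of such a reward-guided evolutionary loop; the score_fn, crossover_fn, and mutate_fn arguments are placeholders standing in for the aggregate agent reward F(q) and the paper's LLM-driven operators:

import random

def evolve_rewrites(query, score_fn, crossover_fn, mutate_fn, pop_size=8, generations=5):
    """Evolve candidate query rewrites toward higher aggregate reward."""
    population = [query] + [mutate_fn(query) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)   # score_fn approximates F(q)
        parents = ranked[: pop_size // 2]                         # keep the fitter half
        children = [crossover_fn(*random.sample(parents, 2)) for _ in range(pop_size - len(parents))]
        population = parents + [mutate_fn(c) for c in children]
    return max(population, key=score_fn)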

RL-based pipelines combine offline teacher–student distillation with online RL using LLM-simulated or real feedback to improve coverage, diversity, and engagement (Nguyen et al., 29 Jan 2025).

4. Backend Integration and Evaluation

LLM-based reformulation frameworks are designed for retrieval and application backend agnosticism:

  • Through BaseSearcher abstractions, any IR engine can be used (BM25, dense, hybrid, or custom).
  • Built-in benchmarks include MS MARCO, BEIR, TREC, with standard metrics:
    • Mean Average Precision (MAP)
    • nDCG@k
    • Recall@k
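
The listed metrics can be computed directly from a ranked list of document IDs and a relevance-judgment dictionary; the sketch below uses linear gains and log-base-2 discounting, one common convention among several:

import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc_id -> graded relevance; ranked_ids is the system ranking."""
    dcg = sum(qrels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=100):
    """Fraction of relevant documents (relevance > 0) retrieved in the top k."""
    relevant = {d for d, g in qrels.items() if g > 0}
    return len(relevant & set(ranked_ids[:k])) / len(relevant) if relevant else 0.0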

For query optimization (SQL, KG, SPARQL), systems such as LITHE, R-Bot, LLM-R2, LaPuda, and InteracSPARQL integrate LLMs to propose plan rewrites based on prompt-driven rule enumeration, stepwise application, or plan-guided iteration, always with rigorous semantic/syntactic verification by cost models and rule-based/logic-based equivalence checkers (Dharwada et al., 18 Feb 2025, Sun et al., 2 Dec 2024, Li et al., 19 Apr 2024, Wang et al., 20 Mar 2024, Jian et al., 3 Nov 2025).
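
A highly simplified sketch of this propose-then-verify pattern: an LLM proposes a rewrite, and the candidate is accepted only if it returns the same result multiset as the original on a test database; the propose_rewrite helper is a placeholder, and production systems additionally rely on cost models and formal equivalence checking rather than this necessary-but-not-sufficient test:

import sqlite3
from collections import Counter

def propose_rewrite(sql: str) -> str:
    """Placeholder for an LLM call that enumerates and applies rewrite rules to the query."""
    raise NotImplementedError

def results_match(conn: sqlite3.Connection, original: str, candidate: str) -> bool:
    """Necessary (not sufficient) check: identical result multisets on a test database."""
    return Counter(conn.execute(original).fetchall()) == Counter(conn.execute(candidate).fetchall())

def rewrite_query(conn: sqlite3.Connection, sql: str) -> str:
    candidate = propose_rewrite(sql)
    return candidate if results_match(conn, sql, candidate) else sql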

Example experimental outcomes for IR methods reported by QueryGym (Bigdeli et al., 20 Nov 2025):

| Method        | MAP          | nDCG@10      | Recall@100   |
|:--------------|:------------:|:------------:|:------------:|
| BM25 baseline | 0.109        | 0.302        | 0.612        |
| Query2Doc     | 0.128 (+17%) | 0.345 (+14%) | 0.657 (+7%)  |
| GenQR         | 0.134 (+23%) | 0.359 (+19%) | 0.672 (+10%) |
| CSQE          | 0.142 (+30%) | 0.378 (+25%) | 0.690 (+13%) |

5. Practical Guidelines and Best Practices

Best practices for LLM-based query reformulation include:

  • Standardization of LLM configuration (seed, temperature, max_tokens) to ensure experiment reproducibility (Bigdeli et al., 20 Nov 2025).
  • Rigorous prompt versioning with tracked metadata to enable transparent benchmarking.
  • Retrieval-agnostic experimentation, supporting rapid backend swaps without modifying reformulation logic.
  • For context-based expansions (CSQE, LameR), tuning the number of retrieved context passages (k) balances LLM inference cost against retrieval gains.
  • All outputs—reformulated queries, prompt versions, LLM configs—should be logged and versioned for downstream evaluation and reproducibility.
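
A minimal sketch of the configuration-pinning and logging practices above, writing one JSON record per reformulation; the field names are illustrative:

import json
import time
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LLMConfig:
    model: str = "gpt-4"
    temperature: float = 1.0
    max_tokens: int = 128
    seed: int = 42  # pinned where the backend supports seeding

def log_run(query, reformulation, prompt_version, config, path="runs.jsonl"):
    """Append a fully specified record so every output can be traced to its prompt and config."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "reformulation": reformulation,
        "prompt_version": prompt_version,
        "llm_config": asdict(config),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")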

System extensibility is facilitated via:

  • BaseReformulator subclass registration for new reformulation algorithms
  • Drop-in BaseSearcher wrappers for new retrieval backends (Bigdeli et al., 20 Nov 2025)
  • Central versioned prompt repositories (YAML)
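
A hypothetical sketch of adding a new method; the register_reformulator decorator, the BaseReformulator interface, and the llm_generate helper shown here are assumptions about the toolkit's API rather than its documented surface:

from querygym import BaseReformulator, register_reformulator  # names assumed, not verified

@register_reformulator("synonym_expander")
class SynonymExpander(BaseReformulator):
    """Toy method: append LLM-suggested related terms to every query."""

    def reformulate(self, query: str) -> str:
        # llm_generate stands in for whatever LLM helper the base class exposes
        terms = self.llm_generate(f"List five synonyms or related terms for: {query}")
        return f"{query} {terms}"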

6. Limitations, Open Challenges, and Research Frontiers

While LLM-based QR systems now realize substantial retrieval and execution improvements, several challenges persist:

  • Cost and Latency: LLM inference incurs nontrivial expense, especially with large models or ensemble-based/multi-agent techniques (e.g., OptAgent’s 6,100 LLM calls per query) (Handa et al., 4 Oct 2025).
  • Non-determinism: Even with fixed seeds, stochasticity in LLM outputs leads to variance across trials.
  • Dependence on External APIs: Many systems rely on closed, paid API endpoints (OpenAI-compatible), which can affect reproducibility and scale.
  • Language Coverage: Most toolkits and empirical studies focus on English; extending frameworks to support robust multilingual reformulation remains an open avenue (Bigdeli et al., 20 Nov 2025).
  • Output Quality Control: Hallucination and factual drift are mitigated using context-grounding, stepwise evidence, structured prompt design, and self-reflection loops, but remain open research concerns.
  • Search and Rule Space Explosion: Especially for SQL/KG optimization, methods such as MCTS (LITHE), guided cost descent (LaPuda), and curriculum-contrastive demonstration selection (LLM-R2) help prune the combinatorial search space, but scaling to very complex queries remains non-trivial (Dharwada et al., 18 Feb 2025, Sun et al., 2 Dec 2024, Wang et al., 20 Mar 2024, Li et al., 19 Apr 2024).

Open research directions include:

  • Dynamic or “on-the-fly” prompt generation conditioned on query/task features
  • End-to-end differentiable pipelines for joint optimization of LLM, retriever, and reranker
  • Learning query- or session-specific reward models for adaptive reformulation strategy selection
  • Integration of real user feedback in RL or agent-based learning loops for further alignment
  • Efficient caching, distillation, or query batching to reduce LLM computation overhead

7. Representative System: QueryGym Toolkit

QueryGym (Bigdeli et al., 20 Nov 2025) represents the unification of the above principles as an open-source, modular toolkit providing:

  • A Python API and CLI for batch and interactive LLM-based reformulation
  • A retrieval-agnostic backend via a BaseSearcher abstraction, supporting integration with any IR system
  • Centralized, versioned prompt management supporting flexible experimentation with prompting recipes
  • Benchmark alignment with MS MARCO and BEIR suite, supporting standard IR metrics for fair and repeatable evaluation
  • Extensible implementations of leading QR approaches and seamless method/back-end addition via standardized registry mechanisms

Example pipeline (Python):

from pyserini.search.lucene import LuceneSearcher
from querygym import create_reformulator, wrap_pyserini_searcher

# Retrieval backend: a prebuilt MS MARCO passage index accessed through Pyserini
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
wrapped = wrap_pyserini_searcher(searcher, answer_key="contents")

# CSQE reformulator backed by an OpenAI-compatible model
reform = create_reformulator(
    method_name="csqe",
    model="gpt-4",
    params={"searcher": wrapped, "retrieval_k": 10, "gen_passages": 5},
    llm_config={"temperature": 1.0, "max_tokens": 128},
)

queries = ["what causes seasonal allergies"]  # batch of input queries
results = reform.reformulate_batch(queries)

This encapsulates key aspects: retrieval-agnostic instantiation, prompt-driven reformulation, batch processing, and robust output logging.


LLM-based query reformulation has emerged as a principal innovation channel in IR and query optimization. Modular toolkits, ensemble and context-steered prompting, agentic and reward-based learning, and prompt management at scale combine to deliver measurable advances in retrieval metrics and query efficiency, while reproducibility and extensibility are facilitated by standardized APIs and versioned benchmarks (Bigdeli et al., 20 Nov 2025, Dhole et al., 27 May 2024, Dhole et al., 4 Apr 2024, Seo et al., 17 Sep 2024). Ongoing research continues to push boundaries in efficiency, contextual grounding, real-time adaptation, and domain generalization.
