LLM-Guided Query Refinement

Updated 4 July 2026

LLM-guided query refinement is a family of methods where large language models modify queries, representations, or generation procedures to enhance retrieval and structured querying.
Prompt-only rewriting uses a single-step modification based solely on the original query, with effects measured by metrics like nDCG@10 and Recall@10 across various domains.
Retriever-aware and interactive refinement methods integrate feedback signals such as token attributions, schema constraints, and execution feedback to improve query embeddings and structured query accuracy.

LLM-guided query refinement is a family of methods in which an LLM modifies a query, a query representation, or a query-generation procedure in order to improve downstream retrieval, structured querying, or search-agent behavior. Recent work spans prompt-only single-step rewriting for dense retrieval, retriever-aware reformulation using token attributions or retrieval feedback, interactive natural-language refinement of Cypher and SPARQL, reflective refinement for SQL, and domain-specific systems for multimodal search, e-commerce, satellite imagery, and multi-hop agents (Kotte, 2 Mar 2026, Garouani et al., 12 Feb 2026, Gera et al., 12 May 2026, Pusch et al., 5 Feb 2026, Mohr et al., 10 Jan 2026). Across these settings, the central issue is not whether an LLM can rewrite a query, but which signal constrains the rewrite: query text alone, retrieval context, model internals, execution feedback, or human feedback.

1. Scope and major paradigms

Recent literature treats query refinement as a broad design space rather than a single algorithmic pattern. In some systems the LLM rewrites the original text query directly; in others it edits a latent embedding, explains and amends a structured query, filters pseudo-relevance feedback evidence, or updates stage-level generation policies rather than the current query instance.

Paradigm	Core signal	Representative work
Prompt-only rewriting	Original query alone	(Kotte, 2 Mar 2026)
Retriever-aware refinement	Token attributions, top-$K$ judgments, PRF filtering, retrieval context	(Garouani et al., 12 Feb 2026, Gera et al., 12 May 2026, Otero et al., 16 Jan 2026, Bigdeli et al., 1 Apr 2026)
Interactive structured refinement	Schema, execution results, natural-language explanations, user amendments	(Pusch et al., 5 Feb 2026, Li et al., 2024, Jian et al., 3 Nov 2025)
Database query refinement	Typed stages, optimizer costs, skyline feedback, semantic verification	(Mohr et al., 10 Jan 2026, Hacohen et al., 17 Feb 2026, Dharwada et al., 18 Feb 2025)
Domain-specific and online adaptation	Process rewards, multimodal context, simulated feedback, threshold-free selection	(Wen et al., 8 Jan 2026, Nguyen et al., 29 Jan 2025, Arefeen et al., 6 May 2026, Zhang et al., 23 Sep 2025)

This taxonomy suggests that “LLM-guided query refinement” is best understood as a control problem over query transformations. Some methods operate entirely at inference time without retraining, whereas others use offline distillation, domain-adaptive post-training, or reinforcement learning. Some optimize lexical alignment to a corpus; others optimize semantic coverage of a schema or execution-level constraint satisfaction.

2. Prompt-only rewriting for dense retrieval

The most widely deployed variant is prompt-only, single-step query rewriting: a one-shot rewrite generated from the query alone, without retrieval feedback, click signals, or iterative refinement. A systematic study on three BEIR datasets (FiQA-2018, SciFact, and TREC-COVID), using MPNet and BGE in base and FiQA post-trained configurations, found strongly domain-dependent behavior (Kotte, 2 Mar 2026). Rewriting degrades mean nDCG@10 by $9.0\%$ on FiQA, improves it by $5.1\%$ on TREC-COVID, and has no significant effect on SciFact; Recall@10 mirrors the FiQA degradation at $-9.4\%$. Lexical substitution occurs in $95\%$ of rewrites, and simple feature-based gating reduces worst-case regressions but does not reliably outperform never rewriting.

The primary metric in that study is

$\mathrm{nDCG}@10=\frac{1}{Z}\sum_{i=1}^{10}\frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)}.$
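
To make the normalizer concrete, the formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming that $Z$ is the DCG@10 of the ideal (relevance-sorted) ranking, which is the usual convention; the function names are illustrative and not taken from the cited study.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(rels: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain with the exponential gain 2^rel - 1."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels[:k], start=1)
    )


def ndcg_at_k(
    ranked_rels: Sequence[float],
    all_rels: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked_rels: graded relevance of the returned documents, in rank order.
    all_rels: relevance of every judged document (defaults to ranked_rels).
    Z is assumed to be the DCG@k of the ideal ordering.
    """
    pool = ranked_rels if all_rels is None else all_rels
    z = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / z if z > 0 else 0.0
```

Passing `all_rels` matters when relevant documents are missing from the top 10: without it, the ideal ranking is built only from what was retrieved and the score is inflated.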

The paper’s central mechanism is lexical alignment. It defines the Vocabulary Overlap Ratio

$\mathrm{VOR}(q,D_q)=\frac{\left|\mathcal{W}(q)\cap \mathcal{W}(D_q)\right|}{\left|\mathcal{W}(q)\right|},$

where $\mathcal{W}(\cdot)$ is the set of unique, lowercased whitespace-tokenized unigrams and $\mathcal{W}(D_q)=\bigcup_{d\in D_q}\mathcal{W}(d)$. On FiQA, rewriting significantly lowers VOR, by $\Delta \mathrm{VOR}=-0.014$, which coincides with vocabulary drift away from stable finance jargon. On TREC-COVID, gains arise from nomenclature standardization that type-level overlap does not capture well; there the Corpus Term Frequency ratio highlights large shifts toward corpus-preferred terminology, such as normalizing “2019-nCoV” to “COVID-19.”
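
The VOR definition translates almost directly into code. The sketch below follows the stated tokenization (unique, lowercased, whitespace-split unigrams); the function names and the example strings are hypothetical, and the document set passed in stands for $D_q$.

```python
from typing import Iterable, Set


def unigrams(text: str) -> Set[str]:
    """Unique, lowercased, whitespace-tokenized unigrams."""
    return set(text.lower().split())


def vocabulary_overlap_ratio(query: str, docs: Iterable[str]) -> float:
    """Fraction of query terms that also occur somewhere in the document set."""
    q_terms = unigrams(query)
    if not q_terms:
        return 0.0
    doc_terms: Set[str] = set()
    for doc in docs:
        doc_terms |= unigrams(doc)
    return len(q_terms & doc_terms) / len(q_terms)


def delta_vor(original: str, rewrite: str, docs: Iterable[str]) -> float:
    """VOR(rewrite) - VOR(original); negative values indicate vocabulary drift."""
    docs = list(docs)
    return vocabulary_overlap_ratio(rewrite, docs) - vocabulary_overlap_ratio(original, docs)


# Hypothetical example: the rewrite swaps stable jargon for paraphrases.
docs = ["Roth IRA contribution limits for 2020"]
drift = delta_vor("roth ira limits", "retirement account contribution caps", docs)
```

In this toy case every original term appears in the document while only one of the four rewritten terms does, so `drift` is negative, which is the direction of change reported for FiQA.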

A notable result is that substitution itself is not predictive: because substitution is nearly universal, effectiveness depends on substitution direction rather than substitution occurrence. FiQA failures include terminology drift, context injection, and over-specification; TREC-COVID improvements come from terminology harmonization and domain-term expansion. The same study also reports that minimal prompts that preserve technical terms, named entities, numbers, and acronyms reduce FiQA harm rates relative to aggressive expansion, and that low-temperature decoding does not eliminate degradation. The practical conclusion is correspondingly narrow: prompt-only rewriting may help in nomenclature-unstable domains, but in well-optimized verticals with stable jargon, domain-adaptive post-training is the safer intervention.

3. Retriever-aware refinement

A second research line replaces query-only prompting with retriever-aware signals. In attribution-guided query rewriting, the retriever first retrieves the top-$K$ documents, then Integrated Gradients are computed for each query token and aggregated across the top-$K$ set, yielding per-token attribution scores. These scores are passed to an LLM as soft guidance: high-attribution tokens should be preserved or emphasized, while low or negative attribution tokens should be clarified or disambiguated. On SciFact, FiQA-2018, and NFCorpus, with SPLADE and TCT-ColBERT, this method improves nDCG@10 over both the original queries and LLM-only rewriting for each retriever. The hard-threshold baseline that keeps only top-attribution tokens performs worst, indicating that low-attribution tokens may still encode essential context (Garouani et al., 12 Feb 2026).
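
The soft-guidance step can be illustrated with a small sketch. It assumes the per-token attributions have already been computed for each top-$K$ document, uses the mean as the aggregator (the source does not say which aggregation is used), and builds a prompt whose wording is purely hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def aggregate_attributions(
    tokens: Sequence[str],
    per_doc_scores: Sequence[Sequence[float]],
) -> List[Tuple[str, float]]:
    """Average each query token's attribution over the top-K documents.

    per_doc_scores[d][i] is the attribution of tokens[i] with respect to
    document d. Mean aggregation is an assumption of this sketch.
    """
    return [
        (tok, mean(doc[i] for doc in per_doc_scores))
        for i, tok in enumerate(tokens)
    ]


def build_guided_prompt(query: str, scored: Sequence[Tuple[str, float]]) -> str:
    """Turn attribution scores into soft guidance for a rewriting LLM."""
    lines = [
        "Rewrite the search query below.",
        f"Query: {query}",
        "Token attributions (higher means more helpful to the retriever):",
    ]
    lines += [f"- {tok}: {score:+.2f}" for tok, score in scored]
    lines.append(
        "Preserve or emphasize high-attribution tokens. Clarify or disambiguate "
        "low or negative ones instead of deleting them."
    )
    return "\n".join(lines)
```

The prompt asks the model to clarify weak tokens rather than delete them, in line with the reported finding that hard thresholding on attribution performs worst.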

Test-time embedding refinement goes further by modifying the query embedding rather than the surface form. A teacher LLM scores the top-$K$ retrieved documents with yes/no relevance judgments, those scores are converted into a softmax distribution, and the query embedding is optimized so that the embedder's distribution over the same documents matches the teacher distribution.

Optimized with Adam, this produces consistent MAP gains across literature search, intent detection, key-point matching, and instruction-following retrieval. The paper emphasizes that rerank-only baselines help less because they affect only the initial top-$K$, whereas refined embeddings alter the full-corpus ranking and improve recall beyond that initial set (Gera et al., 12 May 2026).
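
The distribution-matching idea can be sketched in pure Python. Several choices here are assumptions rather than the paper's settings: the loss is taken to be KL(teacher || embedder), the embedder distribution is a softmax over dot products, plain gradient descent stands in for Adam, and the step count, learning rate, and temperature are arbitrary.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def refine_query_embedding(
    query: Vector,
    docs: Sequence[Vector],
    teacher_scores: Sequence[float],
    steps: int = 50,
    lr: float = 0.5,
    temperature: float = 1.0,
) -> List[float]:
    """Move the query embedding so its softmax over docs matches the teacher's."""
    target = softmax(teacher_scores)
    q = list(query)
    for _ in range(steps):
        pred = softmax([dot(q, d) for d in docs], temperature)
        # d KL / d q = sum_i (pred_i - target_i) * d_i / temperature
        grad = [
            sum((p - t) * d[j] for p, t, d in zip(pred, target, docs)) / temperature
            for j in range(len(q))
        ]
        q = [qj - lr * gj for qj, gj in zip(q, grad)]
    return q
```

Because only the query vector changes, the refined embedding can be reused for a fresh search over the whole corpus, which is what distinguishes this from reranking the initial candidates.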

Other retrieval-grounded formulations use the LLM primarily as a judge. LLM-assisted pseudo-relevance feedback inserts a binary or probability-weighted LLM filter before RM3 estimation, preserving corpus-grounded expansion vocabulary while reducing topic drift from noisy top-$K$ documents; including narrative relevance instructions substantially improves AP@1000 and NDCG@100 relative to title-only prompting (Otero et al., 16 Jan 2026). Automatic in-domain exemplar construction and multi-LLM refinement of query expansions similarly keep the expansion anchored in target-corpus terminology by harvesting pseudo-relevant passages with a BM25-MonoT5 pipeline, selecting diverse medoid exemplars, and refining two heterogeneous LLM expansions into one coherent expansion; on TREC DL20, DBPedia, and SciFact, the refined ensemble consistently improves over BM25, Rocchio, zero-shot, and fixed few-shot baselines (Li et al., 9 Feb 2026).
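
A simplified sketch of placing a relevance filter in front of an RM3-style estimate is given below. It is not the cited implementation: the `judge` callable is a stand-in for an LLM call, document weights are taken as retrieval score times judged relevance, and term probabilities use raw term frequencies without smoothing.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def filtered_rm3(
    query: str,
    feedback_docs: List[Tuple[str, float]],
    judge: Callable[[str, str], float],
    orig_weight: float = 0.5,
    num_terms: int = 10,
) -> Dict[str, float]:
    """Expansion term weights from judged feedback documents.

    feedback_docs: (text, retrieval_score) pairs from the first-pass ranking.
    judge: returns a relevance weight in [0, 1]; 0/1 gives a binary filter,
           fractional values give a probability-weighted one.
    """
    feedback_model: Counter = Counter()
    for text, score in feedback_docs:
        weight = score * judge(query, text)
        if weight <= 0:
            continue  # rejected documents contribute no expansion terms
        tf = Counter(text.lower().split())
        length = sum(tf.values())
        for term, count in tf.items():
            feedback_model[term] += weight * count / length

    top = dict(feedback_model.most_common(num_terms))
    norm = sum(top.values()) or 1.0
    model = {t: (1 - orig_weight) * v / norm for t, v in top.items()}

    # Interpolate with the original query model, as RM3 does.
    q_tf = Counter(query.lower().split())
    q_len = sum(q_tf.values())
    for term, count in q_tf.items():
        model[term] = model.get(term, 0.0) + orig_weight * count / q_len
    return model
```

Since the expansion terms still come from retrieved documents, the vocabulary stays corpus-grounded; the judge only decides which documents are allowed to contribute.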

A related direction makes the reformulation policy explicit rather than latent. ReFormeR first elicits short reformulation patterns from pairs of initial queries and stronger reformulations, consolidates them into a compact library, selects a pattern from retrieval context, and then applies a controlled rewrite. The resulting pattern set includes Clarify Intent, Clarify Subject, Conceptual Shift, Contextual Expansion, Contextual Restriction, Generalization, Location Specification, Purpose Specification, Semantic Clarification, and Temporal Adjustment, and the method reports consistent improvements on TREC DL 2019, DL 2020, and DL Hard over classical feedback and recent LLM-based reformulation and expansion baselines (Bigdeli et al., 1 Apr 2026). This suggests a broader shift from free-form rewriting toward policy-constrained refinement.

4. Interactive structured refinement

For knowledge-graph question answering, the problem is not merely to rewrite text but to translate natural-language intent into a grounded structured query while preserving auditability. One framework places the LLM in four roles (Generator, Explainer, Amender, and optionally Plausibility Checker) around a Neo4j backend. The workflow is iterative: generate Cypher from a natural-language question under schema constraints, execute it, explain it step by step, collect user feedback such as “wrong relationship direction” or “use SoftwarePackage, not Publication,” amend the query in place, and repeat. On a 90-query synthetic movie KG benchmark, the evaluation covers one-sentence summary accuracy, which varies across models, fault detection on perturbed queries, and false-positive avoidance. Real-life experiments on the MaRDI KG and a Hyena KG show that some models solve all tasks, but expert-written, niche concepts require more amendments and expose larger gaps among models (Pusch et al., 5 Feb 2026).
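
The generate, execute, explain, amend cycle can be expressed as a small control loop. The sketch below is framework-agnostic: every callable is a hypothetical placeholder for an LLM role, a database call, or a user prompt, and none of the names come from the cited system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class RefinementTrace:
    query: str                      # final query text
    rounds: int                     # number of execute/explain rounds used
    history: List[str] = field(default_factory=list)


def refine_interactively(
    question: str,
    generate: Callable[[str], str],                          # Generator role
    execute: Callable[[str], Any],                           # database backend
    explain: Callable[[str], str],                           # Explainer role
    get_feedback: Callable[[str, str, str, Any], Optional[str]],
    amend: Callable[[str, str], str],                        # Amender role
    max_rounds: int = 5,
) -> RefinementTrace:
    """Iterate until the user accepts the query or max_rounds is reached."""
    query = generate(question)
    history = [query]
    for round_no in range(1, max_rounds + 1):
        result = execute(query)
        explanation = explain(query)
        feedback = get_feedback(question, query, explanation, result)
        if feedback is None:        # None means the user accepts the query
            return RefinementTrace(query, round_no, history)
        query = amend(query, feedback)
        history.append(query)
    return RefinementTrace(query, max_rounds, history)
```

Keeping the full query history makes each amendment auditable, which is the property these interactive systems emphasize.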

LinkQ implements a related but more visibly grounded protocol for SPARQL over Wikidata. The LLM does not answer directly; it interprets the user’s question, relies on system-side fuzzy search and graph traversal to resolve entity and property identifiers, generates a candidate SPARQL query only from verified IDs, and presents a preview through a query editor, an Entity-Relation Table, and a Query Graph before execution. A qualitative study with five KG practitioners reports that the interface is effective for targeted and exploratory analysis, and also reports the time taken by the complete natural-language-to-query conversion (Li et al., 2024).

InteracSPARQL makes the explanatory layer itself central to refinement. It first parses SPARQL into an abstract syntax tree, generates deterministic natural-language explanations from the AST by rules, then uses an LLM to polish these explanations without changing their structure. Refinement proceeds by executing the query, comparing it to the user’s question, gathering human or LLM feedback, grounding uncertain entities and properties with external search tools, and applying minimal AST transformations. On QALD-10 and QALD-9, self-refinement substantially improves F1 over raw one-shot generation, for example for GPT-4o on QALD-10; with ground-truth natural-language explanations, the F1 reached depends on model and benchmark (Jian et al., 3 Nov 2025). The common implication across these systems is that explanations are not merely pedagogical; they function as editable intermediate representations for query repair.

5. Database query refinement

In SQL settings, LLM-guided query refinement bifurcates into semantic correction and performance-oriented rewriting. Reflective reasoning for SQL generation decomposes text-to-SQL into typed stages (schema selection, value grounding, predicate extraction, aggregation semantics, semantic planning, and SQL realization) and applies feedback as persistent updates to stage-specific generation policies rather than to the current SQL instance. Evaluation combines interpreter-based checks with LLM-based semantic coverage verification, and the refinement loop localizes violations to a responsible stage before selectively restarting downstream components. The system is evaluated by execution accuracy on Spider, including a GPT-5 variant, and by EX and VES on BIRD, with most gains occurring in the earlier refinement iterations (Mohr et al., 10 Jan 2026).

OmniTune addresses a different problem: minimally modifying query predicates so that a refined SQL query satisfies arbitrary output constraints. It scores each candidate refinement by a deviation function, which measures how far the query output is from satisfying the constraints, and by a distance from the original query, and it searches for refinements that keep both small.

Its two-step OPRO procedure first asks an LLM to propose a promising refinement subspace and then to sample concrete assignments within that subspace, with both stages guided by skyline summaries of previously evaluated candidates and a concise history rather than raw logs. On 32 benchmark instances across Top-k, Range, Diversity, and Complex tasks, OmniTune attains higher success and optimality than direct LLM baselines, which sometimes fall below random sampling (Hacohen et al., 17 Feb 2026).
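
A skyline summary can be read as the Pareto front of the candidates tried so far. The sketch below assumes each candidate is reduced to a (deviation, distance) pair with both values minimized; this pairing is an assumption for illustration, not the paper's exact data structure.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (deviation, distance), both to be minimized


def skyline(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing deviation.

    A point is dominated if another point is no worse on both coordinates
    and strictly better on at least one.
    """
    front: List[Point] = []
    best_distance = float("inf")
    for dev, dist in sorted(set(points)):
        # Every earlier point has deviation <= dev, so this point survives
        # only if it strictly improves on the best distance seen so far.
        if dist < best_distance:
            front.append((dev, dist))
            best_distance = dist
    return front


candidates = [(0.0, 5.0), (1.0, 2.0), (2.0, 3.0), (3.0, 0.5), (1.0, 4.0)]
summary = skyline(candidates)  # [(0.0, 5.0), (1.0, 2.0), (3.0, 0.5)]
```

Feeding the LLM only this front, rather than every evaluated candidate, keeps the prompt short while still showing the trade-off between satisfying the constraints and staying close to the original query.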

Performance-oriented rewriting is represented by LITHE, which uses prompt ensembles, database-sensitive prompts, selectivity-based rules, and token-probability-guided Monte Carlo Tree Search to produce lean, semantically equivalent SQL rewrites. Candidate rewrites are syntax-checked by the DB parser, ranked by optimizer cost, and semantically verified by logic-based or sampling-based tests. On TPC-DS, LITHE finds productive rewrites for all feasible queries and reports a higher geometric mean speedup than a state-of-the-art baseline; on PostgreSQL, its geometric mean runtime speedup for slow queries over the native optimizer likewise exceeds that of the compared state of the art (Dharwada et al., 18 Feb 2025). Although this line targets database performance rather than retrieval effectiveness, it broadens the meaning of query refinement from relevance optimization to semantic-preserving operational improvement.

6. Specialized domains, online adaptation, and open problems

Several systems show that refinement behavior changes substantially once the query becomes multimodal, personalized, or embedded in a multi-step agent. Open-SAT addresses open-vocabulary object retrieval in satellite imagery by refining text embeddings with LLM-generated surrounding objects and a threshold-free decision rule that compares object-of-interest similarity against surrounding-object similarities. Using Remote-CLIP and GPT-4o-generated surrounding objects, it improves F1 while retrieving a comparable number of image tiles, and recall on EuroSAT also increases (Arefeen et al., 6 May 2026). In personalized product search, HMPPS uses perspective-guided summarization to compress product descriptions around query-relevant perspectives and a two-stage query-aware history filtering scheme based on multimodal representations. On Office Products, perspective-guided summarization reduces the average per-product word count, and in online deployment the system yields gains in query-CTR and efficient click count (Zhang et al., 23 Sep 2025).
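
One way to read the threshold-free rule is that a tile is kept when it is more similar to the object of interest than to every surrounding object. The sketch below implements that reading with cosine similarity over plain Python lists; the exact comparison used by Open-SAT may differ, and the embeddings here are toy vectors rather than Remote-CLIP outputs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def is_match(tile: Vector, target: Vector, surroundings: Sequence[Vector]) -> bool:
    """Keep the tile if the target object beats every surrounding object.

    No similarity threshold is tuned: the surrounding-object embeddings act
    as per-tile reference points (an assumed reading of the decision rule).
    """
    target_sim = cosine(tile, target)
    return all(target_sim > cosine(tile, s) for s in surroundings)
```

The appeal of a relative comparison is that it avoids choosing a global similarity cutoff, which tends to transfer poorly between datasets and object classes.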

Search agents introduce another axis: refinement must improve not just one retrieval step, but an evolving reasoning trajectory. SmartSearch attaches process rewards to intermediate search queries through dual-level credit assessment, combining novelty and usefulness, then selectively refines low-quality queries and regenerates later search rounds. With a three-stage curriculum from imitation to alignment to generalization, it improves over prior search-agent baselines on 2WikiMultihopQA, HotpotQA, Bamboogle, MuSiQue, GAIA, and WebWalker; across the four knowledge-intensive datasets, its average EM/F1 exceeds that of the runner-up, StepSearch (Wen et al., 8 Jan 2026). In e-commerce search, a complementary approach distills an offline teacher LLM into a lightweight student rewrite model and then adapts it online with DPO using simulated LLM feedback on relevance, diversity, clicks, add-to-cart, and purchase. On Amazon ESCI, the distilled student improves relevance, diversity, and all three simulated engagement measures after online refinement (Nguyen et al., 29 Jan 2025).

Human-in-the-loop query generation remains important in domains where concept drift and multilingual retrieval make fully automatic reformulation unstable. The Query Generation Assistant couples docT5query, FlanT5, multilingual ColBERT-x retrieval, Reciprocal Rank Fusion, offline English translation, and event-span annotations in a Gradio interface. Users can edit generated queries directly or promote retrieved documents into the prompt as new exemplars through checkboxes, which turns relevance feedback into few-shot supervision for subsequent query generation (Dhole et al., 2023). This suggests that in difficult search settings, refinement is often as much an interface design problem as a model design problem.
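
Of the components listed above, Reciprocal Rank Fusion is the simplest to show in code. The sketch below is the standard formulation, in which each document scores the sum of 1 / (k + rank) over the input rankings; the constant k = 60 is the conventional default and not a setting confirmed for the Query Generation Assistant, and how that system chooses the rankings to fuse is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists, e.g. from three different generated queries.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Because RRF uses only ranks, it can combine result lists whose raw scores are not comparable, such as those produced by different queries or different retrievers.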

Across these strands, several limits recur. Prompt-only rewriting can harm retrieval in stable-jargon domains; retriever-aware methods depend on the quality of initial top-$K$ evidence; structured-query systems depend on schema clarity and robust grounding; SQL frameworks that edit only predicate constants do not solve structural query repair; and online adaptation raises latency, cost, and stability concerns (Kotte, 2 Mar 2026, Garouani et al., 12 Feb 2026, Pusch et al., 5 Feb 2026, Hacohen et al., 17 Feb 2026). A plausible implication is that the field is moving away from unconstrained rewriting toward refinement regimes with explicit control signals: lexical-alignment diagnostics, token attributions, skyline summaries, schema-grounded explanations, process rewards, and domain-specific context models.