- The paper demonstrates that excessive query reformulations can harm performance due to semantic drift, and introduces ReformIR to mitigate this effect.
- It presents a two-stage pipeline that employs LLM-generated queries and a bandit-based surrogate model to optimize reformulation selection under budget constraints.
- Experimental results reveal that ReformIR significantly enhances recall and efficiency, achieving up to 33.48% recall gains and running 3.3x–4.5x more efficiently than LLM-based reranking.
Motivation and Problem Definition
Modern retrieval pipelines increasingly employ query reformulation and neural reranking for improved recall and effectiveness. The proliferation of LLM-driven query expansion (GenQR, Query2Doc, QA-Expand, etc.) has shifted the focus towards multi-query generation in an attempt to circumvent the vocabulary mismatch inherent in sparse retrieval systems. However, generating numerous reformulations is not universally beneficial. While recall initially increases, indiscriminate reformulation and exhaustive reranking can introduce substantial query drift—degrading downstream ranking performance and amplifying computational costs. The challenge thus lies in budgeted, adaptive selection of reformulated queries and their candidates, mitigating semantic drift while maximizing recall.
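The paper introduces ReformIR, a budget-aware retrieval framework that treats reformulations as latent features, optimized online via ranker feedback. Its two-stage pipeline (LLM-based reformulation and candidate retrieval, followed by budgeted, surrogate-guided reranking) consists of the following components: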
- Reformulation Generation: Multiple reformulated queries (Q1, ..., Qm) are produced using LLMs or generative pipelines.
- Candidate Pool Construction: Each reformulation retrieves top-k documents, aggregated into a candidate pool C.
- Surrogate Model: Documents are scored via a feature vector comprising BM25 scores to each reformulation, the original query, and an RM3 pseudo-relevance feedback feature. This enables reformulation-level attribution within the model.
- Bandit-based Optimization: ReformIR implements online linear surrogate learning, treating each document as an "arm" in a multi-armed bandit, with reformulation features as predictors. The surrogate model prioritizes which documents are sent to the teacher reranker (MonoT5), which scores them against the original query, under a strict inference budget.
- Iterative Feedback and Drift Suppression: As reranker scores are acquired, surrogate weights are updated to adaptively upweight reformulations and documents closest to the query's intent, actively suppressing drift.
Figure 1: ReformIR illustration—query reformulations, candidate document retrieval, surrogate optimization, and ranker feedback, highlighting drift suppression via adaptive weights.
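The components above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `build_features` and `budgeted_rerank`, the batch size, the uniform prior, and the ridge-regression update are all hypothetical choices made here, and document selection is purely greedy (no exploration bonus), which is a simplification of a bandit strategy. Retrieval scores and teacher scores are toy numbers; a real feature vector would also carry columns for the original query and the RM3 feature.

```python
import numpy as np

def build_features(runs):
    """runs: one {doc_id: retrieval_score} dict per query variant (its top-k list)."""
    pool = sorted(set().union(*runs))  # candidate pool C = union of all top-k lists
    # One row per document, one column per query variant; 0.0 if not retrieved.
    feats = np.array([[run.get(d, 0.0) for run in runs] for d in pool])
    return pool, feats

def budgeted_rerank(feats, teacher_score, budget, batch_size=2, ridge=1.0):
    """Send at most `budget` documents to the teacher, guided by a linear surrogate.

    teacher_score: callable row_index -> float (stand-in for an expensive reranker).
    Returns ({row_index: teacher_score}, learned feature weights).
    """
    n_docs, n_feats = feats.shape
    prior = np.full(n_feats, 1.0 / n_feats)  # assumption: trust every variant equally at first
    weights = prior.copy()
    scored = {}
    budget = min(budget, n_docs)
    while len(scored) < budget:
        surrogate = feats @ weights
        # Greedy step: highest surrogate score first, skipping already-scored documents.
        ranked = [int(i) for i in np.argsort(-surrogate, kind="stable")
                  if int(i) not in scored]
        for i in ranked[: min(batch_size, budget - len(scored))]:
            scored[i] = float(teacher_score(i))
        # Ridge regression toward the prior, refit on all teacher-scored documents.
        idx = list(scored)
        X = feats[idx]
        residual = np.array([scored[i] for i in idx]) - X @ prior
        weights = prior + np.linalg.solve(
            X.T @ X + ridge * np.eye(n_feats), X.T @ residual
        )
    return scored, weights

# Toy data: column 0 is an on-topic reformulation, column 1 a drifted one.
runs = [
    {"d0": 3.0, "d1": 2.0, "d4": 1.0},
    {"d2": 4.0, "d3": 3.5, "d5": 1.5},
]
teacher = {"d0": 0.9, "d1": 0.8, "d2": 0.1, "d3": 0.0, "d4": 0.6, "d5": 0.05}

pool, feats = build_features(runs)
scored, weights = budgeted_rerank(feats, lambda i: teacher[pool[i]], budget=4)
print(sorted(pool[i] for i in scored))
print(np.round(weights, 3))
```

In this toy run the drifted reformulation initially looks strongest because its retrieval scores are highest, so its top two documents are scored first. Their low teacher scores pull its weight down, and the remaining budget goes to documents retrieved by the on-topic reformulation.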
Empirical Results and Analysis
Experimental evaluation on MS MARCO and the TREC Deep Learning benchmarks (DL19–DL22) yields several key findings:
- Recall Performance: ReformIR consistently improves Recall@100 relative to classical reformulation (e.g., GenQR, QA-Expand, Query2Doc). Notably, when the number of reformulations increases (n=25,50), classical GenQR exhibits severe degradation due to drift, often falling below baseline BM25+MonoT5, while ReformIR maintains stable and superior recall.
- Statistical Significance: Augmenting GenQR with ReformIR on DL21 yields up to 33.48% absolute recall gains over GenQR alone; similar patterns are observed across reformulator baselines.
- Drift Robustness: ReformIR suppresses drift through learned feature attribution and surrogate reweighting. Reformulations that stray semantically are rapidly downweighted, preventing spurious documents from dominating the candidate pool.
- Model Scale and Efficiency: ReformIR delivers robust improvements with open-source LLMs ranging from 0.5B to 32B parameters. Gains are consistent across scales, and smaller LLMs suffice in the upstream reformulation stage when coupled with ReformIR. ReformIR is 3.3x–4.5x more efficient than LLM-based reranking, incurring only marginal overhead over classical methods.
- Interpretability: Explicit reformulation weights show which expansions are effective and which are being suppressed, making the selection process transparent.
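For reference, the recall metric used in these comparisons can be computed as follows. This is a generic sketch of Recall@k as it is usually defined (the fraction of a query's relevant documents that appear in the top k results); the function name and the toy document IDs are illustrative and not taken from the paper's evaluation code.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=100):
    """Fraction of the relevant documents that appear in the top-k ranked results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0  # convention chosen here: no relevant documents means recall 0
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & relevant) / len(relevant)

# Toy example: three relevant documents, two of which are retrieved.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(recall_at_k(ranking, relevant, k=2))
print(recall_at_k(ranking, relevant, k=4))
```

Because the denominator is the number of relevant documents, adding reformulations can only raise this metric if the extra candidates that reach the top k are actually relevant, which is why drifted reformulations can lower it under a fixed budget.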
Strong and Contradictory Claims
- The results contradict the intuition that more reformulations invariably lead to better recall; excessive generation without adaptive feedback can severely harm performance through drift.
- The empirical analysis suggests that LLMs are better employed in upstream reformulation than in downstream reranking, especially under resource constraints.
- ReformIR obviates the need for exhaustive reranking or heuristic post-hoc fusion methods (e.g., RRF), offering sample-efficient adaptive optimization.
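To make the contrast with post-hoc fusion concrete, here is a minimal sketch of standard reciprocal rank fusion (RRF), in which each document receives the sum of 1 / (k + rank) over all ranked lists. The constant k = 60 is the value commonly used for RRF; the function name and the toy rankings are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Two on-topic lists and one drifted list ("x", "y" are off-topic documents).
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["x", "y", "b"],
]
print(reciprocal_rank_fusion(rankings))
```

Every list contributes with the same fixed weight here, so documents from a drifted reformulation are promoted as readily as those from an on-topic one. A feedback-driven surrogate instead adjusts per-reformulation weights as reranker scores arrive.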
Practical and Theoretical Implications
Practically, ReformIR acts as a training-free adapter compatible with any query reformulation pipeline, substantially improving recall and stability at negligible latency cost. Theoretically, formulating reformulation selection as a linear bandit optimization problem elevates it to an active learning paradigm, grounding retrieval in ongoing, feedback-driven relevance estimation.
The methodology prompts a shift in future retrieval system design: emphasizing lightweight, feedback-driven optimization and upstream LLM reformulation to achieve stable recall gains under strict budgets. Additionally, the interpretability afforded by feature attribution aligns with contemporary explainable IR desiderata, facilitating analysis of query drift mechanics.
Future Directions
Potential extensions include online adaptation of the reformulators themselves using ranker feedback, further exploration of sub-0.5B scale LLMs for rapid reformulation, and the integration of corpus-level feedback for real-time drift detection. The broader application of online surrogate models for other retrieval sub-tasks is also promising.
Conclusion
The paper rigorously demonstrates that naive scaling of query reformulations can be detrimental due to semantic drift, especially under budget constraints. ReformIR provides a principled, feedback-driven framework for adaptive reformulation selection and drift suppression, yielding substantial gains in recall, efficiency, and interpretability across diverse IR benchmarks and reformulation strategies (2605.00560).