When More Reformulations Hurt: Avoiding Drift using Ranker Feedback

Published 1 May 2026 in cs.IR | (2605.00560v1)

Abstract: Modern retrieval pipelines increasingly rely on query reformulation and neural reranking to improve effectiveness, but this comes at a significant computational cost and introduces a fundamental tradeoff between recall and query drift. Generating many reformulated queries can substantially increase recall, yet naively merging or exhaustively reranking their results is prohibitively expensive. In this work, we argue that the core challenge is not reformulation generation itself, but the adaptive selection of reformulations and their retrieved documents under a strict inference budget. We propose ReformIR, a budget-aware retrieval framework that treats query reformulations as first-class features and performs online relevance estimation using a strong reranker as a teacher. Given multiple reformulated queries, ReformIR constructs a large candidate pool and learns a lightweight surrogate model that estimates document utility from reformulation-specific retrieval signals. Under a fixed reranking budget, the surrogate adaptively prioritizes both reformulations and documents, selectively querying a teacher reranker anchored to the original query. This process increases recall while actively suppressing drift through online feature selection over reformulations. We conduct extensive experiments on the MSMARCO passage corpora and TREC Deep Learning benchmarks (DL19-DL22). Our results show that ReformIR consistently outperforms existing reformulation strategies, particularly as the number of reformulations increases, where prior methods suffer from severe quality degradation due to drift. Our findings also suggest a shift in retrieval system design, rather than using LLMs as rerankers, their capacity is more effectively leveraged in the reformulation stage with feedback-driven optimization.


Authors (3)

Summary

The paper demonstrates that excessive query reformulations can harm performance due to semantic drift, and introduces ReformIR to mitigate this effect.
It presents a two-stage pipeline that employs LLM-generated queries and a bandit-based surrogate model to optimize reformulation selection under budget constraints.
Experimental results show that ReformIR improves both recall and efficiency, with recall gains of up to 33.48% and 3.3x–4.5x better efficiency than LLM-based reranking.

Adaptive Query Reformulation Under Budget: ReformIR and Drift Mitigation

Motivation and Problem Definition

Modern retrieval pipelines increasingly employ query reformulation and neural reranking for improved recall and effectiveness. The proliferation of LLM-driven query expansion (GenQR, Query2Doc, QA-Expand, etc.) has shifted the focus towards multi-query generation in an attempt to circumvent the vocabulary mismatch inherent in sparse retrieval systems. However, generating numerous reformulations is not universally beneficial. While recall initially increases, indiscriminate reformulation and exhaustive reranking can introduce substantial query drift—degrading downstream ranking performance and amplifying computational costs. The challenge thus lies in budgeted, adaptive selection of reformulated queries and their candidates, mitigating semantic drift while maximizing recall.

ReformIR Framework: Methodological Advances

The paper introduces ReformIR, a budget-aware retrieval framework that treats reformulations as first-class features whose weights are optimized online via ranker feedback. The pipeline consists of the following components:

Reformulation Generation: Multiple reformulated queries ($Q_1, \dots, Q_m$) are produced using LLMs or generative pipelines.
Candidate Pool Construction: Each reformulation retrieves top-k documents, aggregated into a candidate pool $\mathcal{C}$.
Surrogate Model: Each document is represented by a feature vector comprising its BM25 score against each reformulation, its BM25 score against the original query, and an RM3 pseudo-relevance feedback feature. Because every reformulation has its own feature, the model can attribute relevance to individual reformulations.
Bandit-based Optimization: ReformIR implements online linear surrogate learning, treating each document as an "arm" in a multi-armed bandit, with reformulation features as predictors. Under a strict inference budget, the surrogate decides which documents are sent to the teacher reranker (MonoT5), which scores them against the original query.
Iterative Feedback and Drift Suppression: As reranker scores are acquired, surrogate weights are updated to adaptively upweight reformulations and documents closest to the query's intent, actively suppressing drift.
Figure 1: ReformIR illustration—query reformulations, candidate document retrieval, surrogate optimization, and ranker feedback, highlighting drift suppression via adaptive weights.
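
The feedback loop described in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `budgeted_rerank`, the ridge-regression surrogate, the purely greedy batch selection, and the synthetic features are all assumptions made for the example, and `teacher_score` is a hypothetical stand-in for a MonoT5 call anchored to the original query.

```python
import numpy as np

def budgeted_rerank(features, teacher_score, budget, batch_size=10, ridge=1.0):
    """Greedy budgeted selection with an online linear surrogate (illustrative).

    features      : (n_docs, n_features) array, one column per retrieval signal
                    (e.g. a score against the original query or one reformulation).
    teacher_score : hypothetical callable, doc index -> teacher relevance score.
    budget        : maximum number of teacher calls.
    Returns (scored, weights): teacher scores by doc index, and the final weights.
    """
    n_docs, n_feats = features.shape
    weights = np.full(n_feats, 1.0 / n_feats)   # initially trust every signal equally
    seen = np.zeros(n_docs, dtype=bool)
    scored = {}
    limit = min(budget, n_docs)
    while len(scored) < limit:
        estimates = features @ weights           # surrogate utility per document
        estimates[seen] = -np.inf                # never re-score a document
        take = min(batch_size, limit - len(scored))
        batch = np.argsort(-estimates)[:take]    # most promising unseen documents
        for i in batch:
            scored[int(i)] = teacher_score(int(i))
            seen[i] = True
        idx = np.fromiter(scored, dtype=int)
        X = features[idx]
        y = np.array([scored[int(i)] for i in idx])
        # Ridge-regression refit of the surrogate on all teacher feedback so far.
        weights = np.linalg.solve(X.T @ X + ridge * np.eye(n_feats), X.T @ y)
    return scored, weights

# Synthetic demo (made-up data): two on-topic signals and one drifted signal.
rng = np.random.default_rng(0)
relevance = rng.random(200)                      # hidden "true" relevance
features = np.column_stack([
    relevance + rng.normal(0, 0.05, 200),        # original query
    relevance + rng.normal(0, 0.05, 200),        # faithful reformulation
    rng.random(200),                             # drifted reformulation (unrelated)
])
scored, weights = budgeted_rerank(features, lambda i: float(relevance[i]), budget=50)
print(len(scored), np.round(weights, 2))
```

In this synthetic setup the fitted weight on the drifted column should come out well below the other two, which is the downweighting behavior the list describes. A real bandit formulation would typically add an exploration term to the selection step; this sketch omits it for brevity.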

Empirical Results and Analysis

Experimental evaluation on MSMARCO and TREC Deep Learning benchmarks (DL19–DL22) demonstrates several critical findings:

Recall Performance: ReformIR consistently improves Recall@100 relative to classical reformulation (e.g., GenQR, QA-Expand, Query2Doc). Notably, when the number of reformulations increases ($n = 25, 50$), classical GenQR exhibits severe degradation due to drift, often falling below baseline BM25+MonoT5, while ReformIR maintains stable and superior recall.
Magnitude of Gains: Augmenting GenQR with ReformIR on DL21 yields absolute recall gains of up to 33.48% over GenQR alone; similar patterns are observed across reformulator baselines.
Drift Robustness: ReformIR suppresses drift through learned feature attribution and surrogate reweighting. Reformulations that stray semantically are rapidly downweighted, preventing spurious documents from dominating the candidate pool.
Model Scale and Efficiency: ReformIR delivers robust improvements with open-source LLMs ranging from 0.5B–32B parameters. Gains are consistent, and leveraging smaller LLMs upstream in the reformulation stage suffices when coupled with ReformIR. ReformIR is 3.3x–4.5x more efficient than LLM-based reranking, incurring only marginal overhead over classical methods.
Interpretability: Explicit reformulation weights show which expansions are effective, making the selection process transparent to downstream analysis.
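
For readers less familiar with the metric, the sketch below shows how Recall@k is computed and how an absolute gain differs from a relative one. The function name and the toy document IDs are made up for illustration and are not data from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant documents that appear in the top-k of a ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}                            # toy judgments
baseline = recall_at_k(["d1", "d7", "d3", "d9"], relevant, k=4)
improved = recall_at_k(["d1", "d2", "d3", "d9"], relevant, k=4)

absolute_gain = improved - baseline                # difference in recall points
relative_gain = (improved - baseline) / baseline   # change as a fraction of baseline
print(baseline, improved, absolute_gain, relative_gain)
# → 0.5 0.75 0.25 0.5
```

Here the same improvement reads as 25 points in absolute terms and 50% in relative terms, which is why it matters that the reported figure is described as an absolute gain.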

Strong and Counterintuitive Claims

The results contradict the intuition that more reformulations invariably improve recall: generating many reformulations without adaptive feedback can severely harm performance through drift.
The empirical analysis suggests that LLMs are better employed in upstream reformulation than in downstream reranking, especially under resource constraints.
ReformIR obviates the need for exhaustive reranking or heuristic post-hoc fusion methods (e.g., RRF), offering sample-efficient adaptive optimization.
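
For contrast, here is a minimal sketch of Reciprocal Rank Fusion, the kind of post-hoc heuristic mentioned above. It uses the standard formula, in which each list contributes 1 / (k + rank) per document, with the commonly used constant k = 60. The toy rankings are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; every list gets an equal vote."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

rankings = [
    ["a", "b", "c"],   # original query
    ["b", "a", "d"],   # faithful reformulation
    ["x", "y", "z"],   # drifted reformulation
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
# → ['a', 'b', 'x', 'y', 'c', 'd', 'z']
```

Because RRF has no notion of which list is trustworthy, the drifted documents `x` and `y` outrank the on-topic documents `c` and `d`. A feedback-driven weighting over reformulations is meant to avoid exactly this equal-vote behavior.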

Practical and Theoretical Implications

Practically, ReformIR acts as a training-free adapter compatible with any query reformulation pipeline, substantially improving recall and stability at negligible latency cost. Theoretically, formulating reformulation selection as a linear bandit optimization problem recasts it as an active learning task, grounding retrieval in ongoing, feedback-driven relevance estimation.
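
One way to write down such a linear formulation is given here as a sketch under stated assumptions. The symbols $\mathbf{w}$, $S_t$, $y_d$, $\lambda$, and $B$ are introduced for illustration, and the ridge update is an assumed choice, not a rule confirmed by this summary. Here $q$ is the original query, $y_d$ is the teacher score for document $d$, $S_t$ is the set of documents scored so far, and $B$ is the reranking budget.

$$
\mathbf{x}_d = \big[\mathrm{BM25}(q, d),\ \mathrm{BM25}(Q_1, d),\ \dots,\ \mathrm{BM25}(Q_m, d),\ \mathrm{RM3}(q, d)\big], \qquad \hat{s}(d) = \mathbf{w}^\top \mathbf{x}_d
$$

$$
\mathbf{w}_t = \Big(\sum_{d \in S_t} \mathbf{x}_d \mathbf{x}_d^\top + \lambda I\Big)^{-1} \sum_{d \in S_t} y_d\, \mathbf{x}_d, \qquad |S_t| \le B
$$

$$
d_{t+1} = \arg\max_{d \in \mathcal{C} \setminus S_t} \mathbf{w}_t^\top \mathbf{x}_d
$$

Under this reading, a reformulation $Q_j$ that drifts receives a small weight in $\mathbf{w}_t$ once teacher scores show that its BM25 feature does not predict relevance to $q$.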

The methodology prompts a shift in future retrieval system design: emphasizing lightweight, feedback-driven optimization and upstream LLM reformulation to achieve stable recall gains under strict budgets. Additionally, the interpretability afforded by feature attribution aligns with contemporary explainable IR desiderata, facilitating analysis of query drift mechanics.

Future Directions

Potential extensions include online adaptation of the reformulators themselves using ranker feedback, further exploration of sub-0.5B scale LLMs for rapid reformulation, and the integration of corpus-level feedback for real-time drift detection. The broader application of online surrogate models for other retrieval sub-tasks is also promising.

Conclusion

The paper rigorously demonstrates that naive scaling of query reformulations can be detrimental due to semantic drift, especially under budget constraints. ReformIR provides a principled, feedback-driven framework for adaptive reformulation selection and drift suppression, yielding substantial gains in recall, efficiency, and interpretability across diverse IR benchmarks and reformulation strategies (2605.00560).
