A Reproducibility Study of LLM-Based Query Reformulation

Published 30 Apr 2026 in cs.IR and cs.CL | (2604.27421v1)

Abstract: LLMs are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (https://leaderboard.querygym.com).

Authors (7)

Summary

The paper’s main contribution is a controlled evaluation of 10 LLM-based query reformulation methods, highlighting reproducibility gaps across retrieval paradigms and datasets.
Methodologically, it standardizes experimental protocols using QueryGym to compare keyword-level, document-level, and corpus-grounded strategies with metrics like nDCG@10.
Results indicate that reformulation benefits are highly contingent on dataset, retrieval method, and LLM configuration, emphasizing significant interaction effects and stability challenges.

Authoritative Summary: "A Reproducibility Study of LLM-Based Query Reformulation" (2604.27421)

Introduction and Motivation

The paper systematically investigates LLM-based query reformulation within Information Retrieval (IR), focusing on reproducibility and comparative effectiveness across heterogeneous evaluation configurations. LLMs are increasingly deployed for query reformulation and expansion, with the promise of improved retrieval effectiveness. However, prior studies lacked consistent experimental protocols and confounded variables such as LLM backbone, parameter scale, retriever paradigm, and dataset choice, which makes it difficult to assess which gains are reproducible or generalizable.

This work introduces a controlled evaluation of ten representative reformulation methods using a unified toolkit, QueryGym, across two LLM families (GPT-4.1 and Qwen2.5) at two parameter scales, three retrieval paradigms (BM25, SPLADE, BGE), and nine benchmark datasets (TREC Deep Learning and BEIR). The study holds decoding, prompting, and retrieval configurations constant, enabling head-to-head comparisons disentangled from hidden implementation variance.

Methodological Coverage and Evaluation Setup

The reformulation strategies evaluated span three major methodological families:

Keyword-Level Expansion: Methods (GenQR, GenQREnsemble, Q2K) prompt LLMs for additional keywords or phrases to augment the query, typically zero-shot with no corpus feedback.
Document-Level Expansion: Approaches (Query2Doc zero-shot (ZS), few-shot (FS), chain-of-thought (CoT), QA-Expand, MUGI) synthesize pseudo-documents, answer passages, or concatenations of generated sub-answers, enriching the semantic query signal.
Corpus-Grounded Expansion: Techniques (CSQE, LameR) anchor expansions in corpus-specific evidence (e.g., retrieval feedback, collection distribution) to align with dataset vocabulary and topical distribution.
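To make the first two families concrete, the sketch below shows how an expanded query string might be assembled for a lexical retriever once the LLM output is already in hand. It is an illustration only: the function names, the default of five query repetitions, and the idea of filtering duplicate keywords are assumptions made here, not QueryGym's API or the exact recipe of any evaluated method. The LLM call itself is left out.

```python
def expand_with_keywords(query: str, keywords: list[str]) -> str:
    """Keyword-level expansion: append generated terms to the original query.

    Keywords that already appear as query tokens, or that repeat an earlier
    keyword (case-insensitive), are skipped. This filtering is an assumption.
    """
    seen = set(query.lower().split())
    fresh = []
    for keyword in keywords:
        key = keyword.lower()
        if key not in seen:
            seen.add(key)
            fresh.append(keyword)
    return " ".join([query, *fresh])


def expand_with_pseudo_doc(query: str, pseudo_doc: str, query_repeats: int = 5) -> str:
    """Document-level expansion: concatenate the query with a generated passage.

    The query is repeated so that its terms are not swamped by the much
    longer pseudo-document; the repeat count is a hypothetical default.
    """
    return " ".join([query] * query_repeats + [pseudo_doc])
```

Corpus-grounded methods would follow the same concatenation pattern, except that the text handed to the LLM (and therefore the generated expansion) is conditioned on passages retrieved from the target collection.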

Experiments are conducted via QueryGym, ensuring identical token limits, temperature, and sampling parameters for all models and methods. The evaluation leverages BM25 for lexical retrieval, SPLADE for learned sparse retrieval, and BGE for dense retrieval, all via Pyserini.
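One way to picture the controlled setup is as a single frozen decoding configuration shared by every cell of the experimental grid. The sketch below is hypothetical: the class, the field names, the numeric values, and the dataset placeholders are assumptions for illustration (the summary does not state the actual settings), and it is not QueryGym code. Only the method, LLM, and retriever names come from the text above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DecodingConfig:
    # Placeholder values; the actual settings are not given in this summary.
    temperature: float = 0.0
    max_new_tokens: int = 256
    top_p: float = 1.0


METHODS = [
    "GenQR", "GenQREnsemble", "Q2K",
    "Query2Doc (ZS)", "Query2Doc (FS)", "Query2Doc (CoT)", "QA-Expand", "MUGI",
    "CSQE", "LameR",
]
LLMS = ["GPT-4.1", "GPT-4.1 nano", "Qwen2.5-72B", "Qwen2.5-7B"]
RETRIEVERS = ["BM25", "SPLADE", "BGE"]
# Nine placeholders standing in for the TREC DL and BEIR collections.
DATASETS = [f"dataset_{i}" for i in range(1, 10)]


def build_grid(config: DecodingConfig) -> list[dict]:
    """Enumerate every (method, LLM, retriever, dataset) cell with one shared config."""
    return [
        {"method": m, "llm": l, "retriever": r, "dataset": d, "decoding": config}
        for m, l, r, d in product(METHODS, LLMS, RETRIEVERS, DATASETS)
    ]


grid = build_grid(DecodingConfig())
```

If every combination were run, this grid would contain 10 x 4 x 3 x 9 = 1080 cells. Freezing the dataclass means no single run can silently change its decoding settings.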

Comparative Analysis of Reformulation Methods

Under strictly controlled conditions (e.g., GPT-4.1 with BM25), document-level and corpus-grounded expansion strategies demonstrate substantial improvements in nDCG@10 and recall across TREC DL and BEIR datasets, outperforming classical expansion (RM3) and keyword-level approaches. Notably:

Document-level methods (e.g., MUGI, Q2D (FS)) consistently achieve the highest nDCG@10, especially on challenging queries (DL-HARD) and domain-diverse datasets.
Chain-of-thought prompting does not yield further improvement and sometimes degrades performance, underscoring the importance of guided rather than unconstrained reasoning in expansion generation.
Corpus-grounded methods provide stable performance, particularly on adversarial or ambiguous queries, indicating domain-sensitive robustness.
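Since the comparisons above are stated in terms of nDCG@10, a minimal reference computation may help. The sketch assumes the linear-gain variant (the relevance grade is used directly as the gain) and a log2 rank discount; whether the study's evaluation scripts use exactly this variant is an assumption. The function and argument names are illustrative.

```python
import math


def dcg_at_k(gains: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 has discount 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: list[str], qrels: dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_doc_ids: retrieved document ids, best first.
    qrels: graded relevance judgments for this query; unjudged documents count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

The per-query effect of a reformulation method is then the difference between `ndcg_at_k` on the run retrieved with the expanded query and on the run retrieved with the original query, averaged over queries.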

Cross-Retriever Analysis and Paradigm Sensitivity

LLM-driven reformulation exhibits pronounced retriever-dependence:

Lexical Retrieval (BM25): Reformulation methods consistently produce significant gains in ranking metrics, both nDCG@10 and recall.
Learned Sparse Retrieval (SPLADE): Gains are reduced, with some expansion methods occasionally yielding negligible or negative changes. This reflects SPLADE's inherent neural expansion, which absorbs much of the benefit of explicit query-side enrichment.
Dense Retrieval (BGE): Surface-level lexical augmentations often fail to translate into meaningful embedding space improvements, and expansion can degrade performance on datasets where dense semantic encoding excels.

BM25 with effective LLM-based reformulation frequently achieves effectiveness comparable to that of state-of-the-art dense retrievers run on unexpanded queries, offering a cost-effective alternative without vector index overhead.

Domain Robustness and Dataset Sensitivity

Effectiveness gains from reformulation vary markedly across datasets, query difficulty, and retriever configurations. Reformulation provides the largest improvements on DL-HARD and underspecified queries, but less robust gains or even regressions on certain domain-specific collections (FiQA, COVID).

Box plots of $\Delta$ nDCG@10 reveal wide interquartile ranges, particularly for BM25, emphasizing domain-sensitive volatility.
BGE shows limited and dataset-dependent benefit, indicating that dense retrieval spaces react differently to reformulation.
SPLADE's compressed gains are consistent with its built-in model-side expansion making explicit query-side expansion largely redundant.
Figure 1: Lower RankCV signals methods with stable ranking across datasets (generalists), while higher RankCV denotes strong domain sensitivity (specialists).
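The summary does not spell out how RankCV is computed. One plausible reading, used in the sketch below, is the coefficient of variation (population standard deviation divided by mean) of a method's rank across datasets, where rank 1 is the best nDCG@10 on a dataset. Both this definition and all names in the code are assumptions; ties are broken arbitrarily by sort order.

```python
from statistics import mean, pstdev


def ranks_per_dataset(scores: dict[str, dict[str, float]]) -> dict[str, list[int]]:
    """Map {dataset: {method: score}} to {method: [rank on each dataset]}, 1 = best."""
    ranks: dict[str, list[int]] = {}
    for dataset in sorted(scores):
        by_score = scores[dataset]
        ordered = sorted(by_score, key=by_score.get, reverse=True)
        for position, method in enumerate(ordered, start=1):
            ranks.setdefault(method, []).append(position)
    return ranks


def rank_cv(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Coefficient of variation of each method's rank across datasets (assumed definition)."""
    return {
        method: pstdev(method_ranks) / mean(method_ranks)
        for method, method_ranks in ranks_per_dataset(scores).items()
    }
```

Under this reading, a method that holds the same rank on every dataset has a RankCV of 0 (a generalist), while a method whose rank swings between datasets gets a larger value (a specialist).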

LLM Backbone, Scale, and Stability

The impact of architectural lineage and parameter scale is nuanced:

Within the GPT-4.1 family, both nano and full variants produce similar cross-domain profiles, with modest within-family scale sensitivity.
Qwen2.5 demonstrates greater scale-dependent variability, with Qwen2.5-72B outperforming Qwen2.5-7B by larger margins on several datasets.

Radar charts and Spearman rank correlations show that method ranking is not preserved across LLM backbones or scales: comparative claims about reformulation effectiveness are not portable unless scoped to the exact LLM configuration.
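For readers who want to run the same kind of check on their own results, the sketch below computes Spearman's rank correlation between the method orderings obtained under two LLM configurations. It uses the classic closed form, which assumes no tied scores; the function name and input layout are illustrative assumptions.

```python
def spearman_rho(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Spearman rank correlation between two {method: score} mappings.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties.
    """
    methods = sorted(scores_a)
    if sorted(scores_b) != methods:
        raise ValueError("both configurations must score the same methods")
    if len(methods) < 2:
        raise ValueError("need at least two methods to compare rankings")

    def to_ranks(scores: dict[str, float]) -> dict[str, int]:
        ordered = sorted(methods, key=scores.get, reverse=True)
        return {method: rank for rank, method in enumerate(ordered, start=1)}

    ranks_a, ranks_b = to_ranks(scores_a), to_ranks(scores_b)
    n = len(methods)
    squared_diffs = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in methods)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))
```

A value near 1 means the two configurations order the methods almost identically; values near 0 or below indicate that a ranking established with one LLM says little about the ranking under the other.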

LLM $\times$ method interaction explains up to 22% of variance in nDCG@10 across datasets, comparable to LLM main effects. Method choice explains most aggregate variance, but substantial interaction effects mean that single-model evaluations overstate generalizability.
Rank agreement is unreliable within certain families (e.g., GPT-4.1 vs. GPT-4.1 nano), but more stable in Qwen2.5.
Figure 2: Radar charts show crossing lines and low rank correlations for GPT-4.1, illustrating instability; Qwen2.5 maintains stable method rankings.
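The notion of variance "explained" by the LLM, the method, and their interaction can be illustrated with the textbook two-way sum-of-squares decomposition. The sketch assumes one score per (LLM, method) cell, for example nDCG@10 on a single dataset, so the interaction term is simply what remains after removing both main effects. Whether the study uses this exact analysis is an assumption, and all names are illustrative.

```python
def variance_shares(table: dict[str, dict[str, float]]) -> dict[str, float]:
    """Split total variation in a {llm: {method: score}} table into three shares.

    Requires every method to be scored under every LLM (a balanced layout).
    """
    llms = sorted(table)
    methods = sorted(table[llms[0]])
    n_llms, n_methods = len(llms), len(methods)

    grand = sum(table[l][m] for l in llms for m in methods) / (n_llms * n_methods)
    llm_mean = {l: sum(table[l][m] for m in methods) / n_methods for l in llms}
    method_mean = {m: sum(table[l][m] for l in llms) / n_llms for m in methods}

    ss_llm = n_methods * sum((llm_mean[l] - grand) ** 2 for l in llms)
    ss_method = n_llms * sum((method_mean[m] - grand) ** 2 for m in methods)
    # Interaction: what is left in each cell once both main effects are removed.
    ss_interaction = sum(
        (table[l][m] - llm_mean[l] - method_mean[m] + grand) ** 2
        for l in llms
        for m in methods
    )

    total = ss_llm + ss_method + ss_interaction
    if total == 0:
        raise ValueError("all scores are identical; nothing to decompose")
    return {
        "llm": ss_llm / total,
        "method": ss_method / total,
        "interaction": ss_interaction / total,
    }
```

When the method ordering is identical under every LLM and only shifted by a constant, the interaction share is zero; crossing lines of the kind described for the radar charts show up as a large interaction share.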

Practical and Theoretical Implications

These findings clarify key constraints for reproducibility in LLM-based query reformulation:

Gains reported under lexical retrieval cannot be assumed to generalize to learned sparse or dense retrieval; paradigm-specific evaluation is essential.
Method effectiveness is highly conditional on dataset and query granularity, requiring multi-domain benchmarking.
The relationship between model scale and reformulation utility is method- and domain-dependent; increasing LLM capacity does not guarantee uniform benefit.
Comparative benchmarking is only valid per retrieval and LLM configuration, and must be contextualized given substantial interaction effects.

Practically, the study advocates for more rigorous standardization in evaluation pipelines, encourages open-source artifact release (QueryGym), and recommends reporting both aggregate effectiveness and cross-dataset stability (e.g., RankCV) for actionable reproducibility claims.

Conclusion

This controlled reproducibility study demonstrates that the effectiveness of LLM-based query reformulation depends jointly on retrieval paradigm, dataset domain, LLM backbone, and method design. Substantial gains are possible but are not reliably portable across heterogeneous evaluation settings. Future IR research should incorporate controlled cross-paradigm and cross-domain protocols, validate comparative claims across multiple LLM scales and architectures, and report not only mean effectiveness but also stability metrics. Open-source toolkits and leaderboards such as QueryGym are critical for transparent benchmarking and ongoing methodological development.