Masking or Mitigating? Deconstructing the Impact of Query Rewriting on Retriever Biases in RAG

Published 7 Apr 2026 in cs.IR | (2604.06097v1)

Abstract: Dense retrievers in retrieval-augmented generation (RAG) systems exhibit systematic biases -- including brevity, position, literal matching, and repetition biases -- that can compromise retrieval quality. Query rewriting techniques are now standard in RAG pipelines, yet their impact on these biases remains unexplored. We present the first systematic study of how query enhancement techniques affect dense retrieval biases, evaluating five methods across six retrievers. Our findings reveal that simple LLM-based rewriting achieves the strongest aggregate bias reduction (54\%), yet fails under adversarial conditions where multiple biases combine. Mechanistic analysis uncovers two distinct mechanisms: simple rewriting reduces bias through increased score variance, while pseudo-document generation methods achieve reduction through genuine decorrelation from bias-inducing features. However, no technique uniformly addresses all biases, and effects vary substantially across retrievers. Our results provide practical guidance for selecting query enhancement strategies based on specific bias vulnerabilities. More broadly, we establish a taxonomy distinguishing query-document interaction biases from document encoding biases, clarifying the limits of query-side interventions for debiasing RAG systems.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper demonstrates that simple LLM-based rewriting reduces aggregate dense retriever bias by up to 53.9% but may only mask bias by increasing score variance.
The study reveals that pseudo-document methods, especially with continual pretraining, achieve genuine decorrelation for brevity, literal, and position biases, though effects vary by retriever architecture.
The findings suggest that query-side enhancements must be tailored to specific bias profiles, as certain biases remain resistant even in adversarial scenarios.

Query Rewriting and Retriever Bias Mitigation in RAG: Mechanistic and Practical Insights

Background and Motivation

Retrieval-Augmented Generation (RAG) frameworks, which leverage dense retrievers to supply relevant context to LLMs, are widely deployed for knowledge-intensive applications. However, mounting evidence demonstrates that dense retrievers systematically prefer documents based on superficial, surface-level characteristics—such as brevity, position of answer content, literal lexical overlap, and repetition of key entities—rather than prioritizing genuine semantic relevance. These biases undermine retrieval fidelity and propagate errors into generative responses. As RAG pipelines typically contain query rewriting or enhancement stages (via LLM-based reformulation or pseudo-document expansion), a critical unanswered question is whether such interventions mask or genuinely mitigate retrieval biases.

Figure 1: Dense retrievers in RAG pipelines are vulnerable to several systematic biases; this work probes whether query rewriting can alleviate these failures.

Experimental Design and Methodology

The study isolates and quantifies retrieval biases using the ColDeR benchmark, which supplies controlled document pairs differing only in a specific surface attribute while preserving factual equivalence. The four primary bias types interrogated are:

Brevity bias: short documents are favored,
Literal matching bias: lexical overlap between query and document is prioritized,
Position bias: answer content appearing early in a document receives preferential scoring,
Repetition bias: documents with repeated entity mentions are ranked higher.

Six diverse dense retrievers (Contrastive Bi-Encoders, Generative Pretraining, Token-Level Interaction) are analyzed under five distinct query enhancement strategies:

Simple LLM-based rewriting,
HyDE (hypothetical document replacement),
Query2Doc (query augmentation via pseudo-documents),
HyDE-CPT (HyDE with continual pretraining using domain data),
Q2D-CPT (Query2Doc with continual pretraining).

Retrieval bias is measured by the mean $|t|$ -statistic over document pairs, reflecting the strength of retriever preference for biased characteristics. Complementary analyses examine feature-score correlations to distinguish variance-based bias masking from actual decorrelation.

Empirical Findings

Aggregate Bias Reduction

Simple LLM-based rewriting achieves the highest overall bias reduction (53.9% decrease in mean $|t|$ -statistic) across retrievers and bias types:

Baseline: 8.72,
Rewrite: 4.02,
Pseudo-document methods: 6.07–6.95.

However, pseudo-document generation methods (HyDE, Query2Doc, and CPT variants) yield more modest aggregate improvements, but exhibit differential bias mitigation profiles dependent on both retriever architecture and bias type.

Figure 2: Mean $|t|$ -statistic by bias type and enhancement method; simple rewriting robustly reduces bias but pseudo-document methods have bias-specific effects.

Architectural and Method-Level Bias Specificity

Bias reduction is highly retriever-dependent. For instance:

Query2Doc dramatically reduces bias for DRAGON/DRAGON+,
HyDE increases bias for CoCoDR,
Late interaction retriever ColBERTv2 is largely unaffected by enhancement method. Such findings reinforce the necessity of validating enhancement strategies for specific deployment retrievers.
Figure 3: Bias reduction by retriever and enhancement method; rewriting is consistently effective, pseudo-documents are heterogeneous across retrievers.

Resistance of Certain Biases to Query-Side Interventions

Brevity bias remains largely resilient to query rewriting, indicating deep embedding in document representations.
Literal matching bias is strongly reduced by all enhancement types, while repetition bias is surprisingly resistant to pseudo-document methods but reduced by rewriting.

Adversarial and Compound Bias Scenarios

In adversarial scenarios (Foil subset), where multiple biases are combined, simple rewriting fails to provide improved accuracy. Conversely, pseudo-document and CPT approaches achieve substantial robustness, particularly for CoCoDR, where HyDE-CPT yields a 13-fold increase in accuracy (26%).

Mechanistic Analysis

Spearman correlation analyses between score features and bias-inducing attributes reveal:

Rewriting lowers $|t|$ -statistic by increasing score variance, not by reducing sensitivity to bias features—thus masking rather than mitigating bias,
Pseudo-document methods achieve genuine decorrelation for brevity, literal, and position biases, indicating principled bias reduction,
All enhancement strategies increase susceptibility to repetition bias, reflecting a systematic failure mode.

Generalizability Across LLMs

Parallel experiments with Qwen3-4B-Instruct replicate the main findings observed for Gemma-3-12B-IT, confirming that enhancement mechanism trends are not artifacts of a particular LLM architecture or scale.

Figure 4: Complete bias analysis with Qwen3-4B-Instruct; results confirm method generalizability across LLMs.

Practical and Theoretical Implications

Practitioners must tailor query enhancement selection to retriever architecture and bias profile. Simple rewriting is computationally cheap but unreliable in adversarial settings; pseudo-document methods (especially CPT variants) provide genuine decorrelation and are preferable for high-stakes deployment despite added complexity.
Aggregate bias metrics can be misleading; reduction in observable bias scores may mask increased feature sensitivity. Reporting feature-score correlations is imperative for meaningful bias evaluation.
The taxonomy established in the study distinguishes query-document interaction biases (amenable to query-side mitigation) from document encoding biases (resistant to query modifications), necessitating retriever-level intervention for the latter.
The continuation of RAG paradigm optimization requires both improved retriever training objectives (to prioritize semantic evidence over surface heuristics) and more sophisticated query enhancement strategies that minimize introduction or reinforcement of new biases.

Conclusion

This work provides the first systematic deconstruction of the impact of query rewriting on dense retriever bias in RAG systems (2604.06097). It elucidates that while all enhancement methods reduce aggregate bias metrics, only pseudo-document generation with continual pretraining achieves robust mitigation through decorrelation. Simple rewriting, despite strong aggregate improvements, fails in adversarial scenarios and increases feature sensitivity, thereby masking underlying bias. The findings demand that researchers and practitioners transcend aggregate bias metrics, adopt feature-level correlation analyses, and exercise fine-grained strategy selection based on bias taxonomy and retriever architecture. The continued evolution of RAG must focus on engineering retrieval pipelines that are robust to superficial document and query characteristics and prioritize true semantic relevance.