Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

Published 7 Apr 2026 in cs.IR | (2604.06163v1)

Abstract: Recent studies show that neural retrievers often display source bias, favoring passages generated by LLMs over human-written ones, even when both are semantically similar. This bias has been considered an inherent flaw of retrievers, raising concerns about the fairness and reliability of modern information access systems. Our work challenges this view by showing that source bias stems from supervision in retrieval datasets rather than the models themselves. We found that non-semantic differences, like fluency and term specificity, exist between positive and negative documents, mirroring differences between LLM and human texts. In the embedding space, the bias direction from negatives to positives aligns with the direction from human-written to LLM-generated texts. We theoretically show that retrievers inevitably absorb the artifact imbalances in the training data during contrastive learning, which leads to their preferences over LLM texts. To mitigate the effect, we propose two approaches: 1) reducing artifact differences in training data and 2) adjusting LLM text vectors by removing their projection on the bias vector. Both methods substantially reduce source bias. We hope our study alleviates some concerns regarding LLM-generated texts in information access systems.

Authors (6)

Summary

The paper reveals that dataset artifacts, not neural retriever architecture, primarily drive LLM text bias in relevance-supervised settings.
Empirical and embedding-space analyses demonstrate that artifact imbalances induce non-semantic shortcuts, distorting result rankings.
Mitigation strategies, including training-time artifact control and inference-time projection, effectively reduce LLM bias with minimal retrieval quality loss.

Data Artifacts as the Determinant of Source Bias in Neural Retrievers

Problematic Nature of Source Bias in IR Systems

LLMs have established a dual-source text ecosystem, resulting in hybrid corpora containing both human-written and LLM-generated documents. Empirically, relevance-supervised neural retrievers exhibit consistent source bias, systematically favoring LLM-generated passages over semantically similar human-written content. This phenomenon is consequential in IR: it can distort result rankings, demote authentic human sources, and reinforce LLM outputs through retrieval feedback loops. These behaviors undermine fairness and could also jeopardize the validity and utility of open-domain IR systems.

Attribution of Source Bias: Dataset Artifacts vs. Model Architecture

The prevailing hypothesis attributes the bias primarily to architectural similarities between pretrained language models (PLMs) and LLMs, or to an inherent preference for low-perplexity text arising from distributional shifts and scoring shortcuts. However, analysis across model families reveals that this bias is not intrinsic to neural retrieval architectures. Source bias becomes pronounced only in relevance-supervised settings. General-purpose embedding models and unsupervised retrievers (e.g., SimCSE, Contriever) exhibit negligible or dataset-dependent biases. When these same unsupervised models are fine-tuned with retrieval-specific supervision (for example, on MS MARCO), a pronounced shift toward LLM text preference emerges. Thus, the model family itself is not a sufficient condition for source bias; instead, artifact imbalances in supervised datasets drive the effect.

Mechanistic Analysis: Linguistic Artifact Imbalances

Detailed linguistic analyses demonstrate that positive documents in retrieval datasets (i.e., those labeled as relevant) systematically differ from negative ones along several non-semantic axes. Specifically, positives have significantly lower perplexity and higher lexical specificity (IDF), paralleling the statistical properties of LLM-generated texts. This is a direct consequence of curation processes: positives are more fluent and condensed because they are intentionally drawn from edited, high-quality sources, whereas negatives are more diverse, noisy, and contain more disfluent or generic patterns. This artifact imbalance is not idiosyncratic to MS MARCO but holds across a variety of IR datasets.
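The lexical-specificity comparison described above can be sketched as follows. This is a minimal illustration, not the paper's measurement pipeline: the whitespace tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `idf_table`, `mean_idf`, `artifact_gap`) are all assumptions introduced here. Perplexity is omitted because it requires a language model.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical whitespace tokenizer; a real analysis would use a proper one.
    return text.lower().split()


def idf_table(corpus):
    """Smoothed IDF per term: log((N + 1) / (df + 1)) + 1 (an assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((n_docs + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}


def mean_idf(doc, idf):
    """Average IDF of the tokens in one document (0.0 for an empty document)."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    unseen = max(idf.values())  # treat unseen terms as maximally rare
    return sum(idf.get(t, unseen) for t in tokens) / len(tokens)


def artifact_gap(positives, negatives):
    """Mean IDF of positives minus mean IDF of negatives.

    A value above zero means the positives use more specific vocabulary.
    """
    idf = idf_table(positives + negatives)
    pos = sum(mean_idf(d, idf) for d in positives) / len(positives)
    neg = sum(mean_idf(d, idf) for d in negatives) / len(negatives)
    return pos - neg
```

Running `artifact_gap` on the labeled positives and sampled negatives of a retrieval dataset, and again on LLM-generated versus human-written versions of the same passages, would show whether the two gaps share a sign, which is the parallel the analysis above describes.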

Embedding Space: Directional Alignment Between Supervision and Source

In the embedding space, the displacement vector separating positives and negatives coincides with the vector separating LLM and human document embeddings. This alignment is robust across datasets and retriever architectures. Statistical analysis shows that the LLM–human embedding direction forms a coherent, stable axis, and that this same direction is learned as a shortcut for relevance ranking due to artifact correlations in the supervision signal. This embedding-space analysis is supported by a theoretical decomposition: the retriever’s scoring function acquires an artifact-dependent term proportional to the dataset-induced imbalance, and this term is linearly decodable in the embedding space.
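One way to probe this alignment is to compare two centroid-difference directions with a cosine. The sketch below assumes that each direction is estimated as a difference of group means; the paper may estimate it differently, and the function names are hypothetical. The demo data is synthetic and only shows the call pattern.

```python
import numpy as np


def mean_difference_direction(group_a, group_b):
    """Unit vector pointing from the centroid of group_b to the centroid of group_a."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("the two groups have identical centroids")
    return diff / norm


def direction_alignment(pos, neg, llm, human):
    """Cosine between the negative-to-positive and human-to-LLM directions."""
    supervision_dir = mean_difference_direction(pos, neg)
    source_dir = mean_difference_direction(llm, human)
    return float(supervision_dir @ source_dir)


if __name__ == "__main__":
    # Synthetic embeddings: both contrasts are shifted along one shared axis.
    rng = np.random.default_rng(0)
    shared = np.zeros(16)
    shared[0] = 1.0
    neg = rng.normal(size=(200, 16))
    pos = rng.normal(size=(200, 16)) + 2.0 * shared
    human = rng.normal(size=(200, 16))
    llm = rng.normal(size=(200, 16)) + 2.0 * shared
    print(direction_alignment(pos, neg, llm, human))
```

A cosine near 1 would indicate that the two directions nearly coincide, while a value near 0 would indicate that they are unrelated.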

Theoretical Results: Decomposition and Causality

Contrastive learning under artifact imbalance induces a Bayes-optimal scorer that decomposes into semantic and artifact-based components. If the positive and negative samples are statistically divergent in artifact features, the optimal solution unavoidably encodes these distinctions—even if they are uncorrelated with underlying semantic relevance. General-purpose and unsupervised objectives, by virtue of their symmetric and artifact-independent constructions, do not admit consistent artifact-dependent shortcuts.
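The decomposition can be illustrated with a short derivation. This is a sketch under simplifying assumptions, not the paper's theorem: it uses the standard result that the optimal InfoNCE scorer is a log density ratio, and it assumes that a document splits into semantic content z and artifact features a that are independent of z and of the query. All symbols are introduced here for illustration.

```latex
% Sketch only. q: query, d: document, z: semantic content of d,
% a: artifact features of d (for example fluency or term specificity).
% p^+ and p^- are the positive and negative sampling distributions.

% Optimal contrastive scorer, up to a query-dependent constant c(q):
s^*(q, d) = \log \frac{p^+(d \mid q)}{p^-(d)} + c(q)

% Assumption: d = (z, a), with a independent of (z, q) under both distributions:
p^+(z, a \mid q) = p^+(z \mid q)\, p^+(a), \qquad
p^-(z, a) = p^-(z)\, p^-(a)

% The scorer then splits into a semantic term and an artifact term:
s^*(q, d) = \underbrace{\log \frac{p^+(z \mid q)}{p^-(z)}}_{\text{semantic}}
          + \underbrace{\log \frac{p^+(a)}{p^-(a)}}_{\text{artifact}} + c(q)

% The artifact term is zero for every a only if p^+(a) = p^-(a).
% Its mean over positives is a KL divergence:
\mathbb{E}_{a \sim p^+}\!\left[ \log \frac{p^+(a)}{p^-(a)} \right]
  = \mathrm{KL}\!\left( p^+(a) \,\|\, p^-(a) \right) \ge 0
```

Under these assumptions the artifact term carries no information about the query, yet it still raises the score of any document whose artifact features look like those of training positives.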

Source Bias Mitigation: Data- and Embedding-Level Interventions

Two mitigation strategies are proposed and validated:

Training-time Artifact Control: Adopting in-batch-only negative sampling (i.e., constructing negatives solely from the annotated positive pool of other queries) suppresses non-semantic artifact imbalances. This substantially reduces source bias (average ANDSR@5 shifts from -0.099 to -0.024) at the cost of a small drop in retrieval effectiveness.
Inference-time Projection: For deployment, one can estimate the direction associated with LLM-induced artifacts and remove the projection of each candidate embedding onto this vector prior to scoring. This simple linear intervention yields significant bias reduction and negligible loss in retrieval quality, with no retraining required. The effect is systematic across datasets and retriever types.
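The inference-time projection in the second item can be sketched in a few lines. How the bias direction is estimated is an assumption here (a difference of LLM and human centroids on a small paired sample), and the function names are hypothetical.

```python
import numpy as np


def estimate_bias_direction(llm_embs, human_embs):
    """Unit vector from the human centroid to the LLM centroid (assumed estimator)."""
    llm = np.asarray(llm_embs, dtype=float)
    human = np.asarray(human_embs, dtype=float)
    diff = llm.mean(axis=0) - human.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("LLM and human centroids coincide")
    return diff / norm


def remove_projection(doc_embs, bias_dir):
    """Subtract from each row its component along the unit vector bias_dir."""
    docs = np.asarray(doc_embs, dtype=float)
    coeffs = docs @ bias_dir  # one scalar projection per document
    return docs - np.outer(coeffs, bias_dir)


def score(query_emb, doc_embs):
    """Dot-product relevance scores for one query against many documents."""
    return np.asarray(doc_embs, dtype=float) @ np.asarray(query_emb, dtype=float)
```

A typical call would compute `bias = estimate_bias_direction(llm_sample, human_sample)` once offline, then rank with `score(query, remove_projection(candidates, bias))`. The retriever weights stay untouched, which is why no retraining is needed.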

Both strategies corroborate the artifact hypothesis and are readily available for practical integration into IR systems that operate in LLM-hybrid collections.
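The training-time strategy can be illustrated with a contrastive loss whose negatives come only from the batch. This is a minimal NumPy sketch of an InfoNCE-style objective, not the authors' training code; the temperature value and function name are assumptions.

```python
import numpy as np


def in_batch_infonce(query_embs, pos_embs, temperature=0.05):
    """InfoNCE loss where row i of pos_embs is the positive for query i.

    Every other row of pos_embs acts as a negative for query i, so all
    negatives are annotated positives of other queries in the batch.
    """
    q = np.asarray(query_embs, dtype=float)
    p = np.asarray(pos_embs, dtype=float)
    logits = (q @ p.T) / temperature  # shape (B, B); diagonal holds the positives
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because every negative was itself labeled relevant for some query, positives and negatives are drawn from the same pool, so a stylistic feature cannot separate them the way it can when negatives come from a noisier source.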

Broader Implications and Future Directions

The result that source bias is not an intrinsic property of neural retrievers but is induced by artifact imbalances in retrieval supervision has several implications. By highlighting the causal pathway from dataset curation to bias emergence, the work motivates a shift toward artifact-controlled dataset construction for IR. As LLM-generated and human-written texts continue to coexist and proliferate, retrieval system design must carefully consider both the statistical makeup of relevance annotations and the consequences of model-data feedback loops. The embedding-space alignment between data artifacts and LLM-induced features may generalize to other modalities and IR-related tasks, prompting a wider investigation into the effects of stylistic and non-semantic supervision artifacts in deep learning pipelines.

Conclusion

This study rigorously disentangles the origin of source bias in neural retrievers, providing compelling evidence that artifact imbalances in supervised datasets—rather than model architectures or intrinsic objectives—are the primary causal factor. The paper presents robust linguistic, empirical, embedding-space, and theoretical corroboration. It introduces and validates actionable mitigation strategies at both the data and model level. These findings underscore the necessity of de-artifacted supervision for robust, fair IR and illuminate a clear path for future research into dataset composition and debiasing in AI systems operating in mixed text-source domains (2604.06163).
