
Benchmarking the Myopic Trap: Positional Bias in Information Retrieval (2505.13950v1)

Published 20 May 2025 in cs.IR

Abstract: This study investigates a specific form of positional bias, termed the Myopic Trap, where retrieval models disproportionately attend to the early parts of documents while overlooking relevant information that appears later. To systematically quantify this phenomenon, we propose a semantics-preserving evaluation framework that repurposes existing NLP datasets into position-aware retrieval benchmarks. By evaluating SOTA models across the full retrieval pipeline, including BM25, embedding models, ColBERT-style late-interaction models, and reranker models, we offer a broader empirical perspective on positional bias than prior work. Experimental results show that embedding models and ColBERT-style models exhibit significant performance degradation when query-related content is shifted toward later positions, indicating a pronounced head bias. Notably, under the same training configuration, the ColBERT-style approach shows greater potential for mitigating positional bias than the traditional single-vector approach. In contrast, BM25 and reranker models remain largely unaffected by such perturbations, underscoring their robustness to positional bias. Code and data are publicly available at: www.github.com/NovaSearch-Team/RAG-Retrieval.

Summary

  • The paper introduces the Myopic Trap, a positional bias where retrieval models overvalue initial content and undervalue later relevant information.
  • It employs a semantics-preserving evaluation framework, repurposing SQuAD v2 and FineWeb-edu to create position-aware benchmarks for IR.
  • Findings reveal that while BM25 is robust, embedding and ColBERT-style models suffer from positional bias, which reranker models can effectively mitigate.

This paper introduces the concept of the "Myopic Trap" in Information Retrieval (IR), defining it as a specific positional bias where retrieval models disproportionately focus on the initial parts of documents, consequently undervaluing relevant information located later in the text. This phenomenon poses a practical challenge for IR systems, potentially leading to inaccurate relevance estimation.

To systematically quantify this bias, the authors propose a semantics-preserving evaluation framework. This framework repurposes existing NLP datasets, SQuAD v2 [18] and FineWeb-edu [6], into position-aware retrieval benchmarks.

  1. SQuAD-PosQ: Built from SQuAD v2, which provides character-level answer start positions. The dataset is filtered to include only answerable questions and then questions are grouped based on the answer's starting character index within the passage (e.g., 0-100, 100-200, 500+). A decline in retrieval performance for questions whose answers are located later in the document indicates positional bias.
  2. FineWeb-PosQ: Constructed using longer passages (500-1024 words) from FineWeb-edu. LLMs like gpt-4o-mini [16] are used to generate position-aware question-answer pairs, targeting specific segments (beginning, middle, end) of each passage. This dataset allows for evaluating positional bias in longer document contexts, addressing potential data leakage concerns associated with heavily used datasets like SQuAD.
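The position-bucketing step behind SQuAD-PosQ can be sketched as follows. The `answers["answer_start"]` field matches the SQuAD v2 record layout, but the bucket edges are illustrative assumptions, not the paper's exact configuration:

```python
def bucket_by_answer_position(examples, edges=(0, 100, 200, 300, 400, 500)):
    """Group answerable questions by the character offset of their answer.

    `examples` are SQuAD-v2-style records; unanswerable questions (empty
    `answer_start` list) are filtered out, mirroring the paper's setup.
    """
    buckets = {f"{lo}-{hi}": [] for lo, hi in zip(edges, edges[1:])}
    buckets[f"{edges[-1]}+"] = []
    for ex in examples:
        if not ex["answers"]["answer_start"]:  # skip unanswerable questions
            continue
        start = ex["answers"]["answer_start"][0]
        for lo, hi in zip(edges, edges[1:]):
            if lo <= start < hi:
                buckets[f"{lo}-{hi}"].append(ex)
                break
        else:  # answer starts beyond the last edge
            buckets[f"{edges[-1]}+"].append(ex)
    return buckets
```

Comparing retrieval scores across these buckets is then what exposes (or rules out) a positional trend.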

The paper evaluates a range of state-of-the-art retrieval models across the full IR pipeline:

  • BM25 [7]: A classical sparse retrieval method.
  • Embedding Models: Including bge-m3-dense [10], stella_en_400M_v5 [8], text-embedding-3-large [17], voyage-3-large [13], gte-Qwen2-7B-instruct [20], and NV-embed-v2 [9]. These models encode queries and documents into single dense vectors.
  • ColBERT-style Models: colbertv2.0 [22] and bge-m3-colbert [10]. These models use multi-vector representations and perform late interaction during scoring.
  • Rerankers: bge-reranker-v2-m3 [11], gte-multilingual-reranker-base [21], and bge-reranker-v2-gemma [11]. These models use cross-attention for deep interaction between query and document, typically applied to a smaller candidate set.

Evaluation is performed using NDCG@10. Due to computational costs, ColBERT-style and reranker models are evaluated on smaller, validated subsets (SQuAD-PosQ-Tiny, FineWeb-PosQ-Tiny), while BM25 and embedding models are evaluated on the full datasets.
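For reference, NDCG@10 with binary relevance labels (the natural setting for these QA-derived benchmarks) reduces to the sketch below; the paper does not describe its exact metric implementation, so this is a standard formulation rather than the authors' code:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, hence +2
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```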

The experimental results reveal significant differences in positional bias across model types:

  • BM25: Shows strong robustness to positional bias, with stable performance regardless of the relevant content's position. Its term-matching nature is position-agnostic.
  • Embedding Models: Exhibit widespread vulnerability to the Myopic Trap. Performance consistently drops as relevant information moves later in the document. This bias is attributed partly to contrastive training processes. Notable exceptions are NV-embed-v2 and voyage-3-large, which show comparatively better robustness. Further analysis (Appendix C) confirms that embedding models' full-text representations have higher cosine similarity with the beginning segment than later segments.
  • ColBERT-style Models: Also suffer from positional bias, indicating that late-stage token interaction does not fully eliminate the bias introduced during the encoding phase. However, the bge-m3-colbert model shows greater robustness compared to its single-vector counterpart, bge-m3-dense, suggesting that the ColBERT-style training approach might offer some mitigation potential.
  • Reranker Models: Effectively mitigate the Myopic Trap. Their deep cross-attention mechanisms can precisely locate relevant content anywhere in the passage, neutralizing positional bias when used in downstream stages.
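The segment-similarity probe attributed above to Appendix C can be approximated as follows: embed the full document, embed equal-length segments of it, and compare cosine similarities; head bias shows up as higher similarity for the opening segment. Here `toy_embed` is only a runnable stand-in for a real embedding model:

```python
import numpy as np

def segment_similarity_profile(embed, document, n_segments=3):
    """Cosine similarity between the full-document embedding and each
    equal-length segment's embedding. `embed` is any text -> 1-D vector
    callable; in practice it would be a real embedding model."""
    full = embed(document)
    step = max(1, len(document) // n_segments)
    segments = [document[i:i + step] for i in range(0, len(document), step)]
    segments = segments[:n_segments]

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return [cos(full, embed(seg)) for seg in segments]

def toy_embed(text):
    """Character-frequency vector over a-z: a stand-in, not a real model."""
    v = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - ord("a")] += 1
    return v + 1e-8  # avoid zero vectors

profile = segment_similarity_profile(toy_embed, "x" * 30 + "y" * 30 + "z" * 30)
```

With a genuinely head-biased encoder, `profile[0]` would exceed the later entries; the toy encoder above is position-agnostic, so its profile is flat.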

The practical implications of this paper are significant, especially for systems relying on accurate document retrieval, such as Retrieval-Augmented Generation (RAG). While first-stage retrievers (embedding and ColBERT-style models) are susceptible to the Myopic Trap, a subsequent reranking stage using models with strong interaction capabilities can substantially correct for this bias, improving the reliability and fairness of the overall system. The findings underscore the importance of considering positional bias when designing IR pipelines and highlight the need for future research into training embedding models that are less prone to this issue.
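The recommended pipeline shape can be sketched minimally as retrieve-then-rerank; the two scoring functions below are hypothetical stand-ins for a (position-biased) embedding model and a (position-robust) cross-encoder reranker:

```python
def retrieve_then_rerank(query, corpus, first_score, rerank_score,
                         k_first=100, k_final=10):
    """First stage: cheap scoring over the whole corpus (may be head-biased).
    Second stage: expensive reranking of the shortlist, which the paper
    finds neutralizes the Myopic Trap."""
    shortlist = sorted(corpus, key=lambda d: first_score(query, d),
                       reverse=True)[:k_first]
    return sorted(shortlist, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

def head_biased_score(query, doc):
    # toy first-stage scorer: only inspects the opening of the document
    return sum(w in doc[:20] for w in query.split())

def full_text_score(query, doc):
    # toy "reranker": inspects the whole document
    return sum(w in doc for w in query.split())

corpus = [
    "irrelevant filler text about something else entirely",
    "filler filler filler filler filler the answer appears here late",
]
top = retrieve_then_rerank("answer late", corpus,
                           head_biased_score, full_text_score,
                           k_first=2, k_final=1)
```

The head-biased first stage cannot distinguish the two documents, but the full-text second stage surfaces the one whose relevant content sits at the end.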

Limitations of the paper include its focus on English text, potential noise in LLM-generated synthetic data, and the lack of a theoretical explanation for the observed representational biases in embedding models. Future work aims to explore multilingual settings, improve data quality, and deepen the theoretical understanding of embedding behavior.