- The paper introduces the Myopic Trap, a positional bias in which retrieval models overvalue content at the beginning of documents and undervalue relevant information that appears later.
- It employs a semantics-preserving evaluation framework, repurposing SQuAD v2 and FineWeb-edu to create position-aware benchmarks for IR.
- Findings reveal that while BM25 is robust, embedding and ColBERT-style models suffer from positional bias, which reranker models can effectively mitigate.
This paper introduces the concept of the "Myopic Trap" in Information Retrieval (IR), defining it as a specific positional bias where retrieval models disproportionately focus on the initial parts of documents, consequently undervaluing relevant information located later in the text. This phenomenon poses a practical challenge for IR systems, potentially leading to inaccurate relevance estimation.
To systematically quantify this bias, the authors propose a semantics-preserving evaluation framework. This framework repurposes existing NLP datasets, SQuAD v2 [18] and FineWeb-edu [6], into position-aware retrieval benchmarks.
- SQuAD-PosQ: Built from SQuAD v2, which provides character-level answer start positions. The dataset is filtered to answerable questions only, and questions are then grouped by the answer's starting character index within the passage (e.g., 0-100, 100-200, 500+; see the bucketing sketch after this list). A decline in retrieval performance for questions whose answers appear later in the passage indicates positional bias.
- FineWeb-PosQ: Constructed using longer passages (500-1024 words) from FineWeb-edu. LLMs like gpt-4o-mini [16] are used to generate position-aware question-answer pairs, targeting specific segments (beginning, middle, end) of each passage. This dataset allows for evaluating positional bias in longer document contexts, addressing potential data leakage concerns associated with heavily used datasets like SQuAD.
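To make the bucketing concrete, here is a minimal sketch of how answerable SQuAD v2 questions could be grouped by answer start offset. The field names follow the public SQuAD v2 JSON schema; the exact bucket boundaries beyond those cited above are assumptions.

```python
import json
from collections import defaultdict

# Bucket boundaries are illustrative; the paper's full bucketing scheme
# beyond the examples cited above is assumed here.
BUCKETS = [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500), (500, float("inf"))]

def bucket_of(start_char: int) -> str:
    for lo, hi in BUCKETS:
        if lo <= start_char < hi:
            return f"{lo}+" if hi == float("inf") else f"{lo}-{hi}"
    raise ValueError(start_char)

def build_posq(squad_v2_path: str) -> dict:
    """Group answerable SQuAD v2 questions into answer-position buckets."""
    with open(squad_v2_path) as f:
        data = json.load(f)["data"]
    groups = defaultdict(list)
    for article in data:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                if qa.get("is_impossible"):  # keep answerable questions only
                    continue
                start = qa["answers"][0]["answer_start"]  # character offset
                groups[bucket_of(start)].append(
                    {"question": qa["question"], "passage": para["context"]}
                )
    return groups
```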
The paper evaluates a range of state-of-the-art retrieval models across the full IR pipeline:
- BM25 [7]: A classical sparse retrieval method.
- Embedding Models: Including bge-m3-dense [10], stella_en_400M_v5 [8], text-embedding-3-large [17], voyage-3-large [13], gte-Qwen2-7B-instruct [20], and NV-embed-v2 [9]. These models encode queries and documents into single dense vectors.
- ColBERT-style Models: colbertv2.0 [22] and bge-m3-colbert [10]. These models use multi-vector representations and perform late interaction (MaxSim) at scoring time; see the sketch after this list.
- Rerankers: bge-reranker-v2-m3 [11], gte-multilingual-reranker-base [21], and bge-reranker-v2-gemma [11]. These models use cross-attention for deep interaction between query and document, typically applied to a smaller candidate set.
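For readers unfamiliar with late interaction, the sketch below shows standard ColBERT-style MaxSim scoring over token-level embeddings; random unit vectors stand in for real model outputs.

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: each (L2-normalized) query token
    takes its maximum similarity over all document tokens, and the
    per-token maxima are summed.

    q_emb: [num_query_tokens, dim], d_emb: [num_doc_tokens, dim]
    """
    sim = q_emb @ d_emb.T               # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()  # MaxSim, then sum over query tokens

# Toy usage with random unit vectors in place of real token embeddings.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(200, 128), dim=-1)
print(maxsim_score(q, d))
```

Because every query token is matched against every document token, one might expect late interaction to wash out positional bias; as the results below show, bias introduced at encoding time largely survives it.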
Evaluation is performed using NDCG@10. Due to computational costs, ColBERT-style and reranker models are evaluated on smaller, validated subsets (SQuAD-PosQ-Tiny, FineWeb-PosQ-Tiny), while BM25 and embedding models are evaluated on the full datasets.
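For the common case of binary relevance with a single gold passage per query, NDCG@10 reduces to a reciprocal-log discount on the gold passage's rank; a minimal sketch under that assumption:

```python
import math

def ndcg_at_10(ranked_ids: list[str], gold_id: str) -> float:
    """NDCG@10 with binary relevance and one relevant passage per query:
    DCG = 1 / log2(rank + 1) if the gold passage is in the top 10, else 0.
    The ideal DCG is 1 / log2(2) = 1, so no normalization term survives."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id == gold_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```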
The experimental results reveal significant differences in positional bias across model types:
- BM25: Shows strong robustness to positional bias, with stable performance regardless of the relevant content's position. Its term-matching nature is position-agnostic.
- Embedding Models: Exhibit widespread vulnerability to the Myopic Trap: performance consistently drops as relevant information moves later in the document. The paper attributes this bias partly to contrastive training. Notable exceptions are NV-embed-v2 and voyage-3-large, which are comparatively more robust. Further analysis (Appendix C) confirms that embedding models' full-text representations have higher cosine similarity with the beginning segment of a document than with later segments (a probe of this kind is sketched after this list).
- ColBERT-style Models: Also suffer from positional bias, indicating that late token-level interaction does not fully eliminate the bias introduced during the encoding phase. However, bge-m3-colbert is more robust than its single-vector counterpart, bge-m3-dense, suggesting that the ColBERT-style training approach might offer some mitigation potential.
- Reranker Models: Effectively mitigate the Myopic Trap. Their deep cross-attention mechanisms can precisely locate relevant content anywhere in the passage, neutralizing positional bias when used in downstream stages.
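The segment-similarity analysis attributed to Appendix C can be illustrated with a simple probe: embed a document and each of its equal-length segments, then compare cosine similarities. A minimal sketch, assuming a sentence-transformers bi-encoder as a stand-in for the embedding models the paper studies:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative stand-in model; any bi-encoder exposing .encode() works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

def positional_similarity(doc: str, n_segments: int = 3) -> list[float]:
    """Cosine similarity between the full-document embedding and each
    equal-length segment's embedding. A steep drop from the first segment
    to later ones indicates a Myopic-Trap-style representational bias."""
    words = doc.split()
    step = len(words) // n_segments
    segments = [" ".join(words[i * step:(i + 1) * step]) for i in range(n_segments)]
    full = model.encode(doc, normalize_embeddings=True)
    segs = model.encode(segments, normalize_embeddings=True)
    return [float(util.cos_sim(full, s)) for s in segs]
```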
The practical implications of this paper are significant, especially for systems relying on accurate document retrieval, such as Retrieval-Augmented Generation (RAG). While first-stage retrievers (embedding and ColBERT-style models) are susceptible to the Myopic Trap, a subsequent reranking stage using models with strong interaction capabilities can substantially correct for this bias, improving the reliability and fairness of the overall system. The findings underscore the importance of considering positional bias when designing IR pipelines and highlight the need for future research into training embedding models that are less prone to this issue.
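Concretely, the mitigation the paper points to is the standard retrieve-then-rerank pattern. A minimal sketch, assuming sentence-transformers-compatible models (the retriever here is an illustrative stand-in; bge-reranker-v2-m3 is one of the rerankers the paper evaluates):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: fast but potentially position-biased first-stage retrieval (bi-encoder).
retriever = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in model
# Stage 2: cross-attention reranker, which the paper finds neutralizes
# the Myopic Trap on the candidate set.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def retrieve_then_rerank(query: str, corpus: list[str], k: int = 100, top: int = 10):
    doc_embs = retriever.encode(corpus, normalize_embeddings=True)
    q_emb = retriever.encode(query, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, doc_embs, top_k=k)[0]  # first-stage top-k
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = reranker.predict([(query, c) for c in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:top]
```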
Limitations of the paper include its focus on English text, potential noise in LLM-generated synthetic data, and the lack of a theoretical explanation for the observed representational biases in embedding models. Future work aims to explore multilingual settings, improve data quality, and deepen the theoretical understanding of embedding behavior.