- The paper demonstrates that dense retrieval models rapidly lose performance in large indices due to a higher likelihood of false positives.
- Empirical results using the MS MARCO dataset reveal that dense models outperform sparse techniques on small indices but lag as index size increases.
- The study highlights that techniques like hard negatives and hybrid strategies can partially offset the limitations of dense representations in extensive search collections.
The paper by Reimers and Gurevych, "The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes" (ACL 2021), provides a rigorous analysis of how well dense low-dimensional representations serve information retrieval at large index sizes. While dense vector representations can outperform traditional sparse representations such as BM25 on smaller collections, this research examines how those gains erode as the index grows.
Theoretical Framework
The core theoretical contribution is a demonstration that dense representations suffer a more rapid performance decline than sparse alternatives as the index grows. The phenomenon is driven by the dimensionality of the dense vector space: lower dimensionality increases the probability of false positives, in which irrelevant documents are mistakenly retrieved ahead of relevant ones. The paper gives this a mathematical treatment, showing that the probability of encountering a false positive rises with index size and falls with dimensionality.
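To make the scaling intuition concrete, here is a simplified model (a sketch consistent with the paper's argument, not its exact derivation), where p(d) denotes the assumed probability that a single random irrelevant document outscores the relevant one for a given query, a quantity that grows as the dimensionality d shrinks:

```latex
% Simplified false-positive model: p(d) is the chance that one random
% irrelevant document outscores the relevant one; it grows as d shrinks.
P(\text{at least one false positive} \mid n, d) = 1 - \bigl(1 - p(d)\bigr)^{n}
```

Even a tiny per-document rate compounds: at p = 10^-5, an index of one million documents yields at least one false positive with probability 1 - (1 - 10^-5)^1000000, roughly 0.99995.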
Empirical Validation
The authors validate this theoretical claim with empirical experiments on the MS MARCO passage retrieval dataset, which show that dense representations do decline markedly in performance as the index grows. While dense retrieval models outperform sparse models like BM25 on smaller indices (e.g., 10,000 documents), they struggle on larger collections, where sparse models can regain superiority.
Furthermore, dense models trained with hard negatives show improved resilience to large index sizes, though the benefit diminishes at larger scales. Experiments that insert random noise vectors into the index further corroborate the findings, illustrating how readily false positives arise given the concentrated, anisotropic vector spaces produced by pretrained models like BERT.
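The random-noise result is easy to reproduce in miniature. The following self-contained simulation is illustrative only: random unit vectors stand in for real embeddings, and cos_rel is an assumed query-to-relevant-document cosine similarity rather than a number from the paper.

```python
# Sketch: how often at least one random "noise" vector in the index
# outscores the relevant document, as a function of index size n and
# embedding dimensionality d. Random unit vectors stand in for real
# embeddings; cos_rel is an assumed query-to-relevant-document cosine
# similarity. Illustrative numbers, not the paper's experimental setup.
import numpy as np

rng = np.random.default_rng(0)

def false_positive_rate(n, d, cos_rel=0.5, trials=20):
    """Fraction of trials in which some random distractor beats the relevant doc."""
    hits = 0
    for _ in range(trials):
        # Random unit query vector.
        q = rng.standard_normal(d)
        q /= np.linalg.norm(q)
        # n random unit vectors inserted into the index as noise.
        noise = rng.standard_normal((n, d))
        noise /= np.linalg.norm(noise, axis=1, keepdims=True)
        # False positive: the best noise vector outscores the relevant
        # document, whose similarity to the query is cos_rel by assumption.
        if (noise @ q).max() > cos_rel:
            hits += 1
    return hits / trials

for d in (16, 64, 256):
    rates = [false_positive_rate(n, d) for n in (1_000, 10_000, 100_000)]
    print(f"d={d:3d}:", "  ".join(f"{r:.2f}" for r in rates))
```

With these toy settings, the estimated false-positive rate rises with the index size n and falls with the dimensionality d, mirroring the paper's qualitative finding.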
Practical Implications
The paper’s findings have significant implications for both the practical deployment of dense retrieval systems and the theoretical understanding of dense versus sparse approaches:
- For practitioners, these results caution against naively deploying dense retrieval models over very large indices unless the weaknesses are mitigated by techniques such as hard-negative training or hybrid retrieval strategies (see the sketch after this list).
- From a theoretical standpoint, this research urges a reevaluation of the assumed superiority of dense retrieval in all scenarios. The dependency on dimensionality and the structure of the vector space hints at underlying complexities that warrant further exploration.
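As a minimal sketch of the hybrid strategy mentioned above, the snippet below interpolates min-max-normalized BM25 and dense scores. Assumptions: encode is a hypothetical stand-in for any dense bi-encoder returning unit-norm embeddings, the rank_bm25 package supplies the sparse side, and alpha is an illustrative weight.

```python
# Sketch: score-level hybrid of BM25 and dense retrieval. Both score lists
# are min-max normalized and linearly interpolated. `encode` is a
# hypothetical stand-in for any dense bi-encoder returning unit-norm
# embeddings; rank_bm25 (pip install rank-bm25) supplies the sparse side.
import numpy as np
from rank_bm25 import BM25Okapi

def minmax(x):
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_scores(query, corpus, encode, alpha=0.5):
    # Sparse side: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    sparse = bm25.get_scores(query.split())
    # Dense side: cosine similarity via unit-norm embeddings.
    dense = encode(corpus) @ encode([query])[0]
    # alpha trades off the dense and sparse contributions.
    return alpha * minmax(dense) + (1 - alpha) * minmax(sparse)
```

One simple way to act on the paper's finding is to shift weight toward the sparse side (lower alpha) as the index grows, since sparse scoring degrades more gracefully at scale.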
Future Directions in AI
The insights from this paper open several avenues for future research. Innovations might focus on:
- Developing more robust dimensionality reduction techniques that preserve retrieval accuracy even in extensive indices.
- Enhancing the uniformity of vector distributions within high-dimensional spaces to reduce the probability of false positives (see the whitening sketch after this list).
- Investigating hybrid models that dynamically balance between dense and sparse retrieval approaches based on real-time context or index size conditions.
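On the second direction, one known post-hoc technique from prior work on sentence embeddings (not proposed in this paper) is whitening: centering the embeddings and rescaling along principal directions so the space becomes more isotropic. A minimal sketch:

```python
# Sketch: ZCA whitening of an embedding matrix (rows are vectors) to make
# the space more isotropic -- a post-hoc transform from prior work on
# sentence embeddings, shown only to illustrate the direction above.
import numpy as np

def whiten(embeddings, eps=1e-6):
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = np.cov(centered, rowvar=False)
    # Eigendecomposition of the covariance yields the whitening transform.
    vals, vecs = np.linalg.eigh(cov)
    transform = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return centered @ transform
```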
In conclusion, while dense low-dimensional retrieval models present an attractive alternative to traditional methods, their limitations at scale underscore the need for nuanced application and continued development within AI research. This paper is a notable contribution to the ongoing dialogue in the information retrieval community, clarifying the trade-offs inherent in dense representation systems.