Searching Dense Representations with Inverted Indexes
Abstract: Nearly all implementations of top-$k$ retrieval with dense vector representations today take advantage of hierarchical navigable small-world network (HNSW) indexes. However, the generation of vector representations and efficiently searching large collections of vectors are distinct challenges that can be decoupled. In this work, we explore the contrarian approach of performing top-$k$ retrieval on dense vector representations using inverted indexes. We present experiments on the MS MARCO passage ranking dataset, evaluating three dimensions of interest: output quality, speed, and index size. Results show that searching dense representations using inverted indexes is possible. Our approach exhibits reasonable effectiveness with compact indexes, but is impractically slow. Thus, while workable, our solution does not provide a compelling tradeoff and is perhaps best characterized today as a "technical curiosity".
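The abstract does not specify the indexing mechanism, but one common way to search dense vectors with an inverted index (used, e.g., by Lucene's LSH-based analyzers) is to hash each vector into discrete "fake word" tokens via random hyperplanes, so that standard postings-list retrieval can score documents by token overlap. The following is a minimal sketch of that general idea, not the paper's actual implementation; all names and parameters (`lsh_tokens`, `band_size`, the toy corpus) are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def lsh_tokens(vec, hyperplanes, band_size=4):
    """Hash a dense vector into string tokens: take the sign bit of the
    projection onto each random hyperplane, then group bits into bands
    so each token encodes a small locality-sensitive signature."""
    bits = (hyperplanes @ vec >= 0).astype(int)
    return [
        f"b{i // band_size}_" + "".join(map(str, bits[i:i + band_size]))
        for i in range(0, len(bits), band_size)
    ]

rng = np.random.default_rng(0)
dim, n_planes = 64, 32
hyperplanes = rng.standard_normal((n_planes, dim))

# Toy corpus of random dense vectors; build an inverted index
# mapping each token to the set of documents containing it.
docs = {i: rng.standard_normal(dim) for i in range(100)}
index = defaultdict(set)
for doc_id, vec in docs.items():
    for tok in lsh_tokens(vec, hyperplanes):
        index[tok].add(doc_id)

def search(query_vec, k=5):
    """Top-k retrieval via postings-list traversal: score each
    candidate document by the number of LSH tokens it shares
    with the query, a proxy for angular similarity."""
    scores = defaultdict(int)
    for tok in lsh_tokens(query_vec, hyperplanes):
        for doc_id in index[tok]:
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A document queried with its own vector shares all of its tokens with itself and so ranks first; for other queries, recall depends on the number of hyperplanes and the band size, which is one source of the quality/speed tradeoff the abstract describes.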