End-to-End Retrieval with Learned Dense and Sparse Representations Using Lucene (2311.18503v1)
Abstract: The bi-encoder architecture provides a framework for understanding machine-learned retrieval models based on dense and sparse vector representations. Although these representations capture parametric realizations of the same underlying conceptual framework, their respective implementations of top-$k$ similarity search require the coordination of different software components (e.g., inverted indexes, HNSW indexes, and toolkits for neural inference), often knitted together in complex architectures. In this work, we ask the following question: What's the simplest design, in terms of requiring the fewest changes to existing infrastructure, that can support end-to-end retrieval with modern dense and sparse representations? The answer appears to be that Lucene is sufficient, as we demonstrate in Anserini, a toolkit for reproducible information retrieval research. That is, effective retrieval with modern single-vector neural models can be efficiently performed directly in Java on the CPU. We examine the implications of this design for information retrieval researchers pushing the state of the art as well as for software engineers building production search systems.
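The bi-encoder framework described above reduces retrieval, for both dense and sparse representations, to top-$k$ similarity search over vector representations: dense vectors are compared by inner product, while sparse vectors score only on overlapping terms, which is exactly what an inverted index exploits. The following is a minimal conceptual sketch of that shared abstraction in plain Python; it is an illustration of the framework, not Anserini's or Lucene's actual implementation (which uses HNSW indexes and inverted indexes rather than brute-force scoring), and all document IDs and weights are made up.

```python
import heapq

def dot_dense(q, d):
    """Inner product of two dense vectors (lists of floats)."""
    return sum(x * y for x, y in zip(q, d))

def dot_sparse(q, d):
    """Inner product of two sparse vectors (term -> weight dicts).
    Only overlapping terms contribute, mirroring an inverted-index lookup."""
    return sum(w * d[t] for t, w in q.items() if t in d)

def top_k(query, docs, score_fn, k=2):
    """Brute-force top-k similarity search: score every document, keep the best k."""
    scored = ((score_fn(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)

# Dense retrieval: documents and query are points in the same embedding space.
dense_docs = {"d1": [1.0, 0.0], "d2": [0.5, 0.5]}
dense_hits = top_k([1.0, 0.0], dense_docs, dot_dense)

# Sparse retrieval: documents and query are weighted bags of terms.
sparse_docs = {"d1": {"lucene": 1.5}, "d2": {"index": 0.5, "graph": 1.0}}
sparse_hits = top_k({"lucene": 2.0, "index": 1.0}, sparse_docs, dot_sparse)
```

Both calls go through the same `top_k` routine with a different scoring function, which is the sense in which dense and sparse models are "parametric realizations of the same underlying conceptual framework"; production systems differ only in how they accelerate this search (HNSW graphs for dense, inverted indexes for sparse).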
Authors:
- Haonan Chen
- Carlos Lassance
- Jimmy Lin