Approximate Cluster-Based Sparse Document Retrieval with Segmented Maximum Term Weights (2404.08896v1)
Abstract: This paper revisits cluster-based retrieval, which partitions the inverted index into multiple groups and partially skips the index at the cluster and document levels during online inference with a learned sparse representation. It proposes an approximate search scheme with two parameters that control the rank-safeness competitiveness of pruning, using segmented maximum term weights within each cluster. Segmenting the cluster-level maximum weights tightens the rank score bound estimation and lets threshold-based pruning adapt approximately to the tightness of that estimation, resulting in better relevance and efficiency. Experiments with the MS MARCO passage ranking and BEIR datasets demonstrate the usefulness of the proposed scheme in comparison with the baselines. The paper presents the design of this approximate retrieval scheme with a rank-safeness analysis, compares clustering and segmentation options, and reports evaluation results.
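The idea of pruning with segmented maximum term weights can be illustrated with a minimal Python sketch. It is not the paper's implementation. The names `mu` and `eta`, the data layout, and the exact pruning rule (skip a cluster when its maximum segment bound is at most `theta / mu` and its average segment bound is at most `theta / eta`) are assumptions made for illustration, since the abstract only states that two parameters control pruning. Term weights are assumed to be non-negative.

```python
import heapq
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]            # term -> non-negative weight
Segment = List[Tuple[int, SparseVec]]   # (doc_id, document vector) pairs
Cluster = List[Segment]                 # a cluster is a list of segments


def dot(query: SparseVec, vec: SparseVec) -> float:
    """Sparse dot product between a query and a document (or bound) vector."""
    return sum(w * vec.get(t, 0.0) for t, w in query.items())


def segment_max_weights(segment: Segment) -> SparseVec:
    """Per-term maximum weight over all documents in one segment."""
    out: SparseVec = {}
    for _, doc in segment:
        for term, weight in doc.items():
            if weight > out.get(term, 0.0):
                out[term] = weight
    return out


def search(query: SparseVec, clusters: List[Cluster], k: int,
           mu: float = 1.0, eta: float = 1.0) -> List[Tuple[float, int]]:
    """Top-k retrieval with cluster- and segment-level skipping.

    Hypothetical rule: with 0 < mu <= eta <= 1, smaller values prune more
    aggressively; mu = eta = 1 only skips groups whose bound cannot beat
    the current k-th score.
    """
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id)
    for cluster in clusters:
        # In a real index the segment maxima would be precomputed offline.
        bounds = [dot(query, segment_max_weights(seg)) for seg in cluster]
        theta = heap[0][0] if len(heap) == k else float("-inf")
        max_bound = max(bounds)
        avg_bound = sum(bounds) / len(bounds)
        if max_bound <= theta / mu and avg_bound <= theta / eta:
            continue  # cluster-level skip
        for seg, bound in zip(cluster, bounds):
            theta = heap[0][0] if len(heap) == k else float("-inf")
            if bound <= theta / eta:
                continue  # segment-level skip
            for doc_id, doc in seg:
                score = dot(query, doc)
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

With `mu = eta = 1`, this sketch returns the same top-k scores as exhaustive scoring, because a segment is skipped only when its bound does not exceed the current k-th score. Lowering either parameter trades that guarantee for fewer scored documents.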