Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations (2404.18812v1)

Published 29 Apr 2024 in cs.IR

Abstract: Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

Summary

  • The paper presents Seismic, a retrieval algorithm that restructures inverted indexes using block-summary techniques to achieve sub-millisecond query latency.
  • It leverages static pruning, clustering into cohesive blocks, and summary vectors to efficiently filter candidate documents.
  • Experimental results on the MS MARCO dataset confirm significant latency reductions and scalability while preserving retrieval accuracy.

Efficient Inverted Indexes for Learned Sparse Representations

Introduction

The paper presents a new approximate retrieval algorithm named Seismic, which improves search efficiency over learned sparse embeddings. The core challenge it addresses is that traditional inverted index-based retrieval techniques, such as WAND and MaxScore, perform poorly when applied directly to learned sparse embeddings, whose weight distributions differ markedly from those of term frequency-based models like BM25. Seismic reorganizes the inverted index and couples it with a forward index, optimizing the search process through strategic blocking and summarization of inverted lists.

Methodology

Seismic introduces a novel framework for indexing and retrieval that organizes inverted lists into geometrically cohesive blocks, each supplemented by a summary vector. The method can be broken down into the following components; illustrative code sketches follow the list:

  • Static Pruning and Blocking: The inverted list for each dictionary term is truncated to retain only its highest-weight entries, up to a fixed cutoff, which shrinks the index. Each pruned list is then partitioned into blocks via a clustering algorithm so that documents within a block are geometrically close.
  • Summary Vectors: A summary is constructed for each block to approximate the maximum inner product a query could achieve with any document in the block. During query processing, these summaries are used to quickly determine whether a block may contain candidate documents, accelerating query evaluation.
  • Forward Index: Alongside the inverted index, a forward index is used to store exact document representations, facilitating precise computation of the inner product when a document needs to be scored.
  • Query Processing: Query evaluation restricts attention to the highest-weight query terms and maintains a min-heap of the current top-scoring documents; a block is fully evaluated only when its summary suggests it could contribute a competitive score, so unlikely blocks are bypassed entirely.
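
To make the indexing side concrete, below is a minimal Python sketch of how pruning, blocking, and summary construction could fit together. It is an illustration under assumptions, not the paper's implementation: the function and parameter names (build_index, pruning_cutoff, n_blocks), the dense vocabulary size, and the use of scikit-learn's k-means as a stand-in for the paper's clustering step are all assumptions.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

VOCAB_SIZE = 30_000  # assumed vocabulary size for this sketch

def sparse_to_dense(doc):
    """Expand a {term_id: weight} dict into a dense numpy vector."""
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    for term, weight in doc.items():
        v[term] = weight
    return v

def build_index(docs, pruning_cutoff=100, n_blocks=10):
    """Build a block-organized inverted index over sparse documents.

    docs: list of {term_id: weight} dicts; the list itself doubles as
    the forward index. Returns {term_id: [(summary, [doc_ids]), ...]}.
    """
    # 1. Raw inverted lists: term -> [(weight, doc_id)].
    inverted = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term, weight in doc.items():
            inverted[term].append((weight, doc_id))

    index = {}
    for term, postings in inverted.items():
        # 2. Static pruning: keep only the highest-weight postings.
        postings.sort(reverse=True)
        doc_ids = [doc_id for _, doc_id in postings[:pruning_cutoff]]

        # 3. Blocking: cluster the retained documents so that each
        #    block groups geometrically similar vectors.
        vecs = np.array([sparse_to_dense(docs[d]) for d in doc_ids])
        k = min(n_blocks, len(doc_ids))
        labels = KMeans(n_clusters=k, n_init=1).fit_predict(vecs)

        blocks = []
        for b in range(k):
            members = [d for d, lab in zip(doc_ids, labels) if lab == b]
            # 4. Summary: component-wise max over the block, an
            #    optimistic bound on any member's term weights.
            summary = np.max(vecs[labels == b], axis=0)
            blocks.append((summary, members))
        index[term] = blocks
    return index
```

In the paper, the summaries are additionally pruned and quantized to save space; that refinement is omitted here for brevity.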
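
The query loop can be sketched in the same spirit, reusing sparse_to_dense from the indexing sketch above. The two knobs here, a query-term cutoff (query_cut) and a heap-relative block filter (heap_factor), mirror the dual-threshold idea described in the list, though the names and exact semantics are assumptions rather than Seismic's precise parameterization.

```python
import heapq
import numpy as np

def search(query, index, docs, k=10, query_cut=5, heap_factor=0.8):
    """Approximate top-k retrieval over the block-organized index."""
    q = sparse_to_dense(query)   # helper from the indexing sketch
    heap, seen = [], set()       # min-heap of (score, doc_id)

    # Threshold 1: consider only the highest-weight query terms.
    for term in sorted(query, key=query.get, reverse=True)[:query_cut]:
        for summary, members in index.get(term, []):
            # Optimistic score estimate for the whole block.
            upper = float(np.dot(q, summary))
            # Threshold 2: skip the block if even the optimistic
            # estimate cannot beat the current k-th best score.
            if len(heap) == k and heap_factor * upper <= heap[0][0]:
                continue
            for doc_id in members:
                if doc_id in seen:
                    continue
                seen.add(doc_id)
                # Exact inner product via the forward index.
                score = sum(w * query.get(t, 0.0)
                            for t, w in docs[doc_id].items())
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

Lowering heap_factor (or query_cut) skips more work and trades recall for speed, which is the throughput/recall knob evaluated in the experiments.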

Experimental Results

The evaluation of Seismic is conducted against several strong baselines on the MS MARCO dataset with various learned sparse embeddings, including SPLADE and Efficient SPLADE (E-SPLADE). The results are promising:

  • Latency and Accuracy: Seismic offers substantial improvements in query latency, reaching sub-millisecond per-query levels while maintaining competitive retrieval accuracy. Compared to state-of-the-art inverted index-based solutions, it reduces latency by one to two orders of magnitude, depending on the embedding and configuration.
  • Scalability: With respect to index size and build time, Seismic is efficient, producing compact, quickly constructed indexes that scale to large datasets.

Theoretical Implications and Practical Applications

The development of Seismic contributes significantly both theoretically and practically. Theoretically, it challenges existing assumptions about the structures required for efficient inverted index-based retrieval by introducing a novel block-summary paradigm. Practically, it opens up new possibilities for implementing efficient and scalable information retrieval systems capable of handling modern sparse embeddings, which are increasingly prevalent due to their effectiveness and interpretability.

Future Directions

Potential future work includes exploring additional compression techniques for summaries and inverted lists to further enhance efficiency. Another interesting avenue could be the adaptation of Seismic's methodology to other forms of vector embeddings or different domains requiring efficient retrieval mechanisms.

Overall, Seismic represents a significant advancement in the field of information retrieval, particularly in the context of searching over learned sparse representations, and sets the stage for further innovations in this area.