- The paper presents Seismic, a retrieval algorithm that restructures inverted indexes with block-summary techniques to achieve sub-millisecond query latencies.
- It leverages static pruning, clustering into cohesive blocks, and summary vectors to efficiently filter candidate documents.
- Experimental results on the MS MARCO dataset confirm significant latency reductions and good scalability while preserving retrieval accuracy.
Efficient Inverted Indexes for Learned Sparse Representations
Introduction
The paper presents a new approximate retrieval algorithm named Seismic, which speeds up search over learned sparse embeddings. The underlying challenge is that traditional inverted index-based retrieval techniques, such as WAND or MaxScore, are ill-suited to learned sparse embeddings because their weight distributions differ markedly from those of term frequency-based models like BM25. Seismic reorganizes the inverted index and combines it with a forward index, optimizing the search process through strategic blocking and summarization of inverted lists.
Methodology
Seismic introduces a novel framework for indexing and retrieval that operates on geometrically cohesive blocks within an inverted index, each supplemented by a summary vector. The method comprises the following components:
- Static Pruning and Blocking: The inverted list of each dictionary term is truncated to keep only its highest-weight entries, reducing the index size. Each pruned list is then partitioned into blocks via a clustering algorithm so that the documents within a block are geometrically cohesive (see the index-construction sketch after this list).
- Summary Vectors: Summaries are constructed for each block to approximate the maximum inner product a query might achieve with any document in the block. During querying, these summaries quickly ascertain whether a block contains potential candidate documents, thereby accelerating the querying process.
- Forward Index: Alongside the inverted index, a forward index is used to store exact document representations, facilitating precise computation of the inner product when a document needs to be scored.
- Query Processing: Querying maintains a min-heap of the current top-scoring documents, restricts attention to a cut of the highest-weight query terms, and compares each block's summary score against a scaled heap threshold, skipping blocks unlikely to contain competitive candidates without touching their documents (a query-processing sketch follows the construction sketch below).
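The indexing pipeline can be illustrated with a short sketch. The Python code below is a minimal, illustrative approximation of the construction steps above, not the authors' implementation: documents are `{term_id: weight}` dicts, `build_seismic_index`, `lambda_cut`, and `n_blocks` are assumed names, scikit-learn's `KMeans` stands in for the paper's clustering step, and the summary pruning and quantization described in the paper are omitted.

```python
# Minimal sketch of Seismic-style index construction (illustrative only).
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def build_seismic_index(docs, lambda_cut=500, n_blocks=8):
    """docs: list of {term_id: weight} sparse vectors, one per document."""
    # Forward index: keep the exact sparse representation of every document
    # so that candidates can be scored exactly at query time.
    forward_index = docs

    # Raw inverted lists: term_id -> [(doc_id, weight), ...].
    inverted = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term_id, weight in doc.items():
            inverted[term_id].append((doc_id, weight))

    index = {}
    for term_id, postings in inverted.items():
        # Static pruning: keep only the lambda_cut entries with the largest
        # weights for this term, discarding the long low-weight tail.
        postings.sort(key=lambda p: p[1], reverse=True)
        postings = postings[:lambda_cut]
        doc_ids = [d for d, _ in postings]

        # Blocking: cluster the retained documents into cohesive blocks using
        # their sparse vectors (densified over the terms they actually use).
        vocab = sorted({t for d in doc_ids for t in docs[d]})
        t2col = {t: j for j, t in enumerate(vocab)}
        mat = np.zeros((len(doc_ids), len(vocab)))
        for i, d in enumerate(doc_ids):
            for t, w in docs[d].items():
                mat[i, t2col[t]] = w
        k = min(n_blocks, len(doc_ids))
        labels = KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(mat)

        # Summary vectors: the coordinate-wise maximum over a block's
        # documents upper-bounds the inner product any query can achieve
        # with any document in that block.
        blocks = []
        for b in range(k):
            members = [doc_ids[i] for i in range(len(doc_ids)) if labels[i] == b]
            if not members:
                continue
            summary = defaultdict(float)
            for d in members:
                for t, w in docs[d].items():
                    summary[t] = max(summary[t], w)
            blocks.append({"doc_ids": members, "summary": dict(summary)})
        index[term_id] = blocks

    return index, forward_index
```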
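Query processing can then be sketched on top of the index returned by `build_seismic_index` above. Again this is only illustrative: `cut` and `heap_factor` are assumed parameter names, and the exact thresholding and traversal details differ from the paper's implementation.

```python
# Minimal sketch of Seismic-style query processing over the index built by
# build_seismic_index above (illustrative only).
import heapq


def dot(a, b):
    """Inner product of two sparse {term_id: weight} vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def search(query, index, forward_index, k=10, cut=5, heap_factor=0.8):
    """query: {term_id: weight}. Returns approximate top-k (score, doc_id)."""
    # Consider only the 'cut' query terms with the largest weights.
    terms = sorted(query, key=query.get, reverse=True)[:cut]

    heap = []       # min-heap of (score, doc_id) for the current top-k
    scored = set()  # documents already scored exactly

    for term_id in terms:
        for block in index.get(term_id, []):
            # Summary test: skip the block unless its (over-)estimated score
            # clears a scaled version of the current k-th best score.
            est = dot(query, block["summary"])
            if len(heap) == k and est * heap_factor <= heap[0][0]:
                continue
            # Otherwise score every document in the block exactly via the
            # forward index and keep the best k seen so far.
            for doc_id in block["doc_ids"]:
                if doc_id in scored:
                    continue
                scored.add(doc_id)
                score = dot(query, forward_index[doc_id])
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))

    return sorted(heap, reverse=True)
```

In this sketch the coordinate-wise-maximum summary exactly upper-bounds every document score in its block, so the skip test only discards blocks that cannot improve the current top-k by the chosen margin; the paper additionally prunes and quantizes summaries to save space, trading a strict bound for a smaller index.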
Experimental Results
The evaluation of Seismic is conducted against several strong baselines on the MS MARCO dataset using various learned sparse embeddings, including SPLADE and Efficient SPLADE (E-SPLADE). The results are promising:
- Latency and Accuracy: Seismic offers substantial improvements in query latency, reaching sub-millisecond levels while maintaining competitive retrieval accuracy metrics. Compared to other state-of-the-art solutions, it can achieve latency reductions by an order of magnitude or more, depending on the embedding and configuration.
- Scalability: With respect to index size and build time, Seismic is shown to be efficient, producing compact indexes that can be built quickly and scale to large collections.
Theoretical Implications and Practical Applications
The development of Seismic contributes significantly both theoretically and practically. Theoretically, it challenges existing assumptions about the structures required for efficient inverted index-based retrieval by introducing a novel block-summary paradigm. Practically, it opens up new possibilities for implementing efficient and scalable information retrieval systems capable of handling modern sparse embeddings, which are increasingly prevalent due to their effectiveness and interpretability.
Future Directions
Potential future work includes exploring additional compression techniques for summaries and inverted lists to further enhance efficiency. Another interesting avenue could be the adaptation of Seismic's methodology to other forms of vector embeddings or different domains requiring efficient retrieval mechanisms.
Overall, Seismic represents a significant advancement in the field of information retrieval, particularly in the context of searching over learned sparse representations, and sets the stage for further innovations in this area.