Semi-Parametric Retrieval via Binary Bag-of-Tokens Index (2405.01924v2)

Published 3 May 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.

Summary

  • The paper introduces SVDR (referred to as SiDR in the abstract above), a dual-index retrieval framework achieving up to 9% higher top-1 accuracy than BM25 with its binary token index.
  • It demonstrates significant efficiency gains by reducing index preparation time from 30 GPU hours to 2 CPU hours and storage from 31 GB to 2 GB.
  • SVDR’s dual-index system offers flexible trade-offs, supporting high-accuracy applications and quick, low-resource setups for rapid deployment.

Exploring SVDR: A Dual-Index Retrieval Framework

Introduction to Semi-parametric Vocabulary Disentangled Retrieval (SVDR)

Efficiently retrieving relevant information from massive corpora such as Wikipedia has long been a challenge. To address this, the authors developed Semi-parametric Vocabulary Disentangled Retrieval (SVDR), a bi-encoder framework that supports two types of indexes: an embedding-based index for high accuracy and a binary token index for quick, low-cost setup.
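To make the contrast concrete, below is a minimal, self-contained sketch of the two index types. It is illustrative only and not the authors' implementation: the toy vocabulary, `simple_tokenize`, and `embed` are placeholders standing in for a real tokenizer and a learned query encoder that produces weights over the same vocabulary space.

```python
import numpy as np

# Toy vocabulary; a real system would use the model's full tokenizer vocabulary.
VOCAB = {"who": 0, "wrote": 1, "hamlet": 2, "shakespeare": 3, "play": 4, "novel": 5}
V = len(VOCAB)

def simple_tokenize(text):
    """Map whitespace-separated words to vocabulary ids (placeholder tokenizer)."""
    return [VOCAB[t] for t in text.lower().split() if t in VOCAB]

def binary_token_index(passages):
    """Non-parametric index: one binary bag-of-tokens vector per passage.
    Building it requires only tokenization, no neural encoder."""
    index = np.zeros((len(passages), V), dtype=np.float32)
    for i, passage in enumerate(passages):
        index[i, simple_tokenize(passage)] = 1.0
    return index

def embed(text):
    """Placeholder for the parametric encoder: it should output learned weights
    over the same vocabulary dimensions; here we simply count matching tokens."""
    vec = np.zeros(V, dtype=np.float32)
    for tok in simple_tokenize(text):
        vec[tok] += 1.0
    return vec

passages = ["Shakespeare wrote the play Hamlet", "Someone wrote a novel"]
token_index = binary_token_index(passages)                # cheap, CPU-only indexing
embedding_index = np.stack([embed(p) for p in passages])  # requires encoding every passage

query = embed("who wrote hamlet")          # the query side is always encoded
print((token_index @ query).argmax())      # -> 0, the Hamlet passage
print((embedding_index @ query).argmax())  # -> 0 as well
```

The key asymmetry is that building the binary token index never calls the encoder, which is what makes indexing cheap and parameter-agnostic, while the query side remains parametric.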

Key Findings from the SVDR Evaluation

SVDR was evaluated on three open-domain question answering benchmarks, and the results are compelling:

  • When using an embedding-based index, SVDR demonstrated a 3% higher top-1 retrieval accuracy than the dense retriever DPR.
  • For the binary token index, SVDR showed a 9% higher top-1 accuracy compared to BM25.
  • Switching to the binary token index reduced index preparation from 30 GPU hours to 2 CPU hours and storage from 31 GB to 2 GB, a reduction of more than 90% in both time and storage.

These results indicate that SVDR not only matches but in many cases surpasses current state-of-the-art retrieval systems in efficiency and effectiveness.

Practical Implications of SVDR

The dual-index design of SVDR offers flexibility depending on application needs:

  1. High Accuracy Needs: For applications where retrieval accuracy is critical, such as in precision medicine or legal research, the embedding-based index is ideal.
  2. Quick Setup and Low Resource Needs: When rapid deployment is required or computational resources are limited, the binary token index offers a viable alternative. This makes SVDR particularly appealing to startups and individual researchers with limited hardware.

Theoretical Contributions and Future AI Developments

SVDR's approach also pushes forward our theoretical understanding in several ways:

  • Semi-parametric Learning: By pairing a parametric query encoder with a non-parametric document index, SVDR shows how semi-parametric models can be built and studied in retrieval settings.
  • Flexible Indexing: The research advances flexible indexing strategies, which are crucial for AI systems that need dynamic corpus updates without significant downtime; a brief sketch of this point follows below.
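As a hedged illustration of the dynamic-update point above (not code from the paper): because the binary token index is non-parametric, adding a newly published document is pure tokenization and array bookkeeping, with no encoder forward pass. The vocabulary and helper below are toy placeholders.

```python
import numpy as np

# Toy vocabulary standing in for a real tokenizer vocabulary.
VOCAB = {"election": 0, "results": 1, "weather": 2, "storm": 3, "update": 4}

def to_binary_row(text):
    """Binary bag-of-tokens vector for one document (tokenization only)."""
    row = np.zeros(len(VOCAB), dtype=np.float32)
    for tok in text.lower().split():
        if tok in VOCAB:
            row[VOCAB[tok]] = 1.0
    return row

# Existing index over the current corpus.
index = np.stack([to_binary_row("election results update")])

# A document published after the index was built: appending it needs no GPU
# and no re-encoding of the existing corpus, so the index stays fresh.
index = np.vstack([index, to_binary_row("storm weather update")])
print(index.shape)  # (2, 5)
```

With an embedding-based index, the same update would instead require an encoder forward pass for each newly added passage.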

Potential Future Research Directions

Given its success, SVDR sets the stage for multiple future research pathways:

  • Expansion to Other Domains: Testing and tuning SVDR in domains like real-time news, social media analysis, and other dynamically changing datasets could be promising.
  • Cost and Performance Optimization: Further research could explore optimizing the trade-offs between cost, performance, and accuracy, especially in resource-constrained environments.
  • Integration with Other AI Systems: There is potential to integrate SVDR with other AI operations, particularly those involving real-time learning and adaptation.

Conclusion

SVDR represents a significant step forward in information retrieval technology. Its ability to switch between a resource-heavy, highly accurate index and a quick, cost-effective one without compromising too much on performance addresses a long-standing challenge in the field. This duality makes it a promising tool for a wide array of applications, pushing the boundaries of what's possible in both academic research and practical AI implementations.
