Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
Introduction
The paper "Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling" by Clavié et al. addresses a critical challenge in Neural Information Retrieval (IR) systems that employ multi-vector representations, exemplified by models like ColBERT. While such approaches offer superior retrieval performance, particularly in out-of-domain settings, they suffer from significant storage and memory inefficiencies due to the necessity of storing a vector for each token rather than a single vector per document. Clavié et al. propose a novel clustering-based token pooling method designed to substantially shrink the size of these models without incurring significant retrieval performance losses.
Token Pooling Approach
The core contribution of this paper is token pooling, a method that reduces the total number of vectors stored when indexing documents. Token representations are grouped by clustering, and each cluster is then mean-pooled into a single vector. Importantly, this strategy requires no architectural changes and no additional query-time processing, allowing straightforward adoption in existing ColBERT-like models.
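As a concrete illustration, the hierarchical variant can be sketched in a few lines of Python. The snippet below is a minimal reimplementation assuming Ward-linkage agglomerative clustering via SciPy; the function name and exact clustering settings are illustrative rather than taken from the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_tokens(token_vectors: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Compress one document's token vectors by clustering + mean pooling.

    token_vectors: (num_tokens, dim) array of token embeddings.
    pool_factor:   target compression ratio; roughly num_tokens / pool_factor
                   vectors are kept.
    """
    num_tokens = token_vectors.shape[0]
    num_clusters = max(1, num_tokens // pool_factor)
    if num_tokens <= num_clusters:
        return token_vectors  # too few tokens to pool

    # Agglomerative clustering with Ward linkage over the token embeddings.
    tree = linkage(token_vectors, method="ward")
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")

    # Mean-pool the vectors within each cluster.
    return np.stack(
        [token_vectors[labels == c].mean(axis=0) for c in np.unique(labels)]
    )
```

Because pooling happens once at indexing time, queries are scored exactly as before, just against fewer document-side vectors.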
Experimental Setup and Methodologies
The authors explored three distinct token pooling methods: sequential pooling, k-means-based pooling, and hierarchical clustering-based pooling. These strategies were evaluated at various compression levels, referred to as pooling factors, on a suite of well-established datasets from the BEIR and LoTTE evaluation suites. The primary performance metrics were NDCG@10 for the BEIR datasets and Success@5 for the LoTTE datasets, and the evaluation was conducted in both quantized and unquantized vector settings.
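For contrast with the clustering-based variants, sequential pooling can be read as simply mean-pooling consecutive windows of tokens in document order. The sketch below follows that reading; the exact windowing details are an assumption for illustration.

```python
import numpy as np

def sequential_pool(token_vectors: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Mean-pool consecutive, non-overlapping windows of pool_factor tokens."""
    num_tokens = token_vectors.shape[0]
    return np.stack(
        [
            token_vectors[start : start + pool_factor].mean(axis=0)
            for start in range(0, num_tokens, pool_factor)
        ]
    )
```

This baseline ignores semantic similarity between tokens entirely, which is one plausible reading of why the clustering-based variants hold up better at higher pooling factors.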
Results and Observations
Unquantized Vectors: The results demonstrated that hierarchical clustering-based pooling in particular remained robust even at higher compression levels. A pooling factor of 2, which halves the number of stored vectors, resulted in virtually no degradation in retrieval performance and, in some cases, slight improvements. Increasing the pooling factor to 3 reduced storage by roughly 66% with minimal performance degradation (under 1% on average).
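The storage arithmetic behind these percentages is direct: a pooling factor of $p$ keeps roughly one vector per $p$ tokens, so the fraction of vectors removed is

\[
1 - \frac{1}{p}, \qquad p = 2 \Rightarrow 50\%, \qquad p = 3 \Rightarrow 66.7\%.
\]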
Quantized Vectors: When combined with the standard 2-bit quantization used in ColBERTv2, token pooling continued to perform strongly. A pooling factor of 2 resulted in only a 1.34% average performance degradation, with index sizes comparable to those of single-vector dense representations. The practical implications are significant: the smaller footprint preserves the flexibility to use indexing methods such as HNSW, which support incremental (CRUD-style) updates.
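To make the footprint comparison concrete, a back-of-envelope estimate can combine the quantization bit width with the pooling factor. The helper below is a hypothetical illustration, not a figure from the paper: the corpus statistics are invented, and codebooks and per-vector metadata are deliberately ignored.

```python
def index_size_gib(
    num_docs: int,
    avg_tokens: float,
    dim: int = 128,
    bits_per_dim: int = 2,
    pool_factor: int = 2,
) -> float:
    """Rough index size: pooled vector count times quantized bytes per vector."""
    num_vectors = num_docs * avg_tokens / pool_factor
    bytes_per_vector = dim * bits_per_dim / 8
    return num_vectors * bytes_per_vector / 2**30

# e.g. 8M passages of ~80 tokens, 128-dim vectors, 2-bit codes, pooling factor 2
print(f"{index_size_gib(8_000_000, 80):.1f} GiB")  # ~9.5 GiB
```

Halving the vector count via pooling composes multiplicatively with quantization, which is why the combined index approaches single-vector sizes.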
Cross-Linguistic and Model Generalization: To demonstrate the generalizability of the token pooling approach, the authors conducted additional experiments using a Japanese variant of ColBERT, JaColBERTv2. The results corroborated the method’s efficacy across different languages and models, further highlighting its broad applicability.
Implications and Future Directions
The proposed token pooling method substantially lowers the storage footprint of multi-vector retrieval models while maintaining high retrieval performance. This not only facilitates the broader adoption of models like ColBERT in practical applications but also opens up new avenues for further research in multi-vector compression techniques. Future research could explore more sophisticated clustering algorithms, adaptive pooling strategies that vary across documents, and integration with other forms of index compression.
Conclusion
In summary, the paper introduces a highly effective method for reducing the storage requirements of multi-vector retrieval models through token pooling. The hierarchical clustering-based approach, in particular, shows promise in balancing compression and performance. By significantly diminishing storage footprints and enhancing indexing flexibility, this work contributes valuable advancements to the field of Neural IR, with implications for both theoretical research and practical deployment.
Acknowledgments
The authors express gratitude to Omar Khattab for his enthusiastic encouragement and valuable input during this research.
References
The references in the original paper include pivotal works on neural IR models, retrieval benchmarks, clustering methods, and techniques for vector quantization. Key references are:
- Dense retrieval models \cite{dense}
- ColBERT and ColBERTv2 \cite{colbert,colbertv2}
- BEIR and LoTTE benchmarks \cite{beir,colbertv2}
- Clustering algorithms and related studies \cite{kmeans, hierarch_clustering, wards}
- Quantization and indexing strategies \cite{hnswindex, pq, plaid}
This paper exemplifies how conceptually simple strategies like token pooling can substantially improve the efficiency of retrieval systems built on multi-vector representations while preserving their performance.