Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search (2501.10479v1)

Published 16 Jan 2025 in cs.LG, cs.DB, and cs.IR

Abstract: Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor of 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach is available at https://github.com/facebookresearch/vector_db_id_compression.

Summary

The paper introduces lossless compression techniques, based on asymmetric numeral systems (ANS) and wavelet trees, for the vector IDs stored in approximate nearest neighbor search indexes.
It demonstrates up to 7x compression of vector IDs, yielding a 30% index size reduction on billion-scale datasets without affecting search accuracy.
The study evaluates both offline compression (the whole index is decoded at once) and online compression (IDs are decoded on the fly during search).

An Analysis of "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search"

This paper studies the efficacy of lossless compression techniques applied to specific data structures within approximate nearest neighbor search (ANNS) frameworks. The authors put forward methods for compressing vector identifiers (IDs) and auxiliary data such as links in graph-based indices. The motivation is the observation that while lossy compression methods, such as embedding quantization, reduce the size of the vector data, the metadata associated with these vectors, specifically vector IDs and links, can then represent a significant share of the storage, and in some indices most of it.
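To make this overhead concrete, the following Python sketch computes what share of each stored entry the ID takes, and how much the index shrinks when only the IDs are compressed. The function names and the byte sizes (a 16-byte quantized code and an 8-byte ID) are assumptions chosen for illustration, not the configuration evaluated in the paper.

```python
def id_share(code_bytes: float, id_bytes: float) -> float:
    """Fraction of per-vector storage spent on the ID."""
    return id_bytes / (code_bytes + id_bytes)


def index_reduction(code_bytes: float, id_bytes: float, id_factor: float) -> float:
    """Relative index size reduction when only the IDs shrink by id_factor."""
    before = code_bytes + id_bytes
    after = code_bytes + id_bytes / id_factor
    return 1.0 - after / before


if __name__ == "__main__":
    # Assumed sizes (hypothetical): 16-byte quantized code, 8-byte (64-bit) ID.
    print(f"ID share of each entry: {id_share(16, 8):.1%}")
    print(f"Index reduction at 7x ID compression: {index_reduction(16, 8, 7):.1%}")
```

Under these assumed sizes the ID accounts for a third of each entry, so the achievable saving is capped at that share however well the IDs compress. The smaller the quantized code, the larger the share, and the more ID compression matters.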

Key Contributions

Compression Techniques: The paper introduces lossless compression schemes based on asymmetric numeral systems (ANS) and wavelet trees. By leveraging the fact that the order of IDs and links is not critical within the data structures, significant reductions in storage can be achieved.
Compression Efficacy: For inverted file (IVF) and graph-based indices, the researchers demonstrate that their proposed methods can compress vector IDs by a factor of up to 7 without any adverse impact on the accuracy of search results or the runtime. This results in a 30% reduction in index size for billion-scale datasets.
Offline and Online Settings: The paper distinguishes between offline and online settings. The offline setting assumes the entire index can be compressed and decompressed in one go, typical for storage or transmission. Conversely, in the online setting, the compression must support on-the-fly decompression as needed, making it more challenging.
Graph Compression: The paper extends previous work by developing a graph compression algorithm applicable in the offline setting using random edge coding (REC) and other entropy-based methods, addressing a less-studied aspect of vector search index optimization.
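The gain from ignoring order can be quantified with a standard counting argument: a list of n distinct IDs can be arranged in n! ways, so a coder that treats the list as a set can save up to log2(n!) bits. The Python sketch below compares a fixed-width encoding with the set lower bound log2(C(N, n)) / n bits per ID, where N is the number of vectors in the database. It illustrates the bound only; it is not the ANS or wavelet-tree coder from the paper, and the function names and sizes are hypothetical.

```python
import math


def fixed_width_bits(universe_size: int) -> int:
    """Bits per ID when every ID is stored as a fixed-width integer in [0, N)."""
    return (universe_size - 1).bit_length()


def set_bound_bits_per_id(universe_size: int, n: int) -> float:
    """Lower bound, in bits per ID, for storing n distinct IDs as an unordered set.

    There are C(N, n) possible sets, so log2(C(N, n)) bits suffice in total.
    """
    return math.log2(math.comb(universe_size, n)) / n


if __name__ == "__main__":
    # Assumed sizes (hypothetical): 2**20 vectors, one list holding 1024 IDs.
    N, n = 2**20, 1024
    print(f"fixed width: {fixed_width_bits(N)} bits per ID")
    print(f"set bound:   {set_bound_bits_per_id(N, n):.2f} bits per ID")
```

The bound drops as a list holds a larger fraction of the database, which is why denser lists leave more room for compression. Reaching the bound in practice requires an entropy coder; the fixed-width baseline needs none.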

Implications and Future Developments

The outcomes of this research have several implications for both practical applications and theoretical advancements in large-scale search systems:

Storage Efficiency: The findings suggest that large-scale search indices can shrink substantially with these lossless methods (about 30% on the billion-scale datasets tested), which lowers the memory and storage a deployment needs and, with it, hardware and operational costs.
Enhanced Search Capabilities: By reducing memory overhead, systems can potentially maintain and search larger datasets in memory, improving response times and enabling more complex query handling.
Extended Applications: Although primarily focused on vector search frameworks, these compression strategies could be adapted for other domains where large-scale data indexing and retrieval are pertinent, such as recommendation systems or unstructured data search engines.
Potential Extensions: Future research could explore adaptive strategies that adjust the compression level dynamically based on query patterns, or combine these lossless methods with more advanced quantization techniques to further reduce storage pressure.

Conclusion

The paper addresses an often overlooked inefficiency in ANNS frameworks: the storage cost of metadata, specifically vector IDs and graph links. Applying ANS and wavelet trees to compress this metadata losslessly reduces that cost with no impact on accuracy or search runtime. As AI and search systems continue to scale, such techniques are likely to become more relevant and merit further exploration and refinement.