We define and investigate the problem of $\textit{c-approximate window search}$: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75\times$ speedup over existing solutions at the same level of recall.
This paper introduces the first comprehensive solution to the c-approximate window search problem, combining numeric label-based filters with Approximate Nearest Neighbor Search (ANNS) for enhanced semantic search.
The proposed solution employs a modular tree-based framework and label-space partitioning to efficiently handle window search, providing runtime bounds and optimal partitioning strategies through a comprehensive theoretical analysis.
Empirical validation on real and synthetic datasets shows speedups of up to 75× over existing solutions at the same level of recall.
The work offers practical implications for vector database development and semantic search applications, presenting a versatile framework that can adapt ANNS solutions to complex filtered search needs.
c-approximate window search combines numeric label filters with Approximate Nearest Neighbor Search (ANNS), addressing a gap in efficient large-scale semantic search. It targets scenarios where a result must be close to the query in vector space and must also carry a numeric label inside a query-specified window. The problem arises in many domains, from image retrieval with timestamp filters to product search with cost filters.
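As a point of reference, the guarantee implied by the name can be written out as below. The notation is an assumption, not taken from the paper: $P$ is the dataset, $\ell(p)$ the numeric label of a point $p$, $q$ the query, $[a, b]$ the label window, and $\mathrm{dist}$ the metric. The inequality follows the standard definition of c-approximate nearest neighbor search, restricted to points whose labels fall inside the window.

```latex
$$
\text{return } x \in P \text{ with } a \le \ell(x) \le b
\quad \text{such that} \quad
\mathrm{dist}(q, x) \;\le\; c \cdot \min_{\substack{p \in P \\ a \le \ell(p) \le b}} \mathrm{dist}(q, p).
$$
```

With $c = 1$ this reduces to exact nearest neighbor search over the points in the window.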
The authors present the first comprehensive solution to the c-approximate window search problem, built on a modular tree-based framework combined with label-space partitioning.
The paper’s theoretical contributions lie in the adaptation of segment trees to ANNS, providing a method to structure datasets in a manner that efficiently supports window searches. Another pivotal theoretical advancement is the analysis of optimal partitioning strategies within the label space, bolstering the efficacy of window search algorithms in practice. These theoretical underpinnings give rise to a versatile framework capable of adapting existing ANNS solutions to the novel problem of window search.
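The segment-tree idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `BruteForceIndex` and `WindowSearchTree`, the binary branching, the `leaf_size` parameter (assumed to be at least 1), and the exact-scan index standing in for a real ANNS index are all assumptions made for illustration. Points are sorted by label, every tree node holds an index over a contiguous label range, and a window query is answered by searching only the nodes that lie entirely inside the window plus a short scan at partially covered leaves.

```python
import bisect
import heapq
import math


class BruteForceIndex:
    """Hypothetical stand-in for any ANNS index: exact scan over (id, vector) pairs."""

    def __init__(self, items):
        self.items = list(items)

    def search(self, query, k):
        # Return up to k (distance, id) pairs, closest first.
        scored = ((math.dist(query, vec), pid) for pid, vec in self.items)
        return heapq.nsmallest(k, scored)


class WindowSearchTree:
    """Binary segment tree over label-sorted points, with one index per node."""

    def __init__(self, vectors, labels, leaf_size=4):
        order = sorted(range(len(labels)), key=lambda i: labels[i])
        self.labels = [labels[i] for i in order]        # sorted label values
        self.items = [(i, vectors[i]) for i in order]   # (original id, vector)
        self.leaf_size = leaf_size                      # assumed >= 1
        self.nodes = {}                                 # (lo, hi) -> index
        self._build(0, len(order))

    def _build(self, lo, hi):
        # Each node indexes the sorted positions lo..hi-1.
        self.nodes[(lo, hi)] = BruteForceIndex(self.items[lo:hi])
        if hi - lo > self.leaf_size:
            mid = (lo + hi) // 2
            self._build(lo, mid)
            self._build(mid, hi)

    def search(self, query, low, high, k=1):
        # Translate the label window [low, high] into sorted positions [a, b).
        a = bisect.bisect_left(self.labels, low)
        b = bisect.bisect_right(self.labels, high)
        found = []
        if a < b:
            self._query(0, len(self.labels), a, b, query, k, found)
        return heapq.nsmallest(k, found)

    def _query(self, lo, hi, a, b, query, k, found):
        if b <= lo or hi <= a:
            return  # node is disjoint from the window
        if a <= lo and hi <= b:
            # Node lies fully inside the window: use its prebuilt index.
            found.extend(self.nodes[(lo, hi)].search(query, k))
        elif hi - lo <= self.leaf_size:
            # Partially covered leaf: scan only the in-window points.
            part = BruteForceIndex(self.items[max(lo, a):min(hi, b)])
            found.extend(part.search(query, k))
        else:
            mid = (lo + hi) // 2
            self._query(lo, mid, a, b, query, k, found)
            self._query(mid, hi, a, b, query, k, found)
```

For example, `WindowSearchTree(vectors, labels).search(query, low, high, k=5)` returns up to five `(distance, id)` pairs whose labels lie in `[low, high]`. Because each node here uses an exact scan, the results are exact; swapping in an approximate index per node would trade accuracy for speed, and the number of node indexes consulted per query grows only logarithmically with the dataset size.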
From a practical standpoint, this research has profound implications for the development of vector databases and the enhancement of search functionalities within semantic search applications. The algorithms proposed not only demonstrate significant speed improvements over existing solutions but also open up new possibilities for fine-grained searches across varied datasets. Furthermore, the modularity of the framework ensures its applicability across multiple domains, potentially benefiting a broad spectrum of applications in need of efficient filtered search capabilities.
The exploration opens several avenues for future research, notably in optimizing tree structures for specific types of data distributions and investigating alternative partitioning strategies to further enhance performance. Another area ripe for exploration is the extension of the framework to support multi-dimensional labels, offering a richer set of filtering criteria for complex search scenarios.
This work marks a significant step towards addressing the nuanced needs of semantic search in the era of big data. By marrying numeric label-based filters with ANNS, it paves the way for more sophisticated and efficient search capabilities. The contributions of this research not only solve an existing problem but also lay the groundwork for future advancements in the domain of vector space search.