Approximate Nearest Neighbor Search with Window Filters

(2402.00943)
Published Feb 1, 2024 in cs.DS, cs.IR, cs.LG and

Abstract

We define and investigate the problem of $\textit{c-approximate window search}$: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75\times$ speedup over existing solutions at the same level of recall.
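To make the problem statement concrete, the following is a minimal sketch of an exact, brute-force window search: filter the dataset to the label range, then rank the survivors by distance. It is a baseline for illustration only, not the paper's method; the function name `window_search_bruteforce`, its parameters, and the choice of Euclidean distance are assumptions made for this example.

```python
import numpy as np

def window_search_bruteforce(points, labels, query, lo, hi, k=1):
    """Exact window search: the k nearest points whose label lies in [lo, hi].

    Hypothetical helper for illustration; Euclidean distance is assumed.
    Returns dataset indices ordered from nearest to farthest.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    query = np.asarray(query, dtype=float)

    # Step 1: keep only the points whose label falls inside the window.
    candidates = np.flatnonzero((labels >= lo) & (labels <= hi))
    if candidates.size == 0:
        return []

    # Step 2: rank the surviving points by distance to the query.
    dists = np.linalg.norm(points[candidates] - query, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return [int(i) for i in candidates[order]]
```

The cost of this baseline grows linearly with the number of points inside the window, which is why it becomes slow for wide windows over large datasets.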

Overview

  • This paper introduces the first comprehensive solution to the c-approximate window search problem: Approximate Nearest Neighbor Search (ANNS) in which every point carries a numeric label and each query is restricted to an arbitrary label range.

  • The proposed solution uses a modular tree-based framework and label-space partitioning to answer window queries efficiently; the accompanying theoretical analysis provides runtime bounds and optimal partitioning strategies.

  • Experiments on real and synthetic datasets show up to a 75× speedup over existing solutions at the same level of recall.

  • The work is relevant to vector database development and semantic search applications, because the framework can adapt existing ANNS indexes to filtered search.

Introduction

The c-approximate window search problem combines numeric label filters with Approximate Nearest Neighbor Search (ANNS), addressing a gap in large-scale semantic search. It targets scenarios where a query must return points that are close in vector space and whose labels also fall inside a numeric window. Examples range from image retrieval filtered by timestamp to product search filtered by cost.

Main Contributions

The authors present the first comprehensive solution to the c-approximate window search problem. The work is built around a modular tree-based framework and contributes:

  • A formal definition and examination of the c-approximate window search problem.

  • Novel algorithms that use a tree-based framework and label-space partitioning to answer window queries efficiently.

  • A theoretical analysis that provides runtime bounds and optimal partitioning strategies.

  • Empirical validation of the proposed methods against established baselines, showing up to a 75× speedup at the same level of recall on real-world and synthetic datasets.
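The formal definition mentioned in the first item can be written down along the following lines. This is one natural formalization, modeled on the standard c-approximate nearest neighbor guarantee; the symbols $\ell$, $d$, and $D_{[a,b]}$ are notation assumed for this sketch, and the paper's exact statement may differ.

```latex
% Assumed notation: D is the dataset, \ell(p) the numeric label of point p,
% d(\cdot,\cdot) the distance function, and (q, a, b) a query with window [a, b].
\[
  D_{[a,b]} = \{\, p \in D : a \le \ell(p) \le b \,\}
\]
% A returned point \hat{p} \in D_{[a,b]} is a valid c-approximate answer if
\[
  d(q, \hat{p}) \;\le\; c \cdot \min_{p \in D_{[a,b]}} d(q, p).
\]
```

Setting $a = -\infty$ and $b = \infty$ recovers the usual unfiltered c-approximate nearest neighbor problem.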

Theoretical Insights

The paper’s main theoretical contribution is the adaptation of segment trees to ANNS: the dataset is organized by label so that window queries can be answered efficiently from indexes built over label ranges. A second contribution is the analysis of optimal partitioning strategies within the label space, which guides how window search algorithms should be configured in practice. Together these results yield a modular framework that can adapt existing ANNS solutions to window search.
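The segment-tree idea can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: it assumes a binary tree over the label-sorted points, uses an exact brute-force index (`BruteForceIndex`, a hypothetical stand-in for any ANN index) at every node, and scans partially covered leaves directly. The names `WindowTree`, `leaf_size`, and `make_index` are assumptions made for this example.

```python
import numpy as np

class BruteForceIndex:
    """Stand-in for any ANN index over a fixed subset (exact search here)."""
    def __init__(self, points, ids):
        self.points, self.ids = points, ids

    def query(self, q, k):
        d = np.linalg.norm(self.points - q, axis=1)
        order = np.argsort(d, kind="stable")[:k]
        return [(float(d[i]), int(self.ids[i])) for i in order]

class WindowTree:
    """Binary tree over label-sorted points; each node indexes its range."""
    def __init__(self, points, labels, leaf_size=2, make_index=BruteForceIndex):
        order = np.argsort(labels, kind="stable")   # sort points by label
        self.points = np.asarray(points, dtype=float)[order]
        self.labels = np.asarray(labels)[order]
        self.ids = order                            # original dataset indices
        self.leaf_size, self.make_index = leaf_size, make_index
        self.indexes = {}                           # (start, end) -> index
        self._build(0, len(self.labels))

    def _build(self, start, end):
        if start == end:
            return
        self.indexes[(start, end)] = self.make_index(
            self.points[start:end], self.ids[start:end])
        if end - start > self.leaf_size:            # split internal nodes in half
            mid = (start + end) // 2
            self._build(start, mid)
            self._build(mid, end)

    def query(self, q, lo_label, hi_label, k=1):
        q = np.asarray(q, dtype=float)
        # Translate the label window into a half-open range of sorted positions.
        lo = int(np.searchsorted(self.labels, lo_label, side="left"))
        hi = int(np.searchsorted(self.labels, hi_label, side="right"))
        if lo >= hi:
            return []
        results = []
        self._search(0, len(self.labels), lo, hi, q, k, results)
        results.sort()                              # merge per-node candidates
        return [i for _, i in results[:k]]

    def _search(self, start, end, lo, hi, q, k, results):
        if hi <= start or end <= lo:                # node disjoint from window
            return
        if lo <= start and end <= hi:               # node fully inside window
            results.extend(self.indexes[(start, end)].query(q, k))
        elif end - start <= self.leaf_size:         # partial leaf: scan overlap
            s, e = max(start, lo), min(end, hi)
            results.extend(
                BruteForceIndex(self.points[s:e], self.ids[s:e]).query(q, k))
        else:                                       # partial node: recurse
            mid = (start + end) // 2
            self._search(start, mid, lo, hi, q, k, results)
            self._search(mid, end, lo, hi, q, k, results)
```

A query such as `WindowTree(points, labels).query(q, 10, 30, k=3)` touches only the nodes whose label ranges lie inside the window, plus at most a few boundary leaves. In this sketch each point is stored once per tree level, so the structure trades extra memory for query speed; passing a different `make_index` would swap in another index type without changing the tree logic.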

Practical Implications

From a practical standpoint, this research is relevant to the development of vector databases and to search functionality in semantic search applications. The proposed algorithms are substantially faster than existing solutions at the same recall, which makes fine-grained range filters practical on large datasets. Because the framework is modular, it can wrap different underlying ANNS indexes and therefore applies across domains that need efficient filtered search.

Future Directions

The work opens several avenues for future research, notably optimizing tree structures for specific data distributions and investigating alternative partitioning strategies to further improve performance. Another open direction is extending the framework to multi-dimensional labels, which would allow richer filtering criteria for complex search scenarios.

Conclusion

This work is a significant step toward semantic search that respects numeric constraints at scale. By combining numeric label filters with ANNS, it enables more expressive and efficient search. Beyond solving the window search problem, it lays groundwork for future advances in filtered vector search.
