Approximate Nearest Neighbor Search with Window Filters (2402.00943v2)
Abstract: We define and investigate the problem of $c$-approximate window search: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional $c$-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75\times$ speedup over existing solutions at the same level of recall.
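To make the problem statement concrete, the following is a minimal Python sketch of window search, not the paper's implementation. It shows an exact brute-force baseline and one plausible way a tree over label-sorted points could delegate to a pluggable per-node index. All names here (`brute_force_window_search`, `exact_index`, `WindowTree`, `leaf_size`) are hypothetical, and the per-node "index" is an exact scan standing in for whatever approximate nearest neighbor index would be used in practice.

```python
import math
from bisect import bisect_left, bisect_right


def brute_force_window_search(points, labels, query, lo, hi):
    """Exact baseline: scan every point whose label lies in [lo, hi]."""
    best = None
    for i, (p, lab) in enumerate(zip(points, labels)):
        if lo <= lab <= hi:
            d = math.dist(p, query)
            if best is None or d < best[0]:
                best = (d, i)
    return None if best is None else best[1]


def exact_index(points, ids):
    """Stand-in for an ANN index over `ids` (assumption: any index could be swapped in)."""
    def search(query):
        return min((math.dist(points[i], query), i) for i in ids)
    return search


class WindowTree:
    """Binary tree over points sorted by label; every node indexes its label range."""

    def __init__(self, points, labels, make_index=exact_index, leaf_size=2):
        self.points = points
        self.make_index = make_index
        self.leaf_size = leaf_size
        self.order = sorted(range(len(points)), key=lambda i: labels[i])
        self.sorted_labels = [labels[i] for i in self.order]
        self.root = self._build(0, len(points)) if points else None

    def _build(self, s, e):
        # Node covers label ranks [s, e) and owns an index over those points.
        node = {"s": s, "e": e, "kids": [],
                "search": self.make_index(self.points, self.order[s:e])}
        if e - s > self.leaf_size:
            mid = (s + e) // 2
            node["kids"] = [self._build(s, mid), self._build(mid, e)]
        return node

    def query(self, q, lo, hi):
        # Convert the label window [lo, hi] into a rank range [a, b).
        a = bisect_left(self.sorted_labels, lo)
        b = bisect_right(self.sorted_labels, hi)
        if a >= b:
            return None  # no point has a label inside the window
        out = []
        self._collect(self.root, a, b, q, out)
        return min(out)[1]

    def _collect(self, node, a, b, q, out):
        s, e = node["s"], node["e"]
        if b <= s or e <= a:
            return  # node is disjoint from the window
        if a <= s and e <= b:
            out.append(node["search"](q))  # fully covered: one index query
            return
        if not node["kids"]:
            # Partially covered leaf: scan only the in-window points.
            for i in self.order[max(s, a):min(e, b)]:
                out.append((math.dist(self.points[i], q), i))
            return
        for kid in node["kids"]:
            self._collect(kid, a, b, q, out)


points = [(0, 0), (1, 0), (5, 5), (2, 2), (9, 1), (3, 3), (7, 7)]
labels = [10, 40, 20, 70, 30, 60, 50]  # e.g. timestamps or prices
tree = WindowTree(points, labels)
q = (2.1, 2.0)
print(tree.query(q, 15, 55), brute_force_window_search(points, labels, q, 15, 55))
```

In this sketch a window touches only a logarithmic number of fully covered nodes plus at most two partially covered leaves, so the cost is dominated by a handful of per-node index queries rather than a scan of the whole window. Whether this matches the paper's construction in detail is not something the abstract confirms.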