Approximate Nearest Neighbor Search with Window Filters (2402.00943v2)

Published 1 Feb 2024 in cs.DS, cs.IR, and cs.LG

Abstract: We define and investigate the problem of $\textit{c-approximate window search}$: approximate nearest neighbor search where each point in the dataset has a numeric label, and the goal is to find nearest neighbors to queries within arbitrary label ranges. Many semantic search problems, such as image and document search with timestamp filters, or product search with cost filters, are natural examples of this problem. We propose and theoretically analyze a modular tree-based framework for transforming an index that solves the traditional c-approximate nearest neighbor problem into a data structure that solves window search. On standard nearest neighbor benchmark datasets equipped with random label values, adversarially constructed embeddings, and image search embeddings with real timestamps, we obtain up to a $75\times$ speedup over existing solutions at the same level of recall.


Summary

  • The paper presents a tree-based framework integrating numeric window filters into Approximate Nearest Neighbor Search for robust semantic retrieval.
  • It achieves up to 75× speed improvements over traditional methods while maintaining high recall on both real-world and synthetic datasets.
  • The modular design lets the framework wrap existing ANNS indices, broadening its applicability to settings such as timestamp-filtered image search and cost-filtered product search.

Innovations in Approximate Nearest Neighbor Search: A Dive into Window Filters

Introduction

The c-approximate window search problem integrates numeric label-based filters with Approximate Nearest Neighbor Search (ANNS), addressing a gap in large-scale, efficient semantic search. The work targets scenarios where queries concern not only proximity in vector space but also conformance to numeric criteria defined by window filters. The problem arises across many domains, from timestamp-based image retrieval to budget-limited product search.

Main Contributions

The authors present the first comprehensive solution to the c-approximate window search problem. Built on a modular tree-based framework, the work contributes:

  • A formal definition and examination of the c-approximate window search problem.
  • Novel algorithms utilizing a tree-based framework and label-space partitioning to efficiently tackle window search.
  • A comprehensive theoretical analysis, offering runtime bounds and optimal partitioning strategies.
  • Empirical validation of proposed methods against established baselines, showcasing up to a 75× speed increase without sacrificing recall on real-world and synthetic datasets.
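To make the problem concrete, the following is a minimal exact (c = 1) baseline sketch in Python: filter the dataset by the label window, then rank the survivors by distance. This is the naive pre-filtering approach that the paper's framework is designed to outperform, not the paper's own method; all names here are illustrative.

```python
import math

def window_search(points, labels, query, lo, hi, k=10):
    """Exact window search: the k nearest neighbors of `query` among
    points whose numeric label lies in [lo, hi]. Brute-force baseline,
    O(n) per query; illustrative only."""
    candidates = [
        (math.dist(p, query), i)
        for i, (p, lab) in enumerate(zip(points, labels))
        if lo <= lab <= hi
    ]
    candidates.sort()
    return [i for _, i in candidates[:k]]

# Toy data: labels play the role of timestamps or prices.
points = [(float(i), float(i % 7)) for i in range(100)]
labels = list(range(100))
hits = window_search(points, labels, points[5], lo=0, hi=9, k=3)
assert hits[0] == 5 and all(0 <= labels[i] <= 9 for i in hits)
```

On random labels this baseline scans the whole dataset per query; the paper's contribution is replacing the linear scan with per-range ANN indices.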

Theoretical Insights

The paper’s theoretical contributions lie in the adaptation of segment trees to ANNS, providing a method to structure datasets in a manner that efficiently supports window searches. Another pivotal theoretical advancement is the analysis of optimal partitioning strategies within the label space, bolstering the efficacy of window search algorithms in practice. These theoretical underpinnings give rise to a versatile framework capable of adapting existing ANNS solutions to the novel problem of window search.
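The segment-tree idea above can be sketched as follows: sort the dataset by label, build a segment tree over the sorted order, and attach one nearest-neighbor index per node; a window query is decomposed into O(log n) canonical nodes whose results are merged. This is a hedged sketch under the stated assumptions, not the paper's implementation: `BruteIndex` stands in for any black-box c-ANN index, and all class and method names are illustrative.

```python
import bisect
import math

class BruteIndex:
    """Stand-in for any black-box (c-approximate) NN index over a fixed set."""
    def __init__(self, items):                 # items: list of (id, vector)
        self.items = items
    def search(self, q, k):
        return sorted(((math.dist(v, q), i) for i, v in self.items))[:k]

class WindowTree:
    """Segment tree over label-sorted points; each node owns an index
    for its contiguous label range."""
    def __init__(self, data):                  # data: [(label, id, vector)]
        self.data = sorted(data)
        self.labels = [lab for lab, _, _ in self.data]
        self.root = self._build(0, len(self.data))

    def _build(self, lo, hi):
        node = {"lo": lo, "hi": hi,
                "index": BruteIndex([(i, v) for _, i, v in self.data[lo:hi]])}
        if hi - lo > 1:
            mid = (lo + hi) // 2
            node["left"] = self._build(lo, mid)
            node["right"] = self._build(mid, hi)
        return node

    def query(self, q, lab_lo, lab_hi, k):
        # Map the label window to a rank window, then merge the top-k
        # results from the O(log n) canonical nodes covering it.
        lo = bisect.bisect_left(self.labels, lab_lo)
        hi = bisect.bisect_right(self.labels, lab_hi)
        out = []
        self._collect(self.root, lo, hi, q, k, out)
        return [i for _, i in sorted(out)[:k]]

    def _collect(self, node, lo, hi, q, k, out):
        if node["hi"] <= lo or hi <= node["lo"]:
            return                                  # disjoint from window
        if lo <= node["lo"] and node["hi"] <= hi:
            out.extend(node["index"].search(q, k))  # canonical node
            return
        self._collect(node["left"], lo, hi, q, k, out)
        self._collect(node["right"], lo, hi, q, k, out)

# 16 points on a line, label == id for clarity.
tree = WindowTree([(i, i, (float(i),)) for i in range(16)])
hits = tree.query((5.0,), lab_lo=4, lab_hi=9, k=3)
assert hits == [5, 4, 6]
```

Storage doubles per tree level (each point appears in O(log n) indices), which is the trade-off the paper's partitioning analysis addresses: choosing how to split the label space so that any query window is covered by few indices without excessive replication.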

Practical Implications

From a practical standpoint, this research has profound implications for the development of vector databases and the enhancement of search functionalities within semantic search applications. The algorithms proposed not only demonstrate significant speed improvements over existing solutions but also open up new possibilities for fine-grained searches across varied datasets. Furthermore, the modularity of the framework ensures its applicability across multiple domains, potentially benefiting a broad spectrum of applications in need of efficient filtered search capabilities.

Future Directions

The exploration opens several avenues for future research, notably in optimizing tree structures for specific types of data distributions and investigating alternative partitioning strategies to further enhance performance. Another area ripe for exploration is the extension of the framework to support multi-dimensional labels, offering a richer set of filtering criteria for complex search scenarios.

Conclusion

This work marks a significant step towards addressing the nuanced needs of semantic search in the era of big data. By marrying numeric label-based filters with ANNS, it paves the way for more sophisticated and efficient search capabilities. The contributions of this research not only solve an existing problem but also lay the groundwork for future advancements in the domain of vector space search.
