
Distance Comparison Operators for Approximate Nearest Neighbor Search: Exploration and Benchmark (2403.13491v1)

Published 20 Mar 2024 in cs.DB

Abstract: Approximate nearest neighbor search (ANNS) on high-dimensional vectors has become a fundamental component of many machine learning tasks. Prior research has shown that the distance comparison operation is the bottleneck of ANNS and determines both query and indexing performance. To address this bottleneck, several methods have been proposed recently. Their basic idea is to estimate the actual distance with fewer calculations, at the cost of some accuracy. Inspired by this, we propose that some classical techniques and deep learning models can also be adapted to this purpose. In this paper, we systematically categorize the techniques that have been or can be used to accelerate distance approximation. To help users understand the pros and cons of the different techniques, we design a fair and comprehensive benchmark, Fudist, which implements these techniques on the same base index and evaluates them on 16 real datasets with several evaluation metrics. Designed as an independent and portable library, Fudist is orthogonal to the specific index structure and can therefore be easily integrated into current ANNS libraries to achieve significant improvements.
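To make the idea of "estimating the actual distance with fewer calculations" concrete, the following is a minimal sketch of one generic way a distance comparison can be shortened: rotate vectors with a random orthogonal matrix, then look at only the first k coordinates. It is not Fudist's API or any specific method from the paper; the function names (`random_rotation`, `partial_sq_distance`, `estimate_sq_distance`, `is_closer_than`) and the parameter `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Hypothetical helper: random orthogonal matrix from the QR factorization
    of a Gaussian matrix. Orthogonal maps preserve Euclidean distances."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def partial_sq_distance(a: np.ndarray, b: np.ndarray, k: int) -> float:
    """Squared distance over the first k coordinates only.
    This is always a lower bound on the full squared distance."""
    diff = a[:k] - b[:k]
    return float(diff @ diff)


def estimate_sq_distance(a: np.ndarray, b: np.ndarray, k: int) -> float:
    """Approximate full squared distance by rescaling the partial one.
    Assumption: inputs were rotated by a random orthogonal matrix, so the
    estimate is right on average but can be off for any single pair."""
    return partial_sq_distance(a, b, k) * len(a) / k


def is_closer_than(a: np.ndarray, b: np.ndarray, threshold_sq: float, k: int):
    """Exact comparison that may stop early.
    Returns (decision, number_of_dimensions_scanned)."""
    partial = partial_sq_distance(a, b, k)
    if partial >= threshold_sq:
        # The lower bound already exceeds the threshold: reject without
        # touching the remaining dimensions.
        return False, k
    rest = partial_sq_distance(a[k:], b[k:], len(a) - k)
    return (partial + rest) < threshold_sq, len(a)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    rot = random_rotation(16)
    query = rot @ rng.standard_normal(16)
    point = rot @ rng.standard_normal(16)
    print(is_closer_than(query, point, threshold_sq=1.0, k=4))
    print(estimate_sq_distance(query, point, k=4))
```

The early-rejecting comparison stays exact because the partial sum can only grow as more dimensions are added, while the rescaled estimate trades accuracy for speed, which is the kind of trade-off the abstract describes.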
