The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds (2405.17813v1)
Abstract: Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world conditions, using vectors generated by deep learning models, remains under-explored. Existing Approximate Nearest Neighbour (ANN) benchmarks and research typically rely on simplistic datasets such as MNIST or SIFT1M and fail to reflect the complexity of current use cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets: synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic, fixed parameterisation used throughout the paper. We find that the recall of approximate HNSW search, relative to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology highlights how insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift model rankings by up to three positions. This work underscores the need for more nuanced benchmarks and design considerations when developing robust vector search systems that use approximate vector search algorithms. The study presents a number of scenarios, of varying real-world applicability, that aim to improve understanding and guide future development of ANN algorithms and embedding models.
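The pointwise Local Intrinsic Dimensionality (LID) measure mentioned in the abstract can be estimated per vector from its nearest-neighbour distances via the maximum-likelihood estimator of Levina and Bickel. Below is a minimal, self-contained sketch (not the paper's implementation; the function name, sample sizes, and `k` are illustrative choices) showing that a low-dimensional manifold embedded in a high-dimensional ambient space keeps a low LID:

```python
import numpy as np

def pointwise_lid(data, k=20):
    """Maximum-likelihood LID estimate (Levina & Bickel style) for each
    point, computed from the distances to its k nearest neighbours."""
    # Brute-force pairwise Euclidean distances; fine for a small demo.
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    # Sort each row and drop the self-distance (the leading zero).
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    r_k = knn[:, -1:]  # distance to the k-th neighbour
    # LID = inverse of the mean log-ratio of r_k to the nearer neighbours.
    return 1.0 / np.log(r_k / knn[:, :-1]).mean(axis=1)

rng = np.random.default_rng(0)
# A 3-dimensional Gaussian linearly embedded in 64 ambient dimensions:
# the vectors have 64 entries, but the intrinsic dimensionality stays ~3.
low = rng.normal(size=(500, 3))
points = low @ rng.normal(size=(3, 64))
print(pointwise_lid(points).mean())  # roughly 3, far below 64
```

Sorting points by such a per-point LID score before insertion is one concrete way to realise the LID-informed insertion orderings the abstract describes.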