Implementing and Evaluating E2LSH on Storage (2403.16404v1)
Abstract: Locality sensitive hashing (LSH) is one of the most widely used approaches to approximate nearest neighbor search (ANNS) in high-dimensional spaces. The first work on LSH for the Euclidean distance, E2LSH, showed that ANNS can be solved with theoretically guaranteed accuracy in query time that is sublinear in the database size, although it required a large hash index. Since then, several LSH variants with much smaller indexes have been proposed. Their query time is linear or superlinear, but they have been shown to run faster in practice because they require fewer I/Os when the index is stored on hard disk drives and because modern DRAM capacity allows them to run in memory. In this paper, we show that E2LSH is regaining its advantage in query speed with the advent of modern flash storage devices such as solid-state drives (SSDs). We evaluate E2LSH in a modern single-node computing environment and analyze its computational and I/O costs, from which we derive the storage performance required for its external-memory execution. Our analysis indicates that E2LSH on a single consumer-grade SSD can run faster than state-of-the-art small-index methods executed in memory. It also indicates that E2LSH with emerging high-performance storage devices and interfaces can approach the speed of in-memory E2LSH. We implement a simple adaptation of E2LSH to external memory, E2LSH-on-Storage (E2LSHoS), and evaluate it on practical large datasets of up to one billion objects using different combinations of modern storage devices and interfaces. We demonstrate that our E2LSHoS implementation runs much faster than small-index methods and can approach in-memory E2LSH speeds, and that its query time scales sublinearly with the database size beyond the index size limit of in-memory E2LSH.
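For readers unfamiliar with the scheme, the following is a minimal in-memory sketch of the p-stable hashing on which LSH for the Euclidean distance is commonly built, with each hash function of the form h(v) = floor((a·v + b) / w). It is not the authors' implementation and does not model the external-memory layout of E2LSHoS. The class name `E2LSHIndex` and the parameter values `k`, `L`, and `w` are hypothetical choices made only for illustration.

```python
import math
import random
from collections import defaultdict


class E2LSHIndex:
    """Toy p-stable LSH index for Euclidean distance (illustrative only)."""

    def __init__(self, dim, k=4, L=8, w=4.0, seed=0):
        # k, L, w are assumed values, not parameters taken from the paper.
        rng = random.Random(seed)
        self.w = w
        # L tables, each keyed by k hash functions h(v) = floor((a.v + b) / w),
        # with a drawn from N(0, 1)^dim and b drawn uniformly from [0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.random() * w)
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, table_id, v):
        # Concatenate the k hash values into one bucket key for this table.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[table_id]
        )

    def add(self, v):
        # Store the point and register its id in one bucket per table.
        idx = len(self.points)
        self.points.append(list(v))
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)
        return idx

    def query(self, q):
        # Gather candidates from the colliding bucket of each table,
        # then rank them by true Euclidean distance.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), ()))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(self.points[i], q))
```

In this sketch every bucket lives in a Python dictionary. In an external-memory setting such as the one the abstract describes, the bucket lookups would instead be reads against an index held on storage, which is why the I/O cost per query matters.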