BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU (2401.11324v2)
Abstract: Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are more efficient in practice than the other methods proposed in the literature. The growing volume and dimensionality of data necessitate scalable techniques for ANNS. To this end, prior work has explored parallelizing graph-based ANNS on GPUs to exploit their massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in GPU memory, or (ii) partition the dataset into small independent shards, each of which fits in GPU memory, and search these shards on the GPU. The first approach cannot handle large datasets because GPU memory is limited; the second performs poorly on large datasets because of heavy data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot fit entirely in GPU memory. BANG uses a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept in host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on billion-size datasets, we achieve throughput 40x-200x higher than the competing methods at a high recall of 0.9. Additionally, BANG is the best in cost- and power-efficiency among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.
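The abstract names two ingredients: distance computations on a compressed copy of the dataset, and a graph index that is stored separately and consulted only for neighbour lists. The sketch below illustrates that division of labour on the CPU with NumPy. It is not BANG's implementation. It assumes product quantization as the compression scheme and a plain best-first traversal, neither of which the abstract specifies, and every function name (`train_pq`, `encode_pq`, `distance_table`, `approx_dist`, `build_toy_graph`, `greedy_search`) is hypothetical.

```python
import heapq
import numpy as np

def train_pq(data, n_sub, n_centroids, rng):
    """Toy codebooks: sample centroids from the data (no k-means, for brevity)."""
    sub_dim = data.shape[1] // n_sub
    codebooks = np.empty((n_sub, n_centroids, sub_dim), dtype=np.float32)
    for s in range(n_sub):
        picks = rng.choice(len(data), size=n_centroids, replace=False)
        codebooks[s] = data[picks, s * sub_dim:(s + 1) * sub_dim]
    return codebooks

def encode_pq(data, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    n_sub, _, sub_dim = codebooks.shape
    codes = np.empty((len(data), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        chunk = data[:, s * sub_dim:(s + 1) * sub_dim]
        d = ((chunk[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(axis=2)
        codes[:, s] = d.argmin(axis=1)
    return codes

def distance_table(query, codebooks):
    """Per-query table of squared distances from each query sub-vector to each centroid."""
    n_sub, _, sub_dim = codebooks.shape
    q = query.reshape(n_sub, 1, sub_dim)
    return ((codebooks - q) ** 2).sum(axis=2)  # shape (n_sub, n_centroids)

def approx_dist(table, codes, ids):
    """Approximate squared distances for `ids`, using only the compressed codes."""
    c = codes[ids]                                      # (len(ids), n_sub)
    return table[np.arange(table.shape[0]), c].sum(axis=1)

def build_toy_graph(data, degree):
    """Brute-force k-NN graph plus a ring edge so every node is reachable."""
    d = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :degree]
    n = len(data)
    return {i: sorted({int(v) for v in nn[i]} | {(i + 1) % n}) for i in range(n)}

def greedy_search(graph, codes, table, start, k, list_size):
    """Best-first traversal; `graph` stands in for an index kept apart from the codes."""
    d0 = float(approx_dist(table, codes, [start])[0])
    visited = {start}
    frontier = [(d0, start)]   # min-heap of candidates not yet expanded
    best = [(-d0, start)]      # max-heap (negated) of the closest list_size points
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) == list_size and d > -best[0][0]:
            break              # no remaining candidate can improve the list
        nbrs = [v for v in graph[node] if v not in visited]  # index lookup
        if not nbrs:
            continue
        visited.update(nbrs)
        for dist, v in zip(approx_dist(table, codes, nbrs), nbrs):  # distance math
            dist = float(dist)
            if len(best) < list_size or dist < -best[0][0]:
                heapq.heappush(frontier, (dist, v))
                heapq.heappush(best, (-dist, v))
                if len(best) > list_size:
                    heapq.heappop(best)
    return [v for _, v in sorted((-nd, v) for nd, v in best)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((500, 16)).astype(np.float32)
    codebooks = train_pq(data, n_sub=4, n_centroids=32, rng=rng)
    codes = encode_pq(data, codebooks)
    graph = build_toy_graph(data, degree=8)
    query = rng.standard_normal(16).astype(np.float32)
    table = distance_table(query, codebooks)
    print(greedy_search(graph, codes, table, start=0, k=5, list_size=32))
```

In this toy setup the full-precision vectors are needed only to build the codebooks and the graph; the search itself touches just the one-byte-per-sub-vector codes and the adjacency lists. That mirrors the split described in the abstract only loosely, since everything here runs in a single process and address space.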