
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU (2401.11324v2)

Published 20 Jan 2024 in cs.DC

Abstract: Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature. The growing volume and dimensionality of data necessitates designing scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPU leveraging its massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in the GPU memory, or (ii) they partition the dataset into small independent shards, each of which can fit in GPU memory, and perform the search on these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the latter delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot entirely fit in the GPU memory. BANG stands out by harnessing a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept on the host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on the billion-size datasets, we achieve throughputs 40x-200x more than the competing methods for a high recall value of 0.9. Additionally, BANG is the best in cost- and power-efficiency among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.


Summary

  • The paper introduces a novel hybrid CPU-GPU architecture that leverages compressed vector representations to enable efficient billion-scale ANN search.
  • It employs Product Quantization to compress dataset vectors, reducing the GPU memory footprint and optimizing distance computations.
  • Evaluation on benchmark datasets shows 40x–200x throughput improvements with high recall, demonstrating the method's practical efficiency.

BANG: Billion-Scale Approximate Nearest Neighbor Search Using a Single GPU

The paper "BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU" presents BANG, a method for Approximate Nearest Neighbor Search (ANNS) that operates efficiently on billion-scale datasets using a single GPU. BANG addresses the core challenges posed by datasets and graph indices too large for GPU memory while maintaining high recall and throughput.

Key Contributions

  1. Hybrid Architecture: The method divides work between the CPU and the GPU. The core innovation lies in using compressed data on the GPU for distance calculations while keeping the graph index in host (CPU) memory. This split balances the workload, optimizes resource usage, and reduces data-transfer overhead.
  2. Compressed Vector Representation: By employing Product Quantization (PQ), the method compresses the dataset vectors before they are processed on the GPU. This compression significantly reduces the GPU memory footprint, enabling billion-scale datasets to be searched (see the PQ sketch after this list).
  3. Optimized GPU Kernels: The implementation includes highly optimized GPU kernels for operations such as distance computations, sorting, and updating worklists. These optimizations ensure efficient utilization of GPU resources and maximize the throughput of the ANNS process.
  4. CPU-GPU Synchronization: The method minimizes data transfer between CPU and GPU by overlapping communication with computation. Asynchronous memcpy APIs and CUDA streams are used to hide transfer latencies and keep both processors busy concurrently (see the stream-overlap sketch after this list).
  5. Evaluation on Benchmark Datasets: The method is evaluated on ten popular ANN benchmark datasets using a single NVIDIA A100 GPU. The results demonstrate substantial performance improvements over existing state-of-the-art methods, especially on large datasets.
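
To make the compression step concrete, here is a minimal NumPy sketch of PQ training and encoding. It is a toy CPU illustration, not the paper's implementation: the subspace count m, codebook size k, and the brute-force k-means are all illustrative choices.

```python
import numpy as np

def pq_train(X, m=8, k=256, iters=10, seed=0):
    """Train one k-means codebook per subspace. X: (n, d) with d divisible by m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // m
    codebooks = np.empty((m, k, ds), dtype=np.float32)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):  # plain Lloyd iterations, brute force for clarity
            assign = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(axis=0)
        codebooks[j] = cent
    return codebooks

def pq_encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid (1 byte each)."""
    m, k, ds = codebooks.shape
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1).argmin(1)
    return codes
```

With d = 128 and m = 8, for example, each float32 vector shrinks from 512 bytes to 8 bytes of codes (plus the small shared codebooks), a 64x reduction.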
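
The overlap of transfers and compute in contribution 4 can be illustrated with CUDA streams. The sketch below uses CuPy for brevity rather than the raw CUDA C++ the paper targets, and the chunked copy-then-reduce pattern is a stand-in for the actual search kernels.

```python
import numpy as np
import cupy as cp

n_chunks, chunk = 4, 1 << 20
host = [np.random.rand(chunk).astype(np.float32) for _ in range(n_chunks)]
# NOTE: true copy/compute overlap additionally requires pinned (page-locked) host memory.

copy_stream = cp.cuda.Stream(non_blocking=True)
compute_stream = cp.cuda.Stream(non_blocking=True)
bufs = [cp.empty(chunk, dtype=cp.float32) for _ in range(n_chunks)]
out = []

for i in range(n_chunks):
    bufs[i].set(host[i], stream=copy_stream)         # async host-to-device copy
    compute_stream.wait_event(copy_stream.record())  # compute waits only for this copy
    with compute_stream:
        out.append(bufs[i].sum())                    # stand-in for a distance kernel

compute_stream.synchronize()
print([float(x) for x in out])
```

While chunk i is being processed on the compute stream, the copy for chunk i+1 can already be in flight on the copy stream, which is the latency-hiding pattern the paper describes.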

Detailed Overview

Background and Motivation

ANNS is a crucial algorithm in many fields, including information retrieval, pattern recognition, and data mining. The increasing volume and dimensionality of data necessitate scalable techniques for ANNS. Traditional graph-based ANNS algorithms have shown practical efficiency on large datasets, but their GPU-based implementations face challenges related to GPU memory limitations and CPU-GPU data transfer bottlenecks.

Methodology

The proposed method addresses these challenges by using a hybrid CPU-GPU approach where:

  • The compressed vector data resides on the GPU, where fast distance computations are performed.
  • The graph index is kept in host (CPU) memory, which can accommodate the large footprint of billion-scale graphs.
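
Some back-of-the-envelope arithmetic shows why compression is essential (illustrative figures; the paper's exact PQ configuration may differ): SIFT1B holds one billion 128-dimensional byte vectors, roughly 128 GB of raw data, which exceeds the 40-80 GB of memory on an A100 GPU. A PQ code of, say, 32 bytes per vector reduces this to about 32 GB, which fits on the GPU alongside working buffers, while the much larger graph index stays in host RAM.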

Core Algorithm

The ANNS process is divided into three primary stages (a minimal sketch of all three follows this list):

  1. Distance Table Construction: For each query, pre-computation of squared distances to the PQ centroids in every subspace, yielding a lookup table for cheap approximate distance evaluation.
  2. ANN Search: Iterative best-first search over the graph structure, maintaining priority worklists and using the table-based distance calculations.
  3. Re-ranking: Final adjustment of the nearest-neighbor list using exact distances to improve recall.
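
The following CPU-only NumPy sketch walks through the three stages end to end. It reuses pq_train and pq_encode from the earlier sketch, substitutes a random graph for a real graph index, and uses illustrative names (greedy_search, rerank) rather than the paper's API.

```python
import numpy as np

def build_distance_table(q, codebooks):
    """Stage 1: dt[j, c] = squared distance from query subvector j to centroid c."""
    m, k, ds = codebooks.shape
    dt = np.empty((m, k), dtype=np.float32)
    for j in range(m):
        diff = codebooks[j] - q[j * ds:(j + 1) * ds]
        dt[j] = (diff ** 2).sum(axis=1)
    return dt

def greedy_search(graph, codes, dt, start, L=32):
    """Stage 2: best-first search keeping the L closest nodes seen so far
    (the 'worklist'); expand the closest unexpanded one until none remain."""
    m = dt.shape[0]
    approx = lambda v: float(dt[np.arange(m), codes[v]].sum())  # PQ table lookup
    pool = {start: approx(start)}
    expanded = set()
    while True:
        best = sorted(pool.items(), key=lambda kv: kv[1])[:L]
        todo = [v for v, _ in best if v not in expanded]
        if not todo:
            return [v for v, _ in best]  # worklist fully expanded: done
        u = todo[0]
        expanded.add(u)
        for v in graph[u]:               # visit u's out-neighbors
            if v not in pool:
                pool[v] = approx(v)

def rerank(candidates, X, q, topk=10):
    """Stage 3: exact distances on the shortlist only, recovering recall
    lost to the PQ approximation."""
    d = ((X[candidates] - q) ** 2).sum(axis=1)
    return [candidates[i] for i in np.argsort(d)[:topk]]

# end-to-end toy run
rng = np.random.default_rng(1)
X = rng.random((2000, 32), dtype=np.float32)
codebooks = pq_train(X, m=4, k=16, iters=5)  # from the earlier PQ sketch
codes = pq_encode(X, codebooks)
graph = {u: rng.choice(2000, 8, replace=False).tolist() for u in range(2000)}
q = rng.random(32, dtype=np.float32)
shortlist = greedy_search(graph, codes, dt=build_distance_table(q, codebooks), start=0)
print(rerank(shortlist, X, q, topk=5))
```

In the paper's setting, stages 1 and 2 run on the GPU over the compressed codes while the CPU serves graph adjacency lists; here everything runs on the CPU purely to show the data flow.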

Performance Evaluation

The method's performance was evaluated on billion-scale datasets such as SIFT1B, DEEP1B, and SPACEV1B, achieving throughputs 40x-200x higher than competing methods at a high recall of 0.9. Evaluations on smaller datasets showed that the method is almost always faster than, or comparable to, state-of-the-art methods.

Implications and Future Work

This research has significant implications for the practical application of ANNS, particularly in resource-constrained environments where a single GPU is all that is available. Combining compressed-data computations with optimized parallel processing opens new avenues for efficient data processing at scale.

Future developments could focus on further reducing the memory footprint, improving compression techniques, and extending the method to multi-GPU systems. Additionally, exploring different graph structures and optimization algorithms could further enhance the performance and applicability of the method.

In conclusion, this paper introduces an efficient and scalable solution to ANNS challenges on a single GPU, demonstrating remarkable improvements in throughput and recall for billion-scale datasets. These advancements pave the way for more practical and widespread use of ANNS in large-scale data processing applications.
