
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU (2401.11324v2)

Published 20 Jan 2024 in cs.DC

Abstract: Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature. The growing volume and dimensionality of data necessitates designing scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPU leveraging its massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in the GPU memory, or (ii) they partition the dataset into small independent shards, each of which can fit in GPU memory, and perform the search on these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the latter delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe bus. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets that cannot entirely fit in the GPU memory. BANG stands out by harnessing a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept on the host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimized GPU kernels and proceeds in phases that run concurrently on the GPU and CPU. Notably, on the billion-size datasets, we achieve throughputs 40x-200x more than the competing methods for a high recall value of 0.9. Additionally, BANG is the best in cost- and power-efficiency among the competing methods from the recent Billion-Scale Approximate Nearest Neighbour Search Challenge.


Summary

  • The paper introduces a novel hybrid CPU-GPU architecture that leverages compressed vector representations to enable efficient billion-scale ANN search.
  • It employs Product Quantization to compress dataset vectors, reducing the GPU memory footprint and optimizing distance computations.
  • Evaluation on benchmark datasets shows 40x–200x throughput improvements with high recall, demonstrating the method's practical efficiency.

BANG: Billion-Scale Approximate Nearest Neighbor Search Using a Single GPU

The paper "BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU" presents BANG, a method for Approximate Nearest Neighbor Search (ANNS) that operates efficiently on billion-scale datasets using a single GPU. BANG addresses the core challenges posed by datasets and graph indices too large for GPU memory while maintaining high recall and throughput.

Key Contributions

  1. Hybrid Architecture: The method divides work between the CPU and the GPU. The core innovation lies in using compressed data on the GPU for distance calculations while keeping the graph index in host (CPU) memory. This split balances the workload, optimizes resource usage, and reduces data-transfer overhead.
  2. Compressed Vector Representation: By employing Product Quantization (PQ), the method compresses the dataset vectors before they are processed on the GPU. This compression significantly reduces the GPU memory footprint, enabling billion-scale datasets to be searched (see the PQ sketch after this list).
  3. Optimized GPU Kernels: The implementation includes highly optimized GPU kernels for operations such as distance computations, sorting, and updating worklists. These optimizations ensure efficient utilization of GPU resources and maximize the throughput of the ANNS process.
  4. CPU-GPU Synchronization: The method minimizes data transfer between CPU and GPU by overlapping communication with computation. Asynchronous memcpy APIs and CUDA streams are used to hide transfer latencies and keep both processors busy concurrently (see the stream-overlap sketch after this list).
  5. Evaluation on Benchmark Datasets: The method is evaluated on ten popular ANN benchmark datasets using a single NVIDIA A100 GPU. The results demonstrate substantial performance improvements over existing state-of-the-art methods, especially on large datasets.
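
To make the compression step concrete, here is a minimal NumPy sketch of PQ training and encoding. It is a toy CPU illustration, not the paper's implementation: the subspace count m, codebook size k, and the brute-force k-means are all illustrative choices.

```python
import numpy as np

def pq_train(X, m=8, k=256, iters=10, seed=0):
    """Train one k-means codebook per subspace. X: (n, d) with d divisible by m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // m
    codebooks = np.empty((m, k, ds), dtype=np.float32)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)].copy()
        for _ in range(iters):  # plain Lloyd iterations, brute force for clarity
            assign = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(axis=0)
        codebooks[j] = cent
    return codebooks

def pq_encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid (1 byte each)."""
    m, k, ds = codebooks.shape
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        codes[:, j] = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1).argmin(1)
    return codes
```

With d = 128 and m = 8, for example, each float32 vector shrinks from 512 bytes to 8 bytes of codes (plus the small shared codebooks), a 64x reduction.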
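
The overlap of transfers and compute in contribution 4 can be illustrated with CUDA streams. The sketch below uses CuPy for brevity rather than the raw CUDA C++ the paper targets, and the chunked copy-then-reduce pattern is a stand-in for the actual search kernels.

```python
import numpy as np
import cupy as cp

n_chunks, chunk = 4, 1 << 20
host = [np.random.rand(chunk).astype(np.float32) for _ in range(n_chunks)]
# NOTE: true copy/compute overlap additionally requires pinned (page-locked) host memory.

copy_stream = cp.cuda.Stream(non_blocking=True)
compute_stream = cp.cuda.Stream(non_blocking=True)
bufs = [cp.empty(chunk, dtype=cp.float32) for _ in range(n_chunks)]
out = []

for i in range(n_chunks):
    bufs[i].set(host[i], stream=copy_stream)         # async host-to-device copy
    compute_stream.wait_event(copy_stream.record())  # compute waits only for this copy
    with compute_stream:
        out.append(bufs[i].sum())                    # stand-in for a distance kernel

compute_stream.synchronize()
print([float(x) for x in out])
```

While chunk i is being processed on the compute stream, the copy for chunk i+1 can already be in flight on the copy stream, which is the latency-hiding pattern the paper describes.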

Detailed Overview

Background and Motivation

ANNS is a crucial algorithm in many fields, including information retrieval, pattern recognition, and data mining. The increasing volume and dimensionality of data necessitate scalable techniques for ANNS. Traditional graph-based ANNS algorithms have shown practical efficiency on large datasets, but their GPU-based implementations face challenges related to GPU memory limitations and CPU-GPU data transfer bottlenecks.

Methodology

The proposed method addresses these challenges by using a hybrid CPU-GPU approach where:

  • The compressed vector data resides on the GPU, where fast distance computations are performed.
  • The graph index is kept in host (CPU) memory, which can accommodate the large footprint of billion-scale graphs.
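
Some back-of-the-envelope arithmetic shows why compression is essential (illustrative figures; the paper's exact PQ configuration may differ): SIFT1B holds one billion 128-dimensional byte vectors, roughly 128 GB of raw data, which exceeds the 40-80 GB of memory on an A100 GPU. A PQ code of, say, 32 bytes per vector reduces this to about 32 GB, which fits on the GPU alongside working buffers, while the much larger graph index stays in host RAM.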

Core Algorithm

The ANNS process is divided into three primary stages (a minimal sketch of all three follows this list):

  1. Distance Table Construction: For each query, pre-computation of squared distances to the PQ centroids in every subspace, yielding a lookup table for cheap approximate distance evaluation.
  2. ANN Search: Iterative best-first search over the graph structure, maintaining priority worklists and using the table-based distance calculations.
  3. Re-ranking: Final adjustment of the nearest-neighbor list using exact distances to improve recall.
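
The following CPU-only NumPy sketch walks through the three stages end to end. It reuses pq_train and pq_encode from the earlier sketch, substitutes a random graph for a real graph index, and uses illustrative names (greedy_search, rerank) rather than the paper's API.

```python
import numpy as np

def build_distance_table(q, codebooks):
    """Stage 1: dt[j, c] = squared distance from query subvector j to centroid c."""
    m, k, ds = codebooks.shape
    dt = np.empty((m, k), dtype=np.float32)
    for j in range(m):
        diff = codebooks[j] - q[j * ds:(j + 1) * ds]
        dt[j] = (diff ** 2).sum(axis=1)
    return dt

def greedy_search(graph, codes, dt, start, L=32):
    """Stage 2: best-first search keeping the L closest nodes seen so far
    (the 'worklist'); expand the closest unexpanded one until none remain."""
    m = dt.shape[0]
    approx = lambda v: float(dt[np.arange(m), codes[v]].sum())  # PQ table lookup
    pool = {start: approx(start)}
    expanded = set()
    while True:
        best = sorted(pool.items(), key=lambda kv: kv[1])[:L]
        todo = [v for v, _ in best if v not in expanded]
        if not todo:
            return [v for v, _ in best]  # worklist fully expanded: done
        u = todo[0]
        expanded.add(u)
        for v in graph[u]:               # visit u's out-neighbors
            if v not in pool:
                pool[v] = approx(v)

def rerank(candidates, X, q, topk=10):
    """Stage 3: exact distances on the shortlist only, recovering recall
    lost to the PQ approximation."""
    d = ((X[candidates] - q) ** 2).sum(axis=1)
    return [candidates[i] for i in np.argsort(d)[:topk]]

# end-to-end toy run
rng = np.random.default_rng(1)
X = rng.random((2000, 32), dtype=np.float32)
codebooks = pq_train(X, m=4, k=16, iters=5)  # from the earlier PQ sketch
codes = pq_encode(X, codebooks)
graph = {u: rng.choice(2000, 8, replace=False).tolist() for u in range(2000)}
q = rng.random(32, dtype=np.float32)
shortlist = greedy_search(graph, codes, dt=build_distance_table(q, codebooks), start=0)
print(rerank(shortlist, X, q, topk=5))
```

In the paper's setting, stages 1 and 2 run on the GPU over the compressed codes while the CPU serves graph adjacency lists; here everything runs on the CPU purely to show the data flow.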

Performance Evaluation

The method's performance was evaluated on billion-scale datasets such as SIFT1B, DEEP1B, and SPACEV1B, achieving throughputs 40x-200x higher than competing methods at a high recall of 0.9. Evaluations on smaller datasets showed that the method is almost always faster than, or comparable to, state-of-the-art methods.

Implications and Future Work

This research has significant implications for the practical application of ANNS, particularly in resource-constrained environments where a single GPU is all that is available. Combining compressed-data computations with optimized parallel processing opens new avenues for efficient data processing at scale.

Future developments could focus on further reducing the memory footprint, improving compression techniques, and extending the method to multi-GPU systems. Additionally, exploring different graph structures and optimization algorithms could further enhance the performance and applicability of the method.

In conclusion, this paper introduces an efficient and scalable solution to ANNS challenges on a single GPU, demonstrating remarkable improvements in throughput and recall for billion-scale datasets. These advancements pave the way for more practical and widespread use of ANNS in large-scale data processing applications.
