FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search (2409.16576v1)

Published 25 Sep 2024 in cs.IR, cs.DB, and cs.OS

Abstract: Approximate nearest neighbor search (ANNS) has emerged as a crucial component of database and AI infrastructure. Ever-increasing vector datasets pose significant challenges in terms of performance, cost, and accuracy for ANNS services. No modern ANNS system can address these issues simultaneously. We present FusionANNS, a high-throughput, low-latency, cost-efficient, and high-accuracy ANNS system for billion-scale datasets using SSDs and only one entry-level GPU. The key idea of FusionANNS lies in CPU/GPU collaborative filtering and re-ranking mechanisms, which significantly reduce I/O operations across CPUs, GPU, and SSDs to break through the I/O performance bottleneck. Specifically, we propose three novel designs: (1) multi-tiered indexing to avoid data swapping between CPUs and GPU, (2) heuristic re-ranking to eliminate unnecessary I/Os and computations while guaranteeing high accuracy, and (3) redundant-aware I/O deduplication to further improve I/O efficiency. We implement FusionANNS and compare it with the state-of-the-art SSD-based ANNS system -- SPANN and GPU-accelerated in-memory ANNS system -- RUMMY. Experimental results show that FusionANNS achieves 1) 9.4-13.1X higher query per second (QPS) and 5.7-8.8X higher cost efficiency compared with SPANN; 2) 2-4.9X higher QPS and 2.3-6.8X higher cost efficiency compared with RUMMY, while guaranteeing low latency and high accuracy.

Summary

  • The paper introduces a novel multi-tiered indexing scheme that distributes raw vectors, compressed vectors, and vector IDs across SSDs, GPU memory, and host memory.
  • It employs heuristic re-ranking and redundant-aware I/O deduplication to preserve high accuracy while minimizing costly I/O operations and redundant computation.
  • Empirical results demonstrate 9.4–13.1× higher QPS and 5.7–8.8× higher cost efficiency than SPANN, and 2–4.9× higher QPS and 2.3–6.8× higher cost efficiency than RUMMY.

Overview of FusionANNS for Billion-Scale Approximate Nearest Neighbor Search

The paper "FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search" addresses the critical performance bottlenecks in Approximate Nearest Neighbor Search (ANNS) services by proposing a solution that leans on the synergy between CPUs, GPUs, and SSD storage. The authors, associated with both the National Engineering Research Center for Big Data Technology and System at Huazhong University of Science and Technology, as well as Huawei Technologies Co., Ltd, articulate the design and implementation of the FusionANNS system, which focuses on high throughput, reduced latency, cost efficiency, and accurate search results.

Key Contributions

The paper introduces three principal innovations within FusionANNS:

  1. Multi-tiered Indexing: This design avoids data swapping between CPU and GPU by distributing the index across SSDs, the GPU's HBM, and host memory: raw vectors reside on SSDs, compressed vectors in GPU memory, and vector IDs in main memory. This placement minimizes data transfer and makes the best use of each tier's capacity and bandwidth (a data-layout sketch follows this list).
  2. Heuristic Re-ranking: To preserve accuracy without unnecessary I/O and computation, re-ranking proceeds in mini-batches; after each mini-batch, the system checks whether further re-ranking is likely to improve the result and terminates early when it is not (see the re-ranking sketch after this list).
  3. Redundant-aware I/O Deduplication: By optimizing the storage layout on SSDs to capitalize on spatial locality, the system minimizes read amplification and merges I/O operations within and across mini-batches.
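To make the tiered placement in item 1 concrete, here is a minimal, self-contained Python sketch of how the three tiers might be laid out and used to produce approximate candidates. All names, the simplified flat product-quantization step, and the NumPy arrays standing in for SSD and GPU memory are illustrative assumptions, not the authors' implementation or API.

```python
# Illustrative sketch of the multi-tiered data placement; tiers are simulated with NumPy.
import numpy as np

rng = np.random.default_rng(0)

NUM_VECTORS, DIM, NUM_SUBSPACES = 10_000, 128, 16
SUB_DIM = DIM // NUM_SUBSPACES

# Tier 1 (SSD in FusionANNS): full-precision raw vectors, touched only during re-ranking.
raw_vectors_on_ssd = rng.standard_normal((NUM_VECTORS, DIM)).astype(np.float32)

# Tier 2 (GPU HBM in FusionANNS): compact PQ codes, small enough to stay resident,
# so no vector data needs to be swapped between host and device per query.
codebooks = rng.standard_normal((NUM_SUBSPACES, 256, SUB_DIM)).astype(np.float32)

def pq_encode(vectors: np.ndarray) -> np.ndarray:
    """Map each vector to one uint8 codeword index per subspace."""
    codes = np.empty((len(vectors), NUM_SUBSPACES), dtype=np.uint8)
    for s in range(NUM_SUBSPACES):
        sub = vectors[:, s * SUB_DIM:(s + 1) * SUB_DIM]
        dists = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(axis=1)
    return codes

pq_codes_on_gpu = pq_encode(raw_vectors_on_ssd)

# Tier 3 (host DRAM in FusionANNS): lightweight vector IDs / posting-list metadata.
vector_ids_in_dram = np.arange(NUM_VECTORS, dtype=np.int64)

def approx_distances(query: np.ndarray) -> np.ndarray:
    """GPU-side step in the real system: rank candidates using only the PQ codes."""
    # Distance from each query subvector to every codeword (an ADC lookup table).
    tables = np.stack([
        ((query[s * SUB_DIM:(s + 1) * SUB_DIM][None, :] - codebooks[s]) ** 2).sum(-1)
        for s in range(NUM_SUBSPACES)
    ])  # shape: (NUM_SUBSPACES, 256)
    return tables[np.arange(NUM_SUBSPACES)[None, :], pq_codes_on_gpu].sum(axis=1)

query = rng.standard_normal(DIM).astype(np.float32)
candidate_order = vector_ids_in_dram[np.argsort(approx_distances(query))]
print("top-10 candidates for exact re-ranking:", candidate_order[:10])
```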
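And below is a minimal sketch of the mini-batch re-ranking loop from items 2 and 3, again in plain Python with a NumPy array standing in for SSD reads. The mini-batch size, the page layout, and the "stop when a whole mini-batch fails to improve the top-k" rule are assumed for illustration; the paper's actual termination heuristic and SSD layout are more sophisticated.

```python
# Sketch of heuristic mini-batch re-ranking with simple redundancy-aware I/O merging.
import heapq
import numpy as np

rng = np.random.default_rng(1)
NUM_VECTORS, DIM = 10_000, 128
VECTORS_PER_PAGE = 32   # assumed number of raw vectors packed into one SSD page
MINI_BATCH = 64         # assumed re-ranking mini-batch size
TOP_K = 10

raw_vectors_on_ssd = rng.standard_normal((NUM_VECTORS, DIM)).astype(np.float32)
query = rng.standard_normal(DIM).astype(np.float32)
# In FusionANNS the GPU produces this approximate candidate ordering; here we fake it.
candidate_order = rng.permutation(NUM_VECTORS)[:2_000]

def read_pages(page_ids):
    """Stand-in for SSD reads: one read per unique page."""
    return {p: raw_vectors_on_ssd[p * VECTORS_PER_PAGE:(p + 1) * VECTORS_PER_PAGE]
            for p in page_ids}

def rerank(query, candidate_order):
    topk = []        # max-heap via negated distances: (-dist, vector_id)
    page_cache = {}  # pages persist across mini-batches, so repeated reads are merged
    for start in range(0, len(candidate_order), MINI_BATCH):
        batch = candidate_order[start:start + MINI_BATCH]
        # Redundancy-aware I/O: fetch each needed SSD page at most once,
        # both within this mini-batch and across earlier ones.
        needed = {int(v) // VECTORS_PER_PAGE for v in batch} - set(page_cache)
        page_cache.update(read_pages(needed))

        improved = False
        for vid in batch:
            page, offset = divmod(int(vid), VECTORS_PER_PAGE)
            dist = float(((query - page_cache[page][offset]) ** 2).sum())
            if len(topk) < TOP_K:
                heapq.heappush(topk, (-dist, int(vid)))
                improved = True
            elif dist < -topk[0][0]:
                heapq.heapreplace(topk, (-dist, int(vid)))
                improved = True
        # Heuristic termination: if a whole mini-batch failed to improve the current
        # top-k, assume the remaining, less promising candidates will not help either.
        if not improved:
            break
    return [(vid, -neg) for neg, vid in sorted(topk, reverse=True)]

print(rerank(query, candidate_order)[:5])
```

The point the sketch captures is that exact distances are computed only for candidates the approximate (GPU-side) pass already deemed promising, each SSD page is read at most once, and re-ranking stops as soon as further batches stop paying off.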

Strong Numerical Results

FusionANNS demonstrates substantial performance gains over existing systems, particularly SPANN and RUMMY:

  • Queries per Second (QPS) Improvement: FusionANNS achieves 9.4 to 13.1 times higher QPS than SPANN and 2 to 4.9 times higher QPS than RUMMY.
  • Cost Efficiency: The system also shows 5.7 to 8.8 times higher cost efficiency over SPANN and 2.3 to 6.8 times higher over RUMMY.

These gains stem from an architecture that minimizes costly I/O operations and exploits the high bandwidth of GPU memory while retaining the cost advantage of SSDs for bulk storage.

Implications and Future Directions

Practically, FusionANNS addresses the pressing need for scalable, cost-efficient ANNS systems in applications like AI-driven recommendation systems, search engines, and data mining. Theoretically, this paper provides a robust framework for multi-tiered indexing and CPU/GPU cooperation, paving the way for further research into optimizing such hybrid architectures for various data-intensive applications.

Looking ahead, this architecture could spur further integration of storage technologies, tighter memory management, and better algorithms for moving and processing data across heterogeneous computing environments. The cooperative processing approach might also be refined to handle even larger and more complex datasets, potentially influencing ANNS strategies in emerging areas such as real-time data analysis and ultra-large-scale recommendation systems.

Conclusion

FusionANNS represents a significant step towards overcoming the challenges posed by billion-scale vector datasets in ANNS services. Through strategic CPU/GPU collaboration and innovative indexing and deduplication techniques, it establishes a new benchmark in terms of performance and efficiency. While promising substantial immediate gains, it also opens up new avenues for research and optimization in AI infrastructure.