Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
47 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

WarpCore: A Library for fast Hash Tables on GPUs (2009.07914v2)

Published 16 Sep 2020 in cs.DC

Abstract: Hash tables are ubiquitous. Properties such as an amortized constant time complexity for insertion and querying as well as a compact memory layout make them versatile associative data structures with manifold applications. The rapidly growing amount of data emerging in many fields motivated the need for accelerated hash tables designed for modern parallel architectures. In this work, we exploit the fast memory interface of modern GPUs together with a parallel hashing scheme tailored to improve global memory access patterns, to design WarpCore -- a versatile library of hash table data structures. Unique device-sided operations allow for building high performance data processing pipelines entirely on the GPU. Our implementation achieves up to 1.6 billion inserts and up to 4.3 billion retrievals per second on a single GV100 GPU thereby outperforming the state-of-the-art solutions cuDPP, SlabHash, and NVIDIA RAPIDS cuDF. This performance advantage becomes even more pronounced for high load factors of over $90\%$. To overcome the memory limitation of a single GPU, we scale our approach over a dense NVLink topology which gives us close-to-optimal weak scaling on DGX servers. We further show how WarpCore can be used for accelerating a real world bioinformatics application (metagenomic classification) with speedups of over two orders-of-magnitude against state-of-the-art CPU-based solutions. WC is written in C++/CUDA-C and is openly available at https://github.com/sleeepyjack/warpcore.

Citations (15)

Summary

  • The paper introduces WarpCore, a GPU-accelerated library that optimizes hash table operations through innovative probing techniques.
  • It employs a cooperative probing scheme and modular design to achieve up to 1.6 billion inserts and 4.3 billion retrievals per second on a single GV100 GPU.
  • The work demonstrates scalable, high-performance solutions for dense workloads, significantly accelerating tasks like metagenomic classification.

Overview of WarpCore: A Library for Fast Hash Tables on GPUs

The paper introduces WarpCore, a high-performance library designed for leveraging the computational power of modern GPUs to accelerate hash table operations. Hash tables are essential data structures frequently utilized for their efficient storage and retrieval capabilities, particularly valued for their expected constant-time complexity. However, traditional CPU-based implementations are often hampered by inefficient memory access patterns and the dynamic nature of modern data environments, which necessitate sophisticated, scalable solutions that efficiently harness modern hardware capabilities.

Technical Contributions and Performance

WarpCore distinguishes itself by its utilization of the GPU's fast memory interfaces alongside a parallel hashing scheme optimized for improving global memory access patterns. This strategy enables WarpCore to outperform existing solutions like cuDPP, SlabHash, and NVIDIA RAPIDS cuDF significantly in both throughput and memory efficiency. Notably, on a single GV100 GPU, WarpCore achieves an impressive 1.6 billion inserts and 4.3 billion retrievals per second, with performance advantages becoming pronounced at high load factors exceeding 90%.

WarpCore supports both single-value and multi-value hash tables through a series of innovative GPU-based data structures and algorithms. The introduced cooperative probing scheme (COPS) enhances the efficiency of these operations by aligning with the SIMT (Single Instruction Multiple Threads) execution model of GPUs, ensuring data locality, and optimizing the probing sequences.

Innovations and Design Approaches

  1. Modular Framework: WarpCore offers versatile components that allow for the customization of memory layouts, hashing functions, and probing schemes. It supports both the Array of Structs (AOS) and Struct of Arrays (SOA) layouts, capitalizing on the strengths of each for various data handling scenarios.
  2. Advanced Hashing Techniques: WarpCore's use of open addressing and the cooperative probing scheme (COPS) aids in the swift handling of hashing operations. The probing scheme strategically mixes linear probing internally with double hashing externally, reducing primary clustering effects and improving probing efficiency especially at high load factors.
  3. Support for Concurrent Execution: By employing host-sided and device-sided interface functions which accommodate both bulk and individual operations, WarpCore allows for highly parallel processing and concurrent execution, critical for maintaining speed and efficiency in data-intensive applications.
  4. Scalability Across GPUs: The library adeptly handles scalability constraints by providing mechanisms to distribute data across multiple GPUs connected by NVLink, achieving near-optimal weak scaling on systems like DGX servers. This scalability is crucial for applications like bioinformatics where datasets are often too large to fit into a single GPU.

Application and Implications

Among the highlighted applications, WarpCore significantly accelerates metagenomic classification tasks. This showcases its practical value in fields like bioinformatics where the rapid handling and processing of massive genomic datasets are paramount. WarpCore delivers over a two-orders-of-magnitude speedup against state-of-the-art CPU-based solutions in constructing genomic databases.

Potential Future Directions

The paper alludes to potential extensions to distributed GPU clusters, which could further amplify WarpCore's ability to handle even larger datasets. Additionally, while the current implementation and optimization are focused on NVIDIA GPUs, the foundational concepts demonstrated could be adapted for other accelerator architectures, such as FPGAs, broadening the applicability of the research.

Overall, WarpCore is a significant step towards addressing the limitations of traditional hash table performance on GPUs, offering a robust, scalable solution tailored to modern demands for speed and efficiency in processing large datasets. The paper provides an essential foundation for future research and development in GPU-accelerated data structures and algorithms.

Github Logo Streamline Icon: https://streamlinehq.com

HackerNews