
GPU Bloom Index Algorithm

Updated 23 February 2026
  • GPU Bloom Index Algorithm is an approximate set-membership filter that leverages GPU-specific optimizations like parallel hashing and coalesced accesses for efficient filtering.
  • It minimizes expensive exhaustive queries and synchronization by discarding up to 95.8% of unsuitable candidates, achieving high throughput on thousands of cores.
  • By tuning parameters such as the number of hash functions and memory layout, it boosts kernel selection and autotuning performance in large-scale computing environments.

A GPU Bloom Index Algorithm is an approximate set-membership data structure optimized for graphics processing units and applied as a large-scale filter or policy index for kernel selection and autotuning. This scheme leverages GPU-specific memory layouts, parallel hashing, and coalesced access patterns to maximize throughput and minimize latency, commonly eliminating the need for expensive exhaustive queries and CPU/GPU synchronization. Recent implementations for tasks such as matrix multiplication autotuning and bulk high-throughput data analytics demonstrate how Bloom filters can efficiently discard up to 95.8% of unsuitable candidates, approach cache or DRAM bandwidth limits, and scale across tens of thousands of cores (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).

1. Data Structures and Memory Organization

GPU Bloom index algorithms use a bit-array segmented into fixed-size blocks, which are further decomposed into machine words (typically 64 bits) to enable vectorized and coalesced GPU memory access. In scheduling applications such as Stream-K++, each candidate policy has its own Bloom filter, sized for its expected occupancy: $m$ bits in total, up to $n$ inserted keys ($n \approx 10^4$), and $k$ hash functions ($k = 7$ for Stream-K++). Filters are aligned to at least 128 B or 256 B boundaries so that a warp's accesses coalesce into a minimal number of global memory transactions. The following table summarizes the organization for a representative Stream-K++ deployment (Sadasivan et al., 2024):

| Parameter | Typical value | Usage |
|---|---|---|
| $m$ (bits/filter) | $10^5$ (≈12.2 KiB) | Capacity, error-rate tuning |
| $k$ (hashes/key) | $7$ | Bit-probe trade-off |
| $n$ (inserted keys) | $10^4$ | Maximum expected set size |
| Words/filter | $\lceil m/64 \rceil$ | 64-bit word granularity |
| Policies | $7$ | One filter per policy |

This layout allows multiple filters to operate in parallel, with bit-array words for all filters laid out contiguously to harness warp-level coalesced memory access. In broader GPU Bloom filter contexts, global bit arrays are sectorized into blocks of $B = m/b$ bits, each subdivided into $s = B/S$ words of size $S$ (usually 64 bits), and aligned to the minimum DRAM sector (32 B or 64 B) boundary for efficient bulk transfer (Jünger et al., 17 Dec 2025).
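As a rough illustration of the layout arithmetic above, the following host-side C++ sketch computes the number of 64-bit words per filter, pads each filter to an alignment boundary, and derives the word offset of a given policy's filter inside one contiguous allocation. The names `FilterLayout`, `make_layout`, `filter_offset_words`, and `total_bytes` are hypothetical, and the 128-byte default padding is one possible reading of the alignment requirement rather than a detail confirmed by the cited papers.

```cpp
#include <cstdint>

// Hypothetical description of how per-policy filters sit in one buffer.
struct FilterLayout {
    std::uint64_t bits_per_filter;   // m
    std::uint64_t words_per_filter;  // ceil(m / 64)
    std::uint64_t stride_words;      // words_per_filter padded to the alignment
    std::uint32_t num_policies;
};

// align_bytes is assumed to be a multiple of 8 (one 64-bit word).
inline FilterLayout make_layout(std::uint64_t m_bits, std::uint32_t num_policies,
                                std::uint64_t align_bytes = 128) {
    const std::uint64_t words = (m_bits + 63) / 64;         // round up to whole words
    const std::uint64_t words_per_align = align_bytes / 8;  // 16 words per 128 B
    const std::uint64_t stride =
        (words + words_per_align - 1) / words_per_align * words_per_align;
    return FilterLayout{m_bits, words, stride, num_policies};
}

// Word offset of the first word of filter `policy` in the shared buffer.
inline std::uint64_t filter_offset_words(const FilterLayout& l, std::uint32_t policy) {
    return static_cast<std::uint64_t>(policy) * l.stride_words;
}

// Total buffer size in bytes for all filters.
inline std::uint64_t total_bytes(const FilterLayout& l) {
    return l.stride_words * l.num_policies * 8;
}
```

For $m = 10^5$ and seven policies, `make_layout(100000, 7)` gives 1563 words per filter, a padded stride of 1568 words, and 87,808 bytes in total.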

2. Hash Function Schemes and Parallelization

Bloom filter GPU designs employ hash function schemes tailored for high-parallel-throughput hardware. Stream-K++ uses a MurmurHash3_x64_128-style mix to yield two 64-bit values $(h_1, h_2)$, applying double hashing:

$$\text{idx}_i = (h_1 + i \cdot h_2) \bmod m, \qquad i \in [0, k-1]$$

This method requires only a single base hash computation per key, minimizing compute and memory indirection. Other GPU designs compute a base hash with a function such as xxHash64 and then select $k$ positions by multiplying with precomputed "salts" (odd constants embedded at compile time), which eliminates any runtime salt memory load (Jünger et al., 17 Dec 2025). All threads in a warp can compute probe indices for distinct keys in parallel or, under vectorization, cooperate to probe and update adjacent words of a block in a register- or memory-coalesced fashion.
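The double-hashing step can be sketched in plain C++ as follows. This is a minimal illustration that takes the two base hashes as inputs and does not reproduce the Murmur3 mix; the function name `probe_indices` and the constant `K` are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr int K = 7;  // probes per key (k = 7 in the Stream-K++ setting)

// Derive K bit positions in [0, m_bits) from two base hashes:
// idx_i = (h1 + i * h2) mod m. The sum wraps modulo 2^64 before the reduction.
inline std::array<std::uint64_t, K> probe_indices(std::uint64_t h1, std::uint64_t h2,
                                                  std::uint64_t m_bits) {
    std::array<std::uint64_t, K> idx{};
    for (int i = 0; i < K; ++i) {
        idx[i] = (h1 + static_cast<std::uint64_t>(i) * h2) % m_bits;
    }
    return idx;
}
```

One design point worth noting: if $h_2$ is a multiple of $m$, all $k$ probes collapse onto the same bit, so an implementation may want to guard against that case, for example by forcing $h_2$ to be odd when $m$ is even.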

In Stream-K++, when a filter's bit-array is smaller than $64$ KiB, it can be placed in the constant cache for lower-latency broadcast; larger filters rely on the L2 cache and memory coalescing to amortize DRAM latency.

3. GPU Kernel Implementations and Coalesced Access Patterns

Insert and query operations are implemented as CUDA kernels, exploiting warp-level parallelism and atomic operations. The prototypical kernels operate as follows (Sadasivan et al., 2024):

  • Insert: Each thread computes $(h_1, h_2)$ for its key, derives $k$ bit indices for the relevant filter, and performs atomic OR operations to set these bits.
  • Query: Each thread computes bit-probe indices for each candidate filter; if all $k$ bits are set, membership is possible; if any is clear, the candidate is excluded.
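The insert and query semantics above can be modeled sequentially on the host. The sketch below is an assumption-laden reference model, not the Stream-K++ implementation: the class name `BloomRef` is hypothetical, and `mix64` is a SplitMix64-style stand-in for the Murmur3 mix.

```cpp
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (SplitMix64 finalizer); NOT the Murmur3 mix used by Stream-K++.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Sequential reference model of one Bloom filter (hypothetical class name).
class BloomRef {
public:
    BloomRef(std::uint64_t m_bits, int k)
        : m_(m_bits), k_(k), words_((m_bits + 63) / 64, 0) {}

    // Insert: set the k probed bits (a GPU kernel would use atomic OR instead of |=).
    void insert(std::uint64_t key) {
        const std::uint64_t h1 = mix64(key), h2 = mix64(h1) | 1;  // force h2 odd
        for (int i = 0; i < k_; ++i) {
            const std::uint64_t idx = (h1 + i * h2) % m_;
            words_[idx >> 6] |= 1ULL << (idx & 63);
        }
    }

    // Query: false means "definitely absent"; true means "possibly present".
    bool maybe_contains(std::uint64_t key) const {
        const std::uint64_t h1 = mix64(key), h2 = mix64(h1) | 1;
        for (int i = 0; i < k_; ++i) {
            const std::uint64_t idx = (h1 + i * h2) % m_;
            if ((words_[idx >> 6] & (1ULL << (idx & 63))) == 0) return false;
        }
        return true;
    }

private:
    std::uint64_t m_;
    int k_;
    std::vector<std::uint64_t> words_;
};
```

In a policy index, one such filter would exist per policy; a key is inserted into the filter of the policy it belongs to, and a query probes every filter to collect the policies that cannot be excluded.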

Warps are organized such that each thread operates on a distinct key, and all threads in a warp can access adjacent bits/words, yielding full coalescence. For vectorized implementations, horizontal ($\Theta$) and vertical ($\Phi$) vectorization dimensions control thread cooperation and bulk word processing (e.g., multiple words loaded in a single vector instruction). Horizontal vectorization allows all threads in a cooperation group to update (insert) or test (lookup) adjacent words, maximizing atomic throughput and load/store efficiency. Mask generation and bit-probe tests are typically branchless, employing template-unrolled multiplicative hashing and register pooling for hash sharing within sub-warp groups (Jünger et al., 17 Dec 2025).

A sketch of a GPU insert kernel for policy filtering is given below:

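#include <cstdint>

// Filter geometry from the table above; K is the number of hash functions.
constexpr int      K = 7;
constexpr uint64_t BITS_PER_FILTER  = 100000;
constexpr uint64_t WORDS_PER_FILTER = (BITS_PER_FILTER + 63) / 64;

// Assumed to be defined elsewhere: 128-bit Murmur3-style mix of one key.
__device__ void murmur3_x64_128(uint64_t key, uint64_t *h1, uint64_t *h2);

// One thread per key. policy_ids[tid] is an assumed input that selects the
// filter the key belongs to, so each key is set only in its own policy's filter.
__global__
void BloomInsert(const uint64_t *keys, const int *policy_ids, int num_keys,
                 unsigned long long *bloom_bits) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_keys) return;
  uint64_t h1, h2;
  murmur3_x64_128(keys[tid], &h1, &h2);
  unsigned long long *filter = bloom_bits + policy_ids[tid] * WORDS_PER_FILTER;
  for (int i = 0; i < K; ++i) {
    uint64_t idx  = (h1 + i * h2) % BITS_PER_FILTER;  // double hashing
    uint64_t word = idx >> 6;                         // 64-bit word index
    unsigned long long mask = 1ULL << (idx & 63);     // bit within the word
    atomicOr(&filter[word], mask);                    // atomicOr takes unsigned long long*
  }
}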

High lookup throughput (~3M lookups/s on MI250X) and insert throughput ($11.35\times$ baseline for bulk construction on B200 GPUs) are achieved even for DRAM-resident filters when data-level and code-level coalescing principles are honored (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).

4. False-Positive Rate Analysis and Parameter Selection

The false-positive rate $P_{fp}$ for Bloom filters is central to both theoretical design and empirical tuning, dictated by the classical formula:

$$P_{fp} \approx \left(1 - e^{-k n / m}\right)^k$$

The optimal number of hash functions is $k_{\mathrm{opt}} = (m/n)\ln 2$. For a target false-positive rate $\epsilon$ and set size $n$, an approximate solution for $m$ is:

$$m \approx -\frac{n k}{\ln\left(1 - \epsilon^{1/k}\right)}$$

In Stream-K++, $n = 10^4$, $m = 10^5$, and $k = 7$ give $P_{fp} \approx 0.82\%$ under this formula, balancing candidate set reduction against filter size. The observed elimination of unnecessary kernel trials (up to 95.8% of unsuited policies) is consistent with these estimates, and negative answers are always correct, since a Bloom filter produces no false negatives (Sadasivan et al., 2024). The parameters $m$, $k$, and $n$ should be tuned per deployment scenario. For GPU-optimized filters, settings such as $k = 16$ and $c = m/n$ in the range $8$–$12$ provide $P_{fp} \lesssim 1\%$ at practical memory footprints (Jünger et al., 17 Dec 2025).
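The three formulas in this section are easy to evaluate directly. The following C++ helpers are a small sketch for doing so; the function names are hypothetical, and the formulas are the classical (non-blocked) Bloom filter estimates, which may differ from the behavior of blocked or sectorized variants.

```cpp
#include <cmath>

// Classical estimate: P_fp ~ (1 - exp(-k n / m))^k.
inline double bloom_fpr(double n, double m, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Hash count minimizing the estimate: k_opt = (m / n) ln 2.
inline double bloom_k_opt(double n, double m) {
    return (m / n) * std::log(2.0);
}

// Bits needed for a target rate eps at a fixed k: m ~ -n k / ln(1 - eps^(1/k)).
inline double bloom_bits_for(double n, double k, double eps) {
    return -n * k / std::log(1.0 - std::pow(eps, 1.0 / k));
}
```

With $n = 10^4$ and $m = 10^5$, `bloom_k_opt` returns about 6.93, which rounds to $k = 7$, and `bloom_fpr(1e4, 1e5, 7)` evaluates to roughly 0.0082.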

5. Performance Models and Optimization Dimensions

Performance is modeled as a function of hash computation cycles ($T_{\text{hash}}$) and memory access cycles ($T_{\text{mem}}$):

$$\text{Throughput} \approx \frac{Q}{T_{\text{hash}} + T_{\text{mem}}}$$

where $Q$ is the number of simultaneous queries. For example, on MI250X, uncached memory fetches cost about 600 cycles, L2 hits 100–200 cycles, and Murmur-based hash computation 50–100 cycles. Aggregate end-to-end latency for a Bloom query typically falls in the 300–500 cycle ($0.2$–$0.3\,\mu\mathrm{s}$) range, with about $1000$ warps in flight, topping $3$ million queries per second in optimal regimes (Sadasivan et al., 2024).

GPU filter throughput is maximized along three orthogonal axes (Jünger et al., 17 Dec 2025):

  • Vertical vectorization ($\Phi$): Multiple words are fetched by threads via single wide loads, increasing ILP but constrained by register pressure.
  • Horizontal vectorization ($\Theta$) and thread cooperation: Threads collaborate to process blocks; for insertions, $\Theta = s$ maximizes atomic coalescing.
  • Compute-latency minimization: Branchless multiplicative hashing and compile-time salt inlining reduce per-key cycles; adaptive warp/sub-warp cooperation minimizes redundant calculation.
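To make the blocked layout and salt-based probing more concrete, here is a sequential C++ sketch of a blocked filter with $B = 256$ bits per block and $s = 4$ words. It is an illustration under stated assumptions, not the implementation from the cited paper: the probe count `kProbes`, the salt values, the block-selection rule, and all type and function names are hypothetical, and the caller is assumed to supply an already-computed 64-bit hash (for example from xxHash64).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWordsPerBlock = 4;  // s = B / S = 256 / 64
constexpr int kProbes = 8;         // illustrative k, not taken from the paper
// Illustrative odd multipliers fixed at compile time (values are arbitrary).
constexpr std::array<std::uint64_t, kProbes> kSalts = {
    0x9e3779b97f4a7c15ULL, 0xbf58476d1ce4e5b9ULL, 0x94d049bb133111ebULL,
    0xd6e8feb86659fd93ULL, 0xff51afd7ed558ccdULL, 0xc4ceb9fe1a85ec53ULL,
    0x2545f4914f6cdd1dULL, 0x9fb21c651e98df25ULL};

using Block = std::array<std::uint64_t, kWordsPerBlock>;

// Build the per-key mask for one block. Each probe takes the top 8 bits of
// hash * salt as a bit position in 0..255, with no data-dependent branches.
inline Block make_mask(std::uint64_t hash) {
    Block mask{};
    for (int i = 0; i < kProbes; ++i) {
        const std::uint64_t bit = (hash * kSalts[i]) >> 56;
        mask[bit >> 6] |= 1ULL << (bit & 63);
    }
    return mask;
}

struct BlockedBloom {
    std::vector<Block> blocks;  // value-initialized to all zero bits
    explicit BlockedBloom(std::size_t num_blocks) : blocks(num_blocks) {}

    void insert(std::uint64_t hash) {
        Block& b = blocks[hash % blocks.size()];
        const Block m = make_mask(hash);
        for (int w = 0; w < kWordsPerBlock; ++w) b[w] |= m[w];  // atomic OR on a GPU
    }

    bool maybe_contains(std::uint64_t hash) const {
        const Block& b = blocks[hash % blocks.size()];
        const Block m = make_mask(hash);
        std::uint64_t missing = 0;  // accumulates mask bits that are not set
        for (int w = 0; w < kWordsPerBlock; ++w) missing |= m[w] & ~b[w];
        return missing == 0;
    }
};
```

The loops over `kWordsPerBlock` are the part that cooperating threads would split up under horizontal vectorization, with one word per thread when $\Theta = s$; a single thread issuing one wide load for all four words would correspond to vertical vectorization.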

Representative optimization impacts for a block size of $B = 256$ bits (Jünger et al., 17 Dec 2025):

| Stage | Lookup (L2-resident) | Insert (DRAM) |
|---|---|---|
| CBF baseline | 1.0× | 1.0× |
| → SBF (blocked) | 3.2× | 4.5× |
| → +multiplicative hashing | 1.7× | 1.4× |
| → +horizontal vectorization | 1.0× ($\Theta = 1 \to 2$) | 1.7× |
| → +adaptive cooperation | 1.1× | 1.2× |
| Total over baseline | ≈9.5× | ≈11.6× |

Achieved filter lookup throughputs can approach $92\%$ of bandwidth-based speed-of-light for cache-resident cases and maintain double-digit speedups in bulk DRAM scenarios (Jünger et al., 17 Dec 2025).

6. Empirical Results and Tuning Guidelines

In Stream-K++, lookup of the Bloom filter on the CPU costs approximately $0.4\,\mu\mathrm{s}$ per key (Murmur3 + seven bit-probes). GEMM-tuning overhead decreases nearly $5\times$, as only $4.2\%$ of policy candidates survive the initial Bloom filtering step. End-to-end, Stream-K++ results in performance improvements up to $43\%$ compared to non-Bloom baselines and is within $20\%$ of globally optimal performance for $60$–$97.6\%$ of problem sizes (Sadasivan et al., 2024).

Tuning recommendations for modern GPU deployments include (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025):

  • Use $m \approx (8$–$12)\,n$ bits and $k = 6$–$8$ for $P_{fp}$ on the order of $1\%$ (about 9.6 bits per key are needed to reach $1\%$ at the optimal $k$).
  • Align per-policy bit-arrays to at least $128$ bytes for warp-aligned loads.
  • Keep $k$ low ($\leq 8$) to minimize per-query work.
  • For filters under $64$ KiB, leverage constant memory/L1; otherwise, exploit L2/DRAM cache line batching.
  • Optimize blockDim.x for warp-aligned key/word assignment to maximize coalescing.
  • Pre-warm filter bit-arrays (prefetch) for DRAM-resident scenarios to minimize cold-miss penalties.
  • For large bulk queries, autotune parameters $(\Theta, \Phi)$ per deployment to maximize occupancy and prevent register spill.
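Two of these guidelines, sizing by bits per key and choosing a memory tier by the $64$ KiB threshold, can be combined into a small planning helper. The sketch below is only one way to express them; `FilterPlan`, `Residency`, and `plan_filter` are hypothetical names, and the tier labels simply mirror the wording of the list above.

```cpp
#include <cstdint>

enum class Residency { ConstantOrL1, L2OrDram };

struct FilterPlan {
    std::uint64_t bits;   // m after rounding up to the alignment
    std::uint64_t bytes;  // m / 8
    Residency residency;  // suggested memory tier for the bit-array
};

// bits_per_key is c = m / n (8 to 12 in the guidelines above).
// align_bytes is the per-filter alignment (128 bytes by default).
inline FilterPlan plan_filter(std::uint64_t n_keys, std::uint64_t bits_per_key,
                              std::uint64_t align_bytes = 128) {
    const std::uint64_t raw_bytes = (n_keys * bits_per_key + 7) / 8;
    const std::uint64_t bytes =
        (raw_bytes + align_bytes - 1) / align_bytes * align_bytes;
    const Residency r =
        bytes < 64 * 1024 ? Residency::ConstantOrL1 : Residency::L2OrDram;
    return FilterPlan{bytes * 8, bytes, r};
}
```

For example, `plan_filter(10000, 10)` yields a 12,544-byte filter in the small tier, while `plan_filter(100000, 10)` yields 125,056 bytes and falls into the L2/DRAM tier.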

7. Broader Applicability and Domain Extensions

The GPU Bloom index pattern—compact bit-arrays, multi-hash probing, and coalesced parallel access—is broadly applicable beyond kernel scheduling. Identified extensions include:

  • Automatic selection and indexing for convolution kernels in deep learning frameworks.
  • Tuning parameter space reduction in graph analytics and community-detection hyperparameter search.
  • Approximate-nearest-neighbor prefiltering in high-dimensional similarity search.
  • Prefiltering invalid memory-layout transformation candidates in program compilers.
  • Join-key filtering in database engines to accelerate join operations.

Moving filter construction and lookup entirely onto the GPU enables completely asynchronous, hardware-accelerated indexing, eliminating the need for CPU-GPU round-trips and enabling real-time filtering in large-scale autotuning and data analytic pipelines (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).
