GPU Bloom Index Algorithm

Updated 23 February 2026

The GPU Bloom Index Algorithm is an approximate set-membership filter that uses GPU-specific optimizations such as parallel hashing and coalesced memory access.
It avoids expensive exhaustive queries and CPU/GPU synchronization by discarding up to 95.8% of unsuitable candidates, and sustains high throughput across thousands of cores.
Tuning parameters such as the number of hash functions and the memory layout improves kernel selection and autotuning performance in large-scale computing environments.

A GPU Bloom Index Algorithm is an approximate set-membership data structure optimized for graphics processing units and applied as a large-scale filter or policy index for kernel selection and autotuning. This scheme leverages GPU-specific memory layouts, parallel hashing, and coalesced access patterns to maximize throughput and minimize latency, commonly eliminating the need for expensive exhaustive queries and CPU/GPU synchronization. Recent implementations for tasks such as matrix multiplication autotuning and bulk high-throughput data analytics demonstrate how Bloom filters can efficiently discard up to 95.8% of unsuitable candidates, approach cache or DRAM bandwidth limits, and scale across tens of thousands of cores (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).

1. Data Structures and Memory Organization

GPU Bloom index algorithms use a bit array segmented into fixed-size blocks, which are further decomposed into machine words (typically 64 bits) to support vectorized and coalesced GPU memory access. In scheduling applications such as Stream-K++, each candidate policy has its own Bloom filter, provisioned with $m$ bits in total for up to $n$ inserted keys ( $n\approx 10^4$ ) and $k$ hash functions ( $k=7$ for Stream-K++). Filters are aligned to at least 128-byte or 256-byte boundaries so that a warp's accesses coalesce into a minimal number of global memory transactions. The following table summarizes the organization of a representative Stream-K++ deployment (Sadasivan et al., 2024):

Parameter	Typical Value	Usage
$m$ (bits/filter)	$10^5$ (12.5 kB ≈ 12.2 KiB)	Capacity, error rate tuning
$k$ (hashes/key)	$7$	Bit-probe trade-off
$n$ (inserted keys)	$10^4$	Maximum expected set size
Words/filter	$m/64$	64-bit word granularity
Policies	$7$	One filter per policy

This layout allows multiple filters to operate in parallel, with the bit-array words of all filters laid out contiguously to exploit warp-level coalesced memory access. In broader GPU Bloom filter designs, the global bit array is sectorized into $b$ blocks of $B = m/b$ bits, each subdivided into $s=B/S$ words of $S$ bits (usually 64), and aligned to the minimum DRAM sector boundary (32B or 64B) for efficient bulk transfer (Jünger et al., 17 Dec 2025).
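The per-policy layout can be illustrated with a small host-side C++ sketch. The names `FilterLayout`, `make_layout`, `word_index`, and `bit_mask` are hypothetical and do not come from either cited work; the sketch assumes the base allocation itself is aligned and that the alignment is a multiple of 8 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout descriptor; names are illustrative, not from the source.
struct FilterLayout {
    std::uint64_t bits_per_filter;   // m
    std::size_t   words_per_filter;  // 64-bit words per filter, including padding
};

// Pad each filter to a whole number of `align_bytes` chunks so that every
// filter starts on an aligned boundary when filters are stored back to back.
inline FilterLayout make_layout(std::uint64_t m_bits, std::size_t align_bytes = 128) {
    const std::size_t bytes  = static_cast<std::size_t>((m_bits + 7) / 8);
    const std::size_t padded = (bytes + align_bytes - 1) / align_bytes * align_bytes;
    return {m_bits, padded / sizeof(std::uint64_t)};
}

// Index of the 64-bit word that holds bit `idx` of the filter for `policy`.
inline std::size_t word_index(const FilterLayout& l, std::size_t policy, std::uint64_t idx) {
    return policy * l.words_per_filter + static_cast<std::size_t>(idx >> 6);
}

// Single-bit mask selecting bit `idx` inside its word.
inline std::uint64_t bit_mask(std::uint64_t idx) {
    return std::uint64_t{1} << (idx & 63);
}
```

With $m = 10^5$ and 128-byte alignment, each filter is padded from 12,500 to 12,544 bytes, so consecutive policies start 1,568 words apart.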

2. Hash Function Schemes and Parallelization

Bloom filter GPU designs employ hash function schemes tailored for high-parallel-throughput hardware. Stream-K++ uses a MurmurHash3_x64_128-style mix to yield two 64-bit values $(h_1,h_2)$ , applying double hashing:

$\text{for}\; i \in [0,k-1]:\quad \text{idx}_i = (h_1 + i\cdot h_2)\bmod m$

This method requires only a single base hash computation per key, minimizing compute and memory indirection. Other GPU designs compute a base hash with a function such as xxHash64 and then select $k$ positions by multiplying it with precomputed “salts” (odd constants embedded at compile time), which eliminates any runtime salt memory load (Jünger et al., 17 Dec 2025). All threads in a warp can compute probe indices for distinct keys in parallel or, under vectorization, cooperate to probe and update adjacent words of a block in a register- or memory-coalesced fashion.
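A minimal host-side C++ sketch of the double-hashing step follows. It is not the Stream-K++ implementation: `mix64` is the splitmix64 finalizer used here as a stand-in for MurmurHash3_x64_128, and `probe_indices` and `kNumHashes` are illustrative names. Forcing $h_2$ to be odd is an added assumption that keeps the stride from being zero.

```cpp
#include <array>
#include <cstdint>

constexpr int kNumHashes = 7;  // k, matching the Stream-K++ configuration

// Stand-in 64-bit mixer (the splitmix64 finalizer); NOT MurmurHash3.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Derive k probe positions in [0, m) from a single pair of base hashes.
inline std::array<std::uint64_t, kNumHashes> probe_indices(std::uint64_t key, std::uint64_t m) {
    const std::uint64_t h1 = mix64(key);
    const std::uint64_t h2 = mix64(key ^ 0x9e3779b97f4a7c15ULL) | 1;  // odd, so never zero
    std::array<std::uint64_t, kNumHashes> idx{};
    for (int i = 0; i < kNumHashes; ++i) {
        // The sum wraps modulo 2^64 before the reduction modulo m.
        idx[i] = (h1 + static_cast<std::uint64_t>(i) * h2) % m;
    }
    return idx;
}
```

Only the two calls to `mix64` depend on the key; every further probe costs one multiply-add and one modulo, which is the property the text attributes to double hashing.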

In Stream-K++, when a filter occupies less than 64 KiB, its bit array can be placed in the constant cache for lower-latency broadcast; larger filters rely on the L2 cache and memory coalescing to amortize DRAM latency.

3. GPU Kernel Implementations and Coalesced Access Patterns

Insert and query operations are implemented as CUDA kernels, exploiting warp-level parallelism and atomic operations. The prototypical kernels operate as follows (Sadasivan et al., 2024):

Insert: Each thread computes $(h_1, h_2)$ for its key, derives $k$ bit indices for the relevant filter, and performs atomic OR operations to set these bits.
Query: Each thread computes bit-probe indices for each candidate filter; if all $k$ bits are set, membership is possible; if any is clear, the candidate is excluded.

Warps are organized such that each thread operates on a distinct key, and all threads in a warp can access adjacent bits/words, yielding full coalescence. For vectorized implementations, horizontal ( $\Theta$ ) and vertical ( $\Phi$ ) vectorization dimensions control thread cooperation and bulk word processing (e.g., multiple words loaded in a single vector instruction). Horizontal vectorization allows all threads in a cooperation group to update (insert) or test (lookup) adjacent words, maximizing atomic throughput and load/store efficiency. Mask generation and bit-probe tests are typically branchless, employing template-unrolled multiplicative hashing and register pooling for hash sharing within sub-warp groups (Jünger et al., 17 Dec 2025).

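A CUDA insert kernel for policy filtering is sketched below (the hash helper murmur3_x64_128 is assumed to be defined elsewhere):

#include <cstdint>

constexpr int      NUM_POLICIES     = 7;       // one filter per policy
constexpr int      NUM_HASHES       = 7;       // k
constexpr uint64_t BITS_PER_FILTER  = 100000;  // m
constexpr uint64_t WORDS_PER_FILTER = (BITS_PER_FILTER + 63) / 64;

// Assumed helper: hashes a 64-bit key to two 64-bit values.
__device__ void murmur3_x64_128(uint64_t key, uint64_t *h1, uint64_t *h2);

__global__
void BloomInsert(const uint64_t *keys, int num_keys, uint64_t *bloom_bits) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_keys) return;  // guard the tail of the grid
  uint64_t h1, h2;
  murmur3_x64_128(keys[tid], &h1, &h2);
  // As written, every key goes into every policy's filter; a deployment
  // would restrict this loop to the policies associated with the key.
  for (int p = 0; p < NUM_POLICIES; ++p) {
    for (int i = 0; i < NUM_HASHES; ++i) {
      uint64_t idx  = (h1 + (uint64_t)i * h2) % BITS_PER_FILTER;  // double hashing
      uint64_t word = idx >> 6;            // 64-bit word holding the bit
      uint64_t mask = 1ULL << (idx & 63);  // bit position within the word
      // The 64-bit atomicOr overload takes unsigned long long operands.
      atomicOr(reinterpret_cast<unsigned long long *>(&bloom_bits[p * WORDS_PER_FILTER + word]),
               static_cast<unsigned long long>(mask));
    }
  }
}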
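The query step described earlier in this section has no listing in the source. A host-side C++ reference sketch is given here under stated assumptions: `bloom_insert`, `bloom_query`, and `base_hashes` are hypothetical names, the hash is a splitmix64-style stand-in rather than MurmurHash3, and insertion targets a single policy's filter, which is an assumption about how keys map to policies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int           kPolicies = 7;       // one filter per policy
constexpr int           kHashes   = 7;       // k
constexpr std::uint64_t kBits     = 100000;  // m, bits per filter
constexpr std::size_t   kWords    = (kBits + 63) / 64;

// Stand-in mixer (splitmix64 finalizer); NOT the hash used by the source.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

inline void base_hashes(std::uint64_t key, std::uint64_t& h1, std::uint64_t& h2) {
    h1 = mix64(key);
    h2 = mix64(key ^ 0x9e3779b97f4a7c15ULL) | 1;  // odd stride
}

// Set the k bits of `key` in the filter belonging to one policy.
inline void bloom_insert(std::vector<std::uint64_t>& bits, std::size_t policy, std::uint64_t key) {
    std::uint64_t h1, h2;
    base_hashes(key, h1, h2);
    for (int i = 0; i < kHashes; ++i) {
        const std::uint64_t idx = (h1 + static_cast<std::uint64_t>(i) * h2) % kBits;
        bits[policy * kWords + (idx >> 6)] |= std::uint64_t{1} << (idx & 63);
    }
}

// Return a bitmask with bit p set when policy p's filter may contain `key`.
inline unsigned bloom_query(const std::vector<std::uint64_t>& bits, std::uint64_t key) {
    std::uint64_t h1, h2;
    base_hashes(key, h1, h2);
    unsigned survivors = 0;
    for (std::size_t p = 0; p < kPolicies; ++p) {
        bool maybe = true;  // stays true only if all k bits are set
        for (int i = 0; i < kHashes && maybe; ++i) {
            const std::uint64_t idx = (h1 + static_cast<std::uint64_t>(i) * h2) % kBits;
            maybe = ((bits[p * kWords + (idx >> 6)] >> (idx & 63)) & 1) != 0;
        }
        if (maybe) survivors |= 1u << p;
    }
    return survivors;
}
```

A cleared bit excludes a policy immediately, so a policy that survives the query is only a candidate, while an excluded policy is excluded with certainty. A device version would replace the `|=` in `bloom_insert` with an atomic OR.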

High lookup throughput (~3M lookups/s on MI250X) and insert throughput ( $11.35\times$ baseline for bulk construction on B200 GPUs) are achieved even for DRAM-resident filters when both the data layout and the kernel code follow coalescing principles (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).

4. False-Positive Rate Analysis and Parameter Selection

The false-positive rate $P_{fp}$ for Bloom filters is central to both theoretical design and empirical tuning, dictated by the classical formula:

$P_{fp} \approx (1 - e^{-k n / m})^k$

The optimal number of hash functions is $k_{\mathrm{opt}} = (m/n)\ln 2$ . For a target false-positive rate $\epsilon$ and set size $n$ , an approximate solution for $m$ is:

$m \approx -\frac{n k}{\ln(1 - \epsilon^{1/k})}$

In Stream-K++, $n = 10^4$ , $m=10^5$ , $k = 7$ gives $c = m/n = 10$ bits per key and, by the formula above, $P_{fp} \approx (1 - e^{-0.7})^7 \approx 0.82\%$ , balancing candidate set reduction and filter size. The observed elimination of unnecessary kernel trials (up to 95.8% of unsuited policies) aligns with theoretical estimates, and a 100% true-negative rate is consistently maintained, meaning that a negative answer from the filter is never wrong (Sadasivan et al., 2024). The parameters $m$ , $k$ , and $n$ should be tuned per deployment scenario. For GPU-optimized filters, settings such as $k=16$ with $c$ in the range $8$–$12$ are reported to provide $P_{fp}\lesssim 1\%$ at practical memory footprints (Jünger et al., 17 Dec 2025).
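The three formulas in this section translate directly into code. The C++ helpers below are a minimal sketch with illustrative names (`false_positive_rate`, `optimal_hashes`, `bits_for_target`); they evaluate the classical estimate only and do not model blocked or sectorized filters, whose error rates differ.

```cpp
#include <cmath>

// Classical estimate: P_fp ~ (1 - e^{-k n / m})^k.
inline double false_positive_rate(double n, double m, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Optimal number of hash functions: k_opt = (m / n) ln 2.
inline double optimal_hashes(double n, double m) {
    return (m / n) * std::log(2.0);
}

// Bits needed for a target rate eps with k hashes:
// m ~ -n k / ln(1 - eps^{1/k}).
inline double bits_for_target(double n, double k, double eps) {
    return -n * k / std::log(1.0 - std::pow(eps, 1.0 / k));
}
```

For example, `optimal_hashes(1e4, 1e5)` evaluates to about 6.93, which is consistent with the choice $k = 7$, and `bits_for_target` inverts `false_positive_rate` for a fixed $k$.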

5. Performance Models and Optimization Dimensions

Performance is modeled as a function of hash computation cycles ( $T_{\text{hash}}$ ) and memory access cycles ( $T_{\text{mem}}$ ):

$\text{Throughput} \approx \frac{Q}{T_{\text{hash}} + T_{\text{mem}}}$

where $Q$ is the number of simultaneous queries. For example, on MI250X, uncached memory fetches cost about 600 cycles, L2 hits cost 100–200 cycles, and Murmur-based hash computation costs 50–100 cycles. Aggregate end-to-end latency for a Bloom query typically falls in the 300–500 cycle range ( $0.2$–$0.3\,\mu\mathrm{s}$ ), and with about $1000$ warps in flight, throughput tops $3$ million queries per second in optimal regimes (Sadasivan et al., 2024).

GPU filter throughput is maximized along three orthogonal axes (Jünger et al., 17 Dec 2025):

Vertical vectorization ( $\Phi$ ): Multiple words are fetched by threads via single wide loads, increasing ILP but constrained by register pressure.
Horizontal vectorization ( $\Theta$ ) and thread cooperation: Threads collaborate to process blocks; for insertions, $\Theta = s$ maximizes atomic coalescing.
Compute-latency minimization: Branchless multiplicative hashing and compile-time salt inlining reduce per-key cycles; adaptive warp/sub-warp cooperation minimizes redundant calculation.
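The blocked layout and compile-time salts behind these axes can be sketched on the host in C++. This is an illustration under assumptions, not the cited implementation: the salts are arbitrary odd constants, `BlockedBloom` and `make_mask` are hypothetical names, $k = 8$ is chosen for the example, and the caller is assumed to pass an already well-mixed base hash `h` (for instance an xxHash64 output).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWordsPerBlock = 4;  // s = B / S = 256 / 64
constexpr int kProbes        = 8;  // k (illustrative)

// Illustrative odd salts fixed at compile time; not the source's constants.
constexpr std::array<std::uint64_t, kProbes> kSalts = {
    0x9e3779b97f4a7c15ULL, 0xbf58476d1ce4e5b9ULL, 0x94d049bb133111ebULL, 0xd6e8feb86659fd93ULL,
    0xff51afd7ed558ccdULL, 0xc4ceb9fe1a85ec53ULL, 0x2545f4914f6cdd1dULL, 0xda942042e4dd58b5ULL};

using BlockMask = std::array<std::uint64_t, kWordsPerBlock>;

// Build the per-word bit pattern of a hashed key with no data-dependent branch.
inline BlockMask make_mask(std::uint64_t h) {
    BlockMask mask{};
    for (int i = 0; i < kProbes; ++i) {
        const std::uint64_t bit = (h * kSalts[i]) >> 56;  // top 8 bits: position in [0, 256)
        mask[bit >> 6] |= std::uint64_t{1} << (bit & 63);
    }
    return mask;
}

struct BlockedBloom {
    std::vector<std::uint64_t> words;
    std::size_t num_blocks;

    explicit BlockedBloom(std::size_t blocks)
        : words(blocks * kWordsPerBlock, 0), num_blocks(blocks) {}

    // All probes of a key land in one 256-bit block, i.e. one 32-byte sector.
    std::size_t block_base(std::uint64_t h) const {
        return static_cast<std::size_t>((h >> 32) % num_blocks) * kWordsPerBlock;
    }
    void insert(std::uint64_t h) {
        const BlockMask m = make_mask(h);
        const std::size_t base = block_base(h);
        for (int w = 0; w < kWordsPerBlock; ++w) words[base + w] |= m[w];
    }
    bool contains(std::uint64_t h) const {
        const BlockMask m = make_mask(h);
        const std::size_t base = block_base(h);
        std::uint64_t missing = 0;  // accumulates mask bits absent from the block
        for (int w = 0; w < kWordsPerBlock; ++w) missing |= m[w] & ~words[base + w];
        return missing == 0;
    }
};
```

On a GPU, the loop over the block's words is what horizontal vectorization distributes across cooperating threads, and the `|=` in `insert` becomes an atomic OR per word.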

Table of representative optimization impacts for a $B=256$ block size (Jünger et al., 17 Dec 2025):

Stage	Lookup (L2-resident)	Insert (DRAM)
CBF baseline	1.0×	1.0×
→ SBF (blocked)	3.2×	4.5×
→ +multiplicative hashing	1.7×	1.4×
→ +horizontal vectorization	1.0× (Θ=1→2)	1.7×
→ +adaptive cooperation	1.1×	1.2×
Total over baseline	≈9.5×	≈11.6×

Achieved filter lookup throughputs can approach $92\%$ of bandwidth-based speed-of-light for cache-resident cases and maintain double-digit speedups in bulk DRAM scenarios (Jünger et al., 17 Dec 2025).

6. Empirical Results and Tuning Guidelines

In Stream-K++, lookup of the Bloom filter on the CPU costs approximately $0.4\,\mu\mathrm{s}$ per key (Murmur3 + seven bit-probes). GEMM-tuning overhead decreases nearly $5\times$ , as only $4.2\%$ of policy candidates survive the initial Bloom filtering step. End-to-end, Stream-K++ results in performance improvements up to $43\%$ compared to non-Bloom baselines and is within $20\%$ of globally optimal performance for $60$–$97.6\%$ of problem sizes (Sadasivan et al., 2024).

Tuning recommendations for modern GPU deployments include (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025):

Use $m \approx (8$ –$12)n$ bits with $k=6$ –$8$; by the classical formula, $P_{fp}$ drops below $1\%$ from roughly $m \approx 10n$ upward.
Align per-policy bit-arrays to at least $128$ bytes for warp-aligned loads.
Keep $k$ low ( $\leq 8$ ) to minimize per-query work.
For filters under $64$ KiB, leverage constant memory/L1; otherwise, exploit L2/DRAM cache line batching.
Optimize blockDim.x for warp-aligned key/word assignment to maximize coalescing.
Pre-warm filter bit-arrays (prefetch) for DRAM-resident scenarios to minimize cold-miss penalties.
For large bulk queries, autotune parameters $(\Theta, \Phi)$ per deployment to maximize occupancy and prevent register spill.
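Several of these guidelines can be combined into a small sizing helper. The C++ sketch below is illustrative only: `FilterConfig`, `choose_config`, and `fits_constant_memory` are hypothetical names, the default of 10 bits per key is an assumption taken from the middle of the recommended range, and the 64 KiB limit is the total constant-memory size on CUDA devices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct FilterConfig {
    std::uint64_t bits;    // m, padded to the alignment
    int           hashes;  // k
};

// Choose m = bits_per_key * n rounded up to a multiple of `align_bytes`,
// and k = round((m / n) ln 2) clamped to [1, max_hashes]. Requires n > 0.
inline FilterConfig choose_config(std::uint64_t n, double bits_per_key = 10.0,
                                  std::uint64_t align_bytes = 128, int max_hashes = 8) {
    const std::uint64_t align_bits = align_bytes * 8;
    const std::uint64_t raw =
        static_cast<std::uint64_t>(std::ceil(bits_per_key * static_cast<double>(n)));
    const std::uint64_t m = (raw + align_bits - 1) / align_bits * align_bits;
    const double k_opt = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
    const int k = std::max(1, std::min(max_hashes, static_cast<int>(std::lround(k_opt))));
    return {m, k};
}

// True when `num_filters` filters of this size fit in 64 KiB of constant memory.
inline bool fits_constant_memory(const FilterConfig& c, int num_filters) {
    return c.bits / 8 * static_cast<std::uint64_t>(num_filters) <= 64u * 1024u;
}
```

For $n = 10^4$ the helper returns $m = 100{,}352$ bits and $k = 7$. Seven such filters total about 86 KiB, so under these assumptions only a subset would fit in constant memory at once.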

7. Broader Applicability and Domain Extensions

The GPU Bloom index pattern (compact bit-arrays, multi-hash probing, and coalesced parallel access) is broadly applicable beyond kernel scheduling. Identified extensions include:

Automatic selection and indexing for convolution kernels in deep learning frameworks.
Tuning parameter space reduction in graph analytics and community-detection hyperparameter search.
Approximate-nearest-neighbor prefiltering in high-dimensional similarity search.
Prefiltering invalid memory-layout transformation candidates in program compilers.
Join-key filtering in database engines to accelerate join operations.

Moving filter construction and lookup entirely onto the GPU enables completely asynchronous, hardware-accelerated indexing, eliminating the need for CPU-GPU round-trips and enabling real-time filtering in large-scale autotuning and data analytic pipelines (Sadasivan et al., 2024, Jünger et al., 17 Dec 2025).

References

Sadasivan et al. (2024). Stream-K++: Adaptive GPU GEMM Kernel Scheduling and Selection using Bloom Filters.

Jünger et al. (2025). Optimizing Bloom Filters for Modern GPU Architectures.
