Filtered-ANNS on GPUs
- Filtered-ANNS on GPUs is a family of algorithms and data structures that combines fast candidate selection with metadata filtering on GPU hardware.
- They employ GPU-specific techniques like label-centric indexing and memory coalescing to achieve scalable, low-latency performance in real-time applications.
- Empirical results highlight significant improvements in throughput and efficiency, outperforming traditional CPU-based methods for large-scale, filtered vector searches.
Filtered-ANNS on GPUs refers to a collection of algorithms, indexing techniques, and system-level optimizations that enable performant and scalable approximate nearest neighbor search (ANNS) under user-specified filtering constraints on GPU hardware. This encompasses methods that combine sublinear-time candidate selection with filtering by metadata, labels, or attribute ranges, leveraging the massive parallelism and memory systems unique to GPUs. Recent research demonstrates that, through tailored data structures and architecture-aware implementations, filtered-ANNS can reach million- or billion-scale query throughput while maintaining high recall and supporting real-time, online, and multi-label filtering in production-scale environments.
1. Problem Definition and Scope
Filtered-ANNS addresses the challenge of retrieving the top-$k$ nearest vectors to a query $q$ from a dataset $X$, subject to additional constraints: typically, the requirement that vectors match specific attribute predicates such as labels, numeric ranges, or multi-attribute conjunctions. Formally, the goal is to compute $\operatorname{top}\text{-}k_{\,x \in X,\; f(x)=1}\; d(q, x)$,
where $d(\cdot,\cdot)$ is the distance function and the filter predicate $f$ may specify single or multiple attribute constraints (e.g., $a(x) \in [l, u]$ for a range, or $t \in T(x)$ for label membership).
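As a concrete baseline for this definition, the following is a minimal NumPy sketch of exact filtered top-$k$ search by brute force (the function name `filtered_topk` and the single-label filter model are illustrative, not from any of the cited systems); real systems replace the linear scan with the sublinear indexes described below.

```python
import numpy as np

def filtered_topk(queries, vectors, labels, allowed, k):
    """Exact filtered top-k by brute force: restrict the dataset to
    vectors whose label satisfies the predicate (membership in
    `allowed`), then rank the survivors by Euclidean distance."""
    mask = np.isin(labels, list(allowed))
    idx = np.nonzero(mask)[0]
    if idx.size == 0:
        return [np.array([], dtype=int) for _ in queries]
    sub = vectors[idx]
    out = []
    for q in queries:
        d = np.linalg.norm(sub - q, axis=1)
        out.append(idx[np.argsort(d)[:k]])  # map back to global ids
    return out
```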
The field includes filter-aware graph-based methods, partition-based indexes, dual-structured label-centric approaches, and system-level GPU optimizations that handle both candidate search and filter verification in a high-throughput setting. Research also addresses online insertion, multi-stream execution, and memory-boundedness on accelerators.
2. Algorithmic Foundations and Data Structures
2.1 Label-centric Inverted File Indexing
VecFlow (2506.00812) proposes a dual-structured label-centric IVF index, partitioning each label’s posting list by specificity. High-specificity labels (frequent) use a GPU-friendly graph (IVF-Graph); low-specificity labels (rare) use an optimized brute-force search (IVF-BFS).
This approach supports both single- and multi-label queries (AND/OR), exploits redundancy-bypassing to minimize memory use, and fuses filtering with distance computation.
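The specificity-based routing can be illustrated with a small sketch, assuming a toy threshold and pure-Python posting lists (the names `SPECIFICITY_THRESHOLD`, `build_label_index`, and `search_label` are illustrative; VecFlow's actual IVF-Graph and IVF-BFS kernels are GPU implementations):

```python
import numpy as np

SPECIFICITY_THRESHOLD = 3  # assumed cutoff; the real system tunes this

def build_label_index(labels):
    """Label-centric posting lists, split by specificity: frequent
    labels would back a graph index (IVF-Graph), rare labels a
    brute-force list (IVF-BFS). Here both sides just store ids."""
    posting = {}
    for i, ls in enumerate(labels):
        for l in ls:
            posting.setdefault(l, []).append(i)
    return {l: ("graph" if len(v) >= SPECIFICITY_THRESHOLD else "bfs",
                np.array(v))
            for l, v in posting.items()}

def search_label(index, vectors, q, label, k):
    kind, ids = index[label]
    # A real system would dispatch on `kind`; this sketch always
    # brute-forces over the label's posting list.
    d = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(d)[:k]]
```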
2.2 Proximity Graphs with Filter Support
Systems such as UNIFY (2412.02448) and Filtered-DiskANN segment datasets by attribute (e.g., range bins or labels) and construct inclusive proximity graphs (SIG) ensuring that, for any segment or segment combination, the induced subgraph supports efficient hybrid search. The Hierarchical Segmented Inclusive Graph (HSIG) variant incorporates HNSW-like structure and skip lists to support all three filtering strategies:
- Pre-filtering: range filter before search.
- Hybrid filtering: extract subgraph for relevant segments and search only those.
- Post-filtering: filter after unconstrained search.
Proximity graphs for each label or segment are implemented as subgraphs with metadata for efficient edge masking and candidate filtering.
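The pre- versus post-filtering trade-off can be sketched for a numeric range predicate, assuming a flat array of attribute values and brute-force ranking in place of graph traversal (both function names and the `overfetch` parameter are illustrative):

```python
import numpy as np

def pre_filter_search(vectors, attrs, q, lo, hi, k):
    """Pre-filtering: apply the range predicate first, then search
    only the qualifying vectors (best when the range is narrow)."""
    ids = np.nonzero((attrs >= lo) & (attrs <= hi))[0]
    d = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(d)[:k]]

def post_filter_search(vectors, attrs, q, lo, hi, k, overfetch=4):
    """Post-filtering: run an unconstrained search with an enlarged
    candidate budget, then discard candidates outside the range
    (best when most of the dataset passes the filter)."""
    d = np.linalg.norm(vectors - q, axis=1)
    cand = np.argsort(d)[:k * overfetch]
    keep = cand[(attrs[cand] >= lo) & (attrs[cand] <= hi)]
    return keep[:k]
```

Hybrid filtering sits between the two: it traverses only the subgraph induced by the matching segments, avoiding both the full scan of pre-filtering and the wasted candidates of post-filtering.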
2.3 Partition-Based Indices
CAPS (2308.15014) structures the index with a first-stage vector clustering (e.g., KMeans) and a second-stage subpartitioning via Attribute Frequency Trees (AFT), enabling highly parallel scan/filter per cluster and subpartition. Such designs allow for efficient conjunctive queries and dynamic filter selectivity handling.
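A minimal sketch of this two-stage layout, assuming fixed centroids and a flat per-attribute dictionary standing in for the Attribute Frequency Tree (the names `build_caps_index` and `caps_query` are illustrative):

```python
import numpy as np
from collections import defaultdict

def build_caps_index(vectors, attrs, centroids):
    """Stage 1: assign each vector to its nearest centroid.
    Stage 2: subpartition each cluster by attribute value (a flat
    stand-in for the Attribute Frequency Tree)."""
    assign = np.argmin(
        np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2),
        axis=1)
    index = defaultdict(lambda: defaultdict(list))
    for i, (c, a) in enumerate(zip(assign, attrs)):
        index[c][a].append(i)
    return index

def caps_query(index, vectors, centroids, q, attr, k, nprobe=2):
    """Scan the `nprobe` closest clusters, touching only the
    subpartition that matches the attribute predicate."""
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = np.array([i for c in order for i in index[c].get(attr, [])],
                    dtype=int)
    if cand.size == 0:
        return cand
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]
```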
2.4 Filter-Accelerated Candidate Management
High-performance set membership and counting filters such as TCF and GQF (2212.09005) provide thread-safe duplicate removal, visit counting, and adaptive candidate pools during graph or partition-based search. These filters are designed to exploit GPU memory layouts, atomic operations, and bulk APIs for scalable candidate filtering in real-time.
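The role such filters play in candidate management can be illustrated with a toy counting filter (this single-threaded Python class is only a conceptual stand-in; the actual TCF/GQF designs pack fingerprints into GPU-resident slots and update them with atomic operations):

```python
class CountingFilter:
    """Toy stand-in for a GPU counting filter: tracks how often each
    candidate id has been seen so traversal can skip already-visited
    nodes. Hash collisions may overcount, never undercount, which is
    safe for duplicate suppression."""

    def __init__(self, slots=1024):
        self.slots = slots
        self.counts = [0] * slots

    def insert(self, key):
        self.counts[hash(key) % self.slots] += 1

    def count(self, key):
        return self.counts[hash(key) % self.slots]

    def seen(self, key):
        return self.count(key) > 0
```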
3. GPU Architecture-Aware Optimizations
3.1 Memory Layout and Access
Efficient filtered-ANNS on GPUs requires coalesced memory access and minimized bank conflicts. Innovations include:
- Custom 8-bit floating point (FP8) storage for PQ lookup tables (2301.06672), reducing shared-memory bank conflicts in IVFPQ lookups to $1$ (i.e., conflict-free access), thereby maximizing throughput with negligible recall penalty.
- Interleaved memory storage for small clusters (IVF-BFS) and persistent GPU kernels for streaming workloads (2506.00812).
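The effect of interleaving can be sketched in NumPy, assuming a warp-sized group of 32 lanes (the function name `interleave` is illustrative); the point is that after the transpose, lane $t$ of a warp reads element $j$ of vector $t$ from consecutive addresses, i.e., a coalesced transaction:

```python
import numpy as np

def interleave(cluster, group=32):
    """Reorder an (n, d) cluster so that, within each block of
    `group` consecutive vectors, storage is dimension-major: all
    first components together, then all second components, etc.
    Pads n up to a multiple of `group` with zeros."""
    n, d = cluster.shape
    pad = (-n) % group
    padded = np.vstack([cluster, np.zeros((pad, d), dtype=cluster.dtype)])
    # (n_groups, group, d) -> (n_groups, d, group)
    return padded.reshape(-1, group, d).transpose(0, 2, 1).copy()
```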
3.2 Execution Model
Systems such as RTAMS-GANNS (2408.02937) decouple search and insertion via multi-stream execution, allocating separate CUDA streams and resource pools for different operations. Vectors are managed in small, pointer-linked blocks supporting atomic parallel insertion, dynamic rearrangement (in-place defragmentation), and batch resource allocation.
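The block-linked storage idea can be sketched as follows, assuming a tiny block capacity for readability (the class `BlockList` is illustrative; in a CUDA implementation the tail cursor would be an atomic counter and blocks would be pooled, not allocated per insert):

```python
class BlockList:
    """Pointer-linked fixed-size blocks: insertion only touches the
    tail block's cursor, and a full block links to a fresh one, so
    readers can scan earlier blocks while writers append."""
    BLOCK = 4  # assumed capacity; real systems use larger blocks

    def __init__(self):
        self.blocks = [[]]

    def insert(self, vec_id):
        tail = self.blocks[-1]
        if len(tail) == self.BLOCK:   # block full: link a new one
            tail = []
            self.blocks.append(tail)
        tail.append(vec_id)

    def __iter__(self):
        for b in self.blocks:
            yield from b
```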
3.3 Redundancy-Bypassing and Graph Compaction
To avoid vector replication for the exponential number of label or attribute combinations, redundancy-bypassing is implemented (2506.00812): vectors are stored in a single global array, with local-to-global index mapping for each virtual per-label graph. Compacted adjacency lists and mapping tables ensure memory efficiency even at scale.
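A minimal sketch of this local-to-global mapping, with illustrative label names and a hypothetical `label_distance` helper: each virtual per-label graph addresses vectors by local id and resolves to the single global array only for the distance computation, so no vector is ever copied per label.

```python
import numpy as np

# Vectors stored exactly once in a global array.
vectors = np.random.default_rng(0).normal(size=(6, 4)).astype(np.float32)

# Per-label local->global index maps (illustrative labels).
local_to_global = {
    "red":  np.array([0, 2, 5]),
    "blue": np.array([1, 2, 3, 4]),
}

def label_distance(label, local_id, q):
    """Distance from q to the `local_id`-th vector of a label's
    virtual graph, resolved through the mapping table."""
    g = local_to_global[label][local_id]
    return float(np.linalg.norm(vectors[g] - q))
```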
4. Performance, Scalability, and Empirical Results
Filtered-ANNS on GPUs now routinely exceeds 1M QPS at high recall on practical datasets, as confirmed by VecFlow (2506.00812) (5M QPS at 90% recall on an A100), and maintains millisecond-scale latency under mixed online workloads (RTAMS-GANNS (2408.02937)). Notable findings include:
- Label-centric GPU designs outperform CPU-filtered graph approaches (e.g., Filtered-DiskANN), delivering substantially higher QPS in the same accuracy regime.
- Partition- and graph-based systems achieve strong scalability: CAPS builds indexes up to 10× smaller than graph-based methods, enabling all-GPU batch search at scale.
- Persistent kernels and interleaved memory layouts support both batch and streaming (single/small batch) workloads with linear hardware scaling.
- Near-linear speedup is achievable across multiple heterogeneous GPUs for the matrix-multiplication core routines (1511.04348), subject to load balancing and bandwidth management.
5. Filtered Query Semantics and Supported Workloads
Modern filtered-ANNS supports:
- Single-label, multi-label (AND/OR) queries: Each label maintains a posting list; queries are routed through smallest (most selective) list for AND, or through multiple merged lists for OR (2506.00812).
- Numeric range queries: SIG/HSIG (2412.02448) partitions vectors by range and supports hybrid strategies, with adaptive selection based on estimated result count.
- Arbitrary attribute predicates: Attribute metadata is indexed for fast filtering, with candidate verification occurring inline to vector distance calculation to maximize computation/communication overlap.
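The AND/OR routing described above can be sketched with plain posting lists (the functions `and_query` and `or_query` are illustrative; a GPU implementation fuses the membership checks into the distance kernel rather than materializing Python sets):

```python
import numpy as np

def and_query(posting, vectors, q, labels, k):
    """AND: route through the smallest (most selective) posting
    list, verifying the remaining labels inline."""
    seed = min(labels, key=lambda l: len(posting[l]))
    others = [set(posting[l]) for l in labels if l != seed]
    cand = np.array([i for i in posting[seed]
                     if all(i in s for s in others)], dtype=int)
    if cand.size == 0:
        return cand
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

def or_query(posting, vectors, q, labels, k):
    """OR: merge the posting lists of all requested labels."""
    cand = np.array(sorted({i for l in labels for i in posting[l]}),
                    dtype=int)
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]
```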
Such systems are compatible with e-commerce search, multimedia retrieval, social network item discovery, large-scale text/database embedding search, and online recommendation workloads.
6. System Deployment, Limitations, and Future Directions
Systems such as VecFlow and RTAMS-GANNS have seen real-world industrial deployment, supporting hundreds of millions of daily users. Key deployment notes:
- Block-based memory management and batch resource pooling allow sustained performance under heavy, mixed workloads.
- Online insertion for real-time applications is supported using atomic operations and memory block linking; periodic in-place rearrangement maintains search efficiency under fragmentation.
- GPU memory remains a bottleneck for extremely large datasets; hybrid CPU-GPU pipelines such as PilotANN (2503.21206) overcome this by restricting GPU search to a candidate subgraph with SVD-reduced vectors, followed by CPU refinement.
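The two-stage hybrid idea can be sketched with NumPy, assuming an SVD basis computed from the data and a brute-force stand-in for each stage (the names `svd_reduce` and `hybrid_search` are illustrative, not PilotANN's API):

```python
import numpy as np

def svd_reduce(vectors, rank):
    """Project vectors to a low-rank subspace (SVD basis from the
    data) so the candidate stage fits in GPU memory."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    basis = vt[:rank].T                      # (d, rank)
    return vectors @ basis, basis

def hybrid_search(vectors, reduced, basis, q, k, candidates=32):
    """Stage 1 ("GPU"): shortlist in the reduced space.
    Stage 2 ("CPU"): re-rank the shortlist at full precision."""
    qr = q @ basis
    d1 = np.linalg.norm(reduced - qr, axis=1)
    short = np.argsort(d1)[:candidates]
    d2 = np.linalg.norm(vectors[short] - q, axis=1)
    return short[np.argsort(d2)[:k]]
```

The design choice is that the cheap reduced-space pass only has to keep the true neighbors inside the shortlist; full-precision refinement then recovers exact ordering among the candidates.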
Open research directions involve further adaptation to dynamic online environments, efficient distributed deployment across GPU clusters, and integration with more complex filter expressions and multi-attribute hierarchies.
7. Reference Table: Summary of Key Techniques
| System | Core Index Structure | Filtering Support | GPU Optimization | Performance Highlights |
|---|---|---|---|---|
| VecFlow | Label-centric dual IVF/BFS | Multi-label, AND/OR | Persistent kernel, interleaved mem | 5M QPS @ 90% recall (A100) |
| UNIFY (HSIG) | Hierarchical segmented PG | Range, hybrid, all | Parallel batched graph search | SOTA across all ranges |
| CAPS | Partition + attribute tree | Conjunctive, multi-attr | Data-parallel subpartitioning | 10× smaller index |
| RTAMS-GANNS | Block-based IVF, multi-stream | Dynamic, online | Block chains, multi-stream exec | 40–80% latency reduction |
| PilotANN | Hybrid CPU-GPU graph traversal | Dynamic, memory-bound | SVD reduction, fast entry selection | 3.9–5.4× CPU QPS |
Conclusion
Filtered-ANNS on GPUs now encompasses a rich landscape of algorithmic and systems-level innovations, supporting high-selectivity, high-throughput, and low-latency approximate nearest neighbor queries under filtering constraints. Advances such as label-centric dual indexing, hierarchical hybrid filtering graphs, memory-aware block management, and persistent execution models collectively enable production-scale, real-time filtered vector search leveraging the capabilities of modern GPU hardware. Empirical validations confirm state-of-the-art recall and throughput in both academic and industrial contexts, with open-source implementations facilitating adoption and further development in the community.