Filtered-ANNS on GPUs
- Filtered-ANNS on GPUs is a family of algorithms and data structures that combines fast candidate selection with metadata filtering on GPU hardware.
- They employ GPU-specific techniques like label-centric indexing and memory coalescing to achieve scalable, low-latency performance in real-time applications.
- Empirical results highlight significant improvements in throughput and efficiency, outperforming traditional CPU-based methods for large-scale, filtered vector searches.
Filtered-ANNS on GPUs refers to a collection of algorithms, indexing techniques, and system-level optimizations that enable performant and scalable approximate nearest neighbor search (ANNS) under user-specified filtering constraints on GPU hardware. This encompasses methods that combine sublinear-time candidate selection with filtering by metadata, labels, or attribute ranges, leveraging the massive parallelism and memory systems unique to GPUs. Recent research demonstrates that, through tailored data structures and architecture-aware implementations, filtered-ANNS can reach million- or billion-scale query throughput while maintaining high recall and supporting real-time, online, and multi-label filtering in production-scale environments.
1. Problem Definition and Scope
Filtered-ANNS addresses the challenge of retrieving the top-$k$ nearest vectors to a query $q$ from a dataset $X$, subject to additional constraints: typically, the requirement that vectors match specific attribute predicates such as labels, numeric ranges, or multi-attribute conjunctions. Formally, the goal is to compute $\operatorname{top}\text{-}k_{\,x \in X,\; f(x)=1}\; d(q, x)$,
where $d(\cdot,\cdot)$ is the distance function and the filter predicate $f$ may specify single or multiple attribute constraints (e.g., $a(x) \in [l, u]$ for a range, or $t \in T(x)$ for label membership).
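As a concrete baseline for this definition, the following is a minimal NumPy sketch of exact filtered top-$k$ search by brute force (the function name `filtered_topk` and the single-label filter model are illustrative, not from any of the cited systems); real systems replace the linear scan with the sublinear indexes described below.

```python
import numpy as np

def filtered_topk(queries, vectors, labels, allowed, k):
    """Exact filtered top-k by brute force: restrict the dataset to
    vectors whose label satisfies the predicate (membership in
    `allowed`), then rank the survivors by Euclidean distance."""
    mask = np.isin(labels, list(allowed))
    idx = np.nonzero(mask)[0]
    if idx.size == 0:
        return [np.array([], dtype=int) for _ in queries]
    sub = vectors[idx]
    out = []
    for q in queries:
        d = np.linalg.norm(sub - q, axis=1)
        out.append(idx[np.argsort(d)[:k]])  # map back to global ids
    return out
```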
The field includes filter-aware graph-based methods, partition-based indexes, dual-structured label-centric approaches, and system-level GPU optimizations that handle both candidate search and filter verification in a high-throughput setting. Research also addresses online insertion, multi-stream execution, and memory-boundedness on accelerators.
2. Algorithmic Foundations and Data Structures
2.1 Label-centric Inverted File Indexing
VecFlow (2506.00812) proposes a dual-structured label-centric IVF index, partitioning each label’s posting list by specificity. High-specificity labels (frequent) use a GPU-friendly graph (IVF-Graph); low-specificity labels (rare) use an optimized brute-force search (IVF-BFS).
This approach supports both single- and multi-label queries (AND/OR), exploits redundancy-bypassing to minimize memory use, and fuses filtering with distance computation.
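The specificity-based routing can be illustrated with a small sketch, assuming a toy threshold and pure-Python posting lists (the names `SPECIFICITY_THRESHOLD`, `build_label_index`, and `search_label` are illustrative; VecFlow's actual IVF-Graph and IVF-BFS kernels are GPU implementations):

```python
import numpy as np

SPECIFICITY_THRESHOLD = 3  # assumed cutoff; the real system tunes this

def build_label_index(labels):
    """Label-centric posting lists, split by specificity: frequent
    labels would back a graph index (IVF-Graph), rare labels a
    brute-force list (IVF-BFS). Here both sides just store ids."""
    posting = {}
    for i, ls in enumerate(labels):
        for l in ls:
            posting.setdefault(l, []).append(i)
    return {l: ("graph" if len(v) >= SPECIFICITY_THRESHOLD else "bfs",
                np.array(v))
            for l, v in posting.items()}

def search_label(index, vectors, q, label, k):
    kind, ids = index[label]
    # A real system would dispatch on `kind`; this sketch always
    # brute-forces over the label's posting list.
    d = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(d)[:k]]
```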
2.2 Proximity Graphs with Filter Support
Systems such as UNIFY (2412.02448) and Filtered-DiskANN segment datasets by attribute (e.g., range bins or labels) and construct inclusive proximity graphs (SIG) ensuring that, for any segment or segment combination, the induced subgraph supports efficient hybrid search. The Hierarchical Segmented Inclusive Graph (HSIG) variant incorporates HNSW-like structure and skip lists to support all three filtering strategies:
- Pre-filtering: range filter before search.
- Hybrid filtering: extract subgraph for relevant segments and search only those.
- Post-filtering: filter after unconstrained search.
Proximity graphs for each label or segment are implemented as subgraphs with metadata for efficient edge masking and candidate filtering.
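The pre- versus post-filtering trade-off can be sketched for a numeric range predicate, assuming a flat array of attribute values and brute-force ranking in place of graph traversal (both function names and the `overfetch` parameter are illustrative):

```python
import numpy as np

def pre_filter_search(vectors, attrs, q, lo, hi, k):
    """Pre-filtering: apply the range predicate first, then search
    only the qualifying vectors (best when the range is narrow)."""
    ids = np.nonzero((attrs >= lo) & (attrs <= hi))[0]
    d = np.linalg.norm(vectors[ids] - q, axis=1)
    return ids[np.argsort(d)[:k]]

def post_filter_search(vectors, attrs, q, lo, hi, k, overfetch=4):
    """Post-filtering: run an unconstrained search with an enlarged
    candidate budget, then discard candidates outside the range
    (best when most of the dataset passes the filter)."""
    d = np.linalg.norm(vectors - q, axis=1)
    cand = np.argsort(d)[:k * overfetch]
    keep = cand[(attrs[cand] >= lo) & (attrs[cand] <= hi)]
    return keep[:k]
```

Hybrid filtering sits between the two: it traverses only the subgraph induced by the matching segments, avoiding both the full scan of pre-filtering and the wasted candidates of post-filtering.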
2.3 Partition-Based Indices
CAPS (2308.15014) structures the index with a first-stage vector clustering (e.g., KMeans) and a second-stage subpartitioning via Attribute Frequency Trees (AFT), enabling highly parallel scan/filter per cluster and subpartition. Such designs allow for efficient conjunctive queries and dynamic filter selectivity handling.
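A minimal sketch of this two-stage layout, assuming fixed centroids and a flat per-attribute dictionary standing in for the Attribute Frequency Tree (the names `build_caps_index` and `caps_query` are illustrative):

```python
import numpy as np
from collections import defaultdict

def build_caps_index(vectors, attrs, centroids):
    """Stage 1: assign each vector to its nearest centroid.
    Stage 2: subpartition each cluster by attribute value (a flat
    stand-in for the Attribute Frequency Tree)."""
    assign = np.argmin(
        np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2),
        axis=1)
    index = defaultdict(lambda: defaultdict(list))
    for i, (c, a) in enumerate(zip(assign, attrs)):
        index[c][a].append(i)
    return index

def caps_query(index, vectors, centroids, q, attr, k, nprobe=2):
    """Scan the `nprobe` closest clusters, touching only the
    subpartition that matches the attribute predicate."""
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = np.array([i for c in order for i in index[c].get(attr, [])],
                    dtype=int)
    if cand.size == 0:
        return cand
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]
```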
2.4 Filter-Accelerated Candidate Management
High-performance set membership and counting filters such as TCF and GQF (2212.09005) provide thread-safe duplicate removal, visit counting, and adaptive candidate pools during graph or partition-based search. These filters are designed to exploit GPU memory layouts, atomic operations, and bulk APIs for scalable candidate filtering in real-time.
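The role such filters play in candidate management can be illustrated with a toy counting filter (this single-threaded Python class is only a conceptual stand-in; the actual TCF/GQF designs pack fingerprints into GPU-resident slots and update them with atomic operations):

```python
class CountingFilter:
    """Toy stand-in for a GPU counting filter: tracks how often each
    candidate id has been seen so traversal can skip already-visited
    nodes. Hash collisions may overcount, never undercount, which is
    safe for duplicate suppression."""

    def __init__(self, slots=1024):
        self.slots = slots
        self.counts = [0] * slots

    def insert(self, key):
        self.counts[hash(key) % self.slots] += 1

    def count(self, key):
        return self.counts[hash(key) % self.slots]

    def seen(self, key):
        return self.count(key) > 0
```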
3. GPU Architecture-Aware Optimizations
3.1 Memory Layout and Access
Efficient filtered-ANNS on GPUs requires coalesced memory access and minimized bank conflicts. Innovations include:
- Custom 8-bit floating point (FP8) storage for PQ lookup tables (2301.06672), reducing shared-memory bank conflicts in IVFPQ lookups to $1$ (i.e., conflict-free access), thereby maximizing throughput with negligible recall penalty.
- Interleaved memory storage for small clusters (IVF-BFS) and persistent GPU kernels for streaming workloads (2506.00812).
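The effect of interleaving can be sketched in NumPy, assuming a warp-sized group of 32 lanes (the function name `interleave` is illustrative); the point is that after the transpose, lane $t$ of a warp reads element $j$ of vector $t$ from consecutive addresses, i.e., a coalesced transaction:

```python
import numpy as np

def interleave(cluster, group=32):
    """Reorder an (n, d) cluster so that, within each block of
    `group` consecutive vectors, storage is dimension-major: all
    first components together, then all second components, etc.
    Pads n up to a multiple of `group` with zeros."""
    n, d = cluster.shape
    pad = (-n) % group
    padded = np.vstack([cluster, np.zeros((pad, d), dtype=cluster.dtype)])
    # (n_groups, group, d) -> (n_groups, d, group)
    return padded.reshape(-1, group, d).transpose(0, 2, 1).copy()
```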
3.2 Execution Model
Systems such as RTAMS-GANNS (2408.02937) decouple search and insertion via multi-stream execution, allocating separate CUDA streams and resource pools for different operations. Vectors are managed in small, pointer-linked blocks supporting atomic parallel insertion, dynamic rearrangement (in-place defragmentation), and batch resource allocation.
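The block-linked storage idea can be sketched as follows, assuming a tiny block capacity for readability (the class `BlockList` is illustrative; in a CUDA implementation the tail cursor would be an atomic counter and blocks would be pooled, not allocated per insert):

```python
class BlockList:
    """Pointer-linked fixed-size blocks: insertion only touches the
    tail block's cursor, and a full block links to a fresh one, so
    readers can scan earlier blocks while writers append."""
    BLOCK = 4  # assumed capacity; real systems use larger blocks

    def __init__(self):
        self.blocks = [[]]

    def insert(self, vec_id):
        tail = self.blocks[-1]
        if len(tail) == self.BLOCK:   # block full: link a new one
            tail = []
            self.blocks.append(tail)
        tail.append(vec_id)

    def __iter__(self):
        for b in self.blocks:
            yield from b
```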
3.3 Redundancy-Bypassing and Graph Compaction
To avoid vector replication for the exponential number of label or attribute combinations, redundancy-bypassing is implemented (2506.00812): vectors are stored in a single global array, with local-to-global index mapping for each virtual per-label graph. Compacted adjacency lists and mapping tables ensure memory efficiency even at scale.
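A minimal sketch of this local-to-global mapping, with illustrative label names and a hypothetical `label_distance` helper: each virtual per-label graph addresses vectors by local id and resolves to the single global array only for the distance computation, so no vector is ever copied per label.

```python
import numpy as np

# Vectors stored exactly once in a global array.
vectors = np.random.default_rng(0).normal(size=(6, 4)).astype(np.float32)

# Per-label local->global index maps (illustrative labels).
local_to_global = {
    "red":  np.array([0, 2, 5]),
    "blue": np.array([1, 2, 3, 4]),
}

def label_distance(label, local_id, q):
    """Distance from q to the `local_id`-th vector of a label's
    virtual graph, resolved through the mapping table."""
    g = local_to_global[label][local_id]
    return float(np.linalg.norm(vectors[g] - q))
```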
4. Performance, Scalability, and Empirical Results
Filtered-ANNS on GPUs now routinely exceeds 1M QPS at high recall on practical datasets, as confirmed by VecFlow (2506.00812) (5M QPS at 90% recall on an A100), and maintains millisecond-scale latency under mixed online workloads (RTAMS-GANNS (2408.02937)). Notable findings include:
- Label-centric GPU designs outperform CPU-filtered graph approaches (e.g., Filtered-DiskANN), delivering substantially higher QPS in the same accuracy regime.
- Partition- and graph-based systems achieve strong scalability: CAPS builds indexes up to 10× smaller than graph-based methods, enabling all-GPU batch search at scale.
- Persistent kernels and interleaved memory layouts support both batch and streaming (single/small batch) workloads with linear hardware scaling.
- Near-linear speedup is achievable across multiple heterogeneous GPUs for the matrix-multiplication core routines (1511.04348), subject to load balancing and bandwidth management.
5. Filtered Query Semantics and Supported Workloads
Modern filtered-ANNS supports:
- Single-label, multi-label (AND/OR) queries: Each label maintains a posting list; queries are routed through smallest (most selective) list for AND, or through multiple merged lists for OR (2506.00812).
- Numeric range queries: SIG/HSIG (2412.02448) partitions vectors by range and supports hybrid strategies, with adaptive selection based on estimated result count.
- Arbitrary attribute predicates: Attribute metadata is indexed for fast filtering, with candidate verification occurring inline to vector distance calculation to maximize computation/communication overlap.
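The AND/OR routing described above can be sketched with plain posting lists (the functions `and_query` and `or_query` are illustrative; a GPU implementation fuses the membership checks into the distance kernel rather than materializing Python sets):

```python
import numpy as np

def and_query(posting, vectors, q, labels, k):
    """AND: route through the smallest (most selective) posting
    list, verifying the remaining labels inline."""
    seed = min(labels, key=lambda l: len(posting[l]))
    others = [set(posting[l]) for l in labels if l != seed]
    cand = np.array([i for i in posting[seed]
                     if all(i in s for s in others)], dtype=int)
    if cand.size == 0:
        return cand
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

def or_query(posting, vectors, q, labels, k):
    """OR: merge the posting lists of all requested labels."""
    cand = np.array(sorted({i for l in labels for i in posting[l]}),
                    dtype=int)
    d = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]
```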
Such systems are compatible with e-commerce search, multimedia retrieval, social network item discovery, large-scale text/database embedding search, and online recommendation workloads.
6. System Deployment, Limitations, and Future Directions
Systems such as VecFlow and RTAMS-GANNS have seen real-world industrial deployment, supporting hundreds of millions of daily users. Key deployment notes:
- Block-based memory management and batch resource pooling allow sustained performance under heavy, mixed workloads.
- Online insertion for real-time applications is supported using atomic operations and memory block linking; periodic in-place rearrangement maintains search efficiency under fragmentation.
- GPU memory remains a bottleneck for extremely large datasets; hybrid CPU-GPU pipelines such as PilotANN (2503.21206) overcome this by restricting GPU search to a candidate subgraph with SVD-reduced vectors, followed by CPU refinement.
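The two-stage hybrid idea can be sketched with NumPy, assuming an SVD basis computed from the data and a brute-force stand-in for each stage (the names `svd_reduce` and `hybrid_search` are illustrative, not PilotANN's API):

```python
import numpy as np

def svd_reduce(vectors, rank):
    """Project vectors to a low-rank subspace (SVD basis from the
    data) so the candidate stage fits in GPU memory."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    basis = vt[:rank].T                      # (d, rank)
    return vectors @ basis, basis

def hybrid_search(vectors, reduced, basis, q, k, candidates=32):
    """Stage 1 ("GPU"): shortlist in the reduced space.
    Stage 2 ("CPU"): re-rank the shortlist at full precision."""
    qr = q @ basis
    d1 = np.linalg.norm(reduced - qr, axis=1)
    short = np.argsort(d1)[:candidates]
    d2 = np.linalg.norm(vectors[short] - q, axis=1)
    return short[np.argsort(d2)[:k]]
```

The design choice is that the cheap reduced-space pass only has to keep the true neighbors inside the shortlist; full-precision refinement then recovers exact ordering among the candidates.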
Open research directions involve further adaptation to dynamic online environments, efficient distributed deployment across GPU clusters, and integration with more complex filter expressions and multi-attribute hierarchies.
7. Reference Table: Summary of Key Techniques
| System | Core Index Structure | Filtering Support | GPU Optimization | Performance Highlights |
|---|---|---|---|---|
| VecFlow | Label-centric dual IVF/BFS | Multi-label, AND/OR | Persistent kernel, interleaved mem | 5M QPS @ 90% recall (A100) |
| UNIFY (HSIG) | Hierarchical segmented PG | Range, hybrid, all | Parallel batched graph search | SOTA across all ranges |
| CAPS | Partition + attribute tree | Conjunctive, multi-attr | Data-parallel subpartitioning | 10× smaller index |
| RTAMS-GANNS | Block-based IVF, multi-stream | Dynamic, online | Block chains, multi-stream exec | 40–80% latency reduction |
| PilotANN | Hybrid CPU-GPU graph traversal | Dynamic, memory-bound | SVD reduction, fast entry selection | 3.9–5.4× CPU QPS |
Conclusion
Filtered-ANNS on GPUs now encompasses a rich landscape of algorithmic and systems-level innovations, supporting high-selectivity, high-throughput, and low-latency approximate nearest neighbor queries under filtering constraints. Advances such as label-centric dual indexing, hierarchical hybrid filtering graphs, memory-aware block management, and persistent execution models collectively enable production-scale, real-time filtered vector search leveraging the capabilities of modern GPU hardware. Empirical validations confirm state-of-the-art recall and throughput in both academic and industrial contexts, with open-source implementations facilitating adoption and further development in the community.