BatANN: Distributed Disk-Based ANN Search
- BatANN is a distributed, disk-based ANN search system designed for massive vector datasets using a global proximity graph and an efficient baton-passing protocol.
- It partitions the global graph with a neighborhood-aware algorithm to minimize inter-server I/O and maintain logarithmic search efficiency.
- Performance evaluations demonstrate low latency (<6 ms) and high recall (0.95 recall@10) with near-linear throughput scaling across multiple servers.
BatANN is a distributed, disk-based approximate nearest neighbor (ANN) search system designed to support high-throughput, low-latency vector search over datasets too large to fit in memory or on a single server. By integrating a single global proximity graph, neighborhood-aware partitioning, and a baton-passing protocol that transfers full query state across servers, BatANN ensures logarithmic search efficiency, minimal inter-server I/O, and near-linear throughput scaling as the number of servers increases. It is the first open-source distributed disk-based vector search system to operate over a single global graph (Dang et al., 10 Dec 2025).
1. Problem Setting and Motivation
Modern information-retrieval applications, including retrieval-augmented generation (RAG), large-scale image search, and recommendation, require efficient k-nearest-neighbor (k-NN) queries in high-dimensional embedding spaces. Brute-force (exact) search scales at least linearly with dataset size and degrades further with dimensionality, making it impractical for large datasets. Graph-based ANN methods, which empirically converge in near-logarithmic time by greedily traversing proximity graphs, are the dominant scalable solution.
As dataset sizes approach billions of points, global index structures exceed DRAM capacity. Disk-based systems (e.g., DiskANN, Starling) store full embeddings and neighbor lists on SSDs, retaining quantized vector summaries (PQ codes) in memory to prune candidate sets. However, a single-server design is ultimately limited by SSD IOPS and bandwidth. Distributing storage and computation across multiple machines can raise query-per-second (QPS) throughput, but naïve sharding ("scatter–gather") strategies lose logarithmic search efficiency and waste I/O. Prior distributed solutions based on global graphs incur high inter-server latency due to round-trip communication per graph hop.
Target system metrics are Recall@k (e.g., 0.95 at k = 10), throughput (queries per second, QPS) at fixed recall, and end-to-end latency below 6 ms even at very high QPS (Dang et al., 10 Dec 2025).
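For concreteness, Recall@k has the standard definition below; the formula is stated here for reference and is not quoted from the paper.

```latex
% Standard definition: A_q is the returned candidate set and G_q the true
% k-nearest-neighbor set of query q (both of size k).
\mathrm{Recall@}k \;=\; \frac{|A_q \cap G_q|}{k}
```

At k = 10, the 0.95 target means that, on average, 9.5 of the 10 true nearest neighbors appear in the returned list.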
2. Index Structure and System Architecture
2.1 Global Proximity Graph (Vamana)
BatANN constructs a single global proximity graph over the entire dataset using the Vamana algorithm. Each node in the graph (representing a datapoint) maintains a bounded number of out-edges, selected by a diversifying pruning heuristic to improve graph navigability. Formally, each node v retains a bounded-size neighbor set N(v), chosen so that greedy traversal from any starting point converges to the true nearest neighbors in O(log n) steps.
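The diversifying selection can be illustrated with a short sketch in the spirit of Vamana's alpha-pruning rule; the function name, default parameters (`R=64`, `alpha=1.2`), and the distance computation below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def robust_prune(point, candidates, data, R=64, alpha=1.2):
    """Diversifying out-edge selection in the spirit of Vamana's alpha-pruning.

    point      : id of the node whose out-edges are being chosen
    candidates : iterable of candidate neighbor ids
    data       : (n, d) array of embeddings
    R, alpha   : max out-degree and diversification factor (illustrative values)
    """
    # Closest candidates first.
    cand = sorted(set(candidates) - {point},
                  key=lambda c: np.linalg.norm(data[c] - data[point]))
    neighbors = []
    while cand and len(neighbors) < R:
        p_star = cand.pop(0)          # keep the closest remaining candidate
        neighbors.append(p_star)
        # Discard candidates that p_star already "covers"; keeping only points
        # that are not much closer to p_star than to `point` spreads the edges
        # in diverse directions and improves navigability of the graph.
        cand = [c for c in cand
                if alpha * np.linalg.norm(data[p_star] - data[c])
                > np.linalg.norm(data[point] - data[c])]
    return neighbors

rng = np.random.default_rng(0)
pts = rng.standard_normal((1_000, 16)).astype(np.float32)
print(robust_prune(0, range(1, 1_000), pts, R=8)[:5])
```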
Indexing details:
- On-disk: Each vector's full (float) embedding and its neighbor ID list are stored, typically compressed to fit within a 4 KB SSD sector.
- In-memory: Product-quantized (PQ) codes, consuming 32 bytes per 128-dimensional vector, guide the search; overall DRAM usage is therefore roughly 32 bytes per indexed point (see the arithmetic sketch after this list).
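A back-of-the-envelope sketch of the layout arithmetic; the field sizes (4-byte IDs, float32 embeddings, out-degree 64) are assumptions chosen for illustration rather than values from the paper.

```python
def node_disk_bytes(dim=128, max_degree=64, id_bytes=4, float_bytes=4):
    """Approximate on-disk footprint of one graph node (assumed field sizes)."""
    embedding = dim * float_bytes               # e.g. 128 * 4 = 512 bytes
    neighbor_list = 4 + max_degree * id_bytes   # degree count + neighbor ids
    return embedding + neighbor_list            # well under one 4 KB sector

def pq_dram_bytes(n_points, pq_bytes=32):
    """In-memory PQ summary: ~32 bytes per vector, replicated on every server."""
    return n_points * pq_bytes

assert node_disk_bytes() <= 4096                # one node fits in a 4 KB read
print(f"{pq_dram_bytes(10**9) / 2**30:.1f} GiB of PQ codes for 1B points")
```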
2.2 Distributed Graph Partitioning
The global graph is partitioned into m server shards using a neighborhood-aware graph partitioner (Gottesbüren et al., 4 Mar 2024), minimizing off-server beam-search hops. Empirical results indicate that, with 10 servers, only 20–25% of graph hops cross server boundaries.
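The quantity the partitioner tries to keep small can be made concrete with a tiny helper that measures, for a given assignment, the fraction of graph edges whose endpoints land on different servers; this is an illustrative proxy for the objective, not the partitioner itself.

```python
def cross_partition_edge_fraction(adjacency, assignment):
    """Fraction of directed edges whose endpoints live on different servers.

    adjacency  : dict node_id -> list of neighbor ids
    assignment : dict node_id -> server id
    Fewer cross-partition edges means fewer beam-search hops that must leave
    the local server, and hence fewer baton passes.
    """
    total = cross = 0
    for u, neighbors in adjacency.items():
        for v in neighbors:
            total += 1
            cross += assignment[u] != assignment[v]
    return cross / total if total else 0.0

# Toy example: a 4-node cycle split across 2 servers -> half the edges cross.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(cross_partition_edge_fraction(adj, {0: 0, 1: 0, 2: 1, 3: 1}))  # 0.5
```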
Each server is provisioned as follows:
- Replicated PQ codes for all vectors.
- Local SSD storage for embeddings and neighbor lists of its partition.
- An in-memory “head” graph for rapid starting-point selection, built over a roughly 1% random sample of the points (see the sketch after this list).
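A minimal sketch of how an in-memory head index can pick an entry point; BatANN builds a navigable graph over the sample, whereas this sketch simply scans the sampled vectors, which yields the same kind of answer (a good nearby starting node) at this small scale.

```python
import numpy as np

def pick_entry_point(query, sample_ids, data):
    """Return the global id of the sampled point closest to the query.

    `sample_ids` indexes a ~1% random sample kept in DRAM; scanning it is cheap
    and yields a good starting node for the on-disk beam search.
    """
    dists = np.linalg.norm(data[sample_ids] - query, axis=1)
    return int(sample_ids[np.argmin(dists)])

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 128)).astype(np.float32)
sample_ids = rng.choice(len(data), size=len(data) // 100, replace=False)
print(pick_entry_point(data[42], sample_ids, data))
```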
3. The Baton-Passing Protocol
3.1 Motivation
Previous distributed global-graph approaches incur at least two network round-trips per off-server neighbor access (request + response). BatANN’s key innovation is to transfer ("pass") the entire query’s state to the owning server, which then continues the beam search locally. This asynchronous baton-passing approach reduces per-hop network overhead by nearly half and aligns with modern high-throughput networking and SSD architectures.
3.2 Query State and Protocol
A query’s “baton” (its serialized search state) includes:
- The beam (fixed width; each entry holds a node ID and its approximate distance),
- The explored (visited) node set,
- The full-precision result list for reranking,
- The search parameters and the query embedding.
For typical settings, the baton message size is 4–8 KB.
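A minimal sketch of what such a baton could look like as a data structure; the field names, default values, and the use of pickle as the wire format are assumptions for illustration, not BatANN’s actual encoding.

```python
from dataclasses import dataclass, field
import pickle

@dataclass
class Baton:
    """Query state handed from server to server (illustrative field names)."""
    query: list                      # full query embedding
    beam: list                       # (node id, approximate PQ distance) pairs
    explored: set = field(default_factory=set)
    results: list = field(default_factory=list)  # (id, full-precision distance)
    k: int = 10                      # number of neighbors requested
    beam_width: int = 8              # illustrative default

def baton_bytes(b: Baton) -> int:
    """Rough wire size of a baton; pickle stands in for the real serializer."""
    return len(pickle.dumps(b))

b = Baton(query=[0.0] * 128, beam=[(3, 1.2), (17, 1.5)])
print(baton_bytes(b))   # grows to a few KB once beam/explored/results fill up
```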
The distributed beam-search protocol operates as follows:
- The client hashes the query to a designated entry server, which performs an in-memory head-index search and initializes the baton.
- While unexplored beam nodes remain:
- Select up to beam-width unexplored nodes from the beam.
- If any are local: issue concurrent io_uring SSD reads for them and update the beam and explored set.
- If all are remote: select the node with minimum approximate distance, serialize the baton, and transfer it to the server owning that node; that server resumes execution.
- Once all beam nodes are explored, the server currently holding the baton returns the top-k results to the client.
Each baton pass incurs serialization latency plus a single one-way TCP transfer (25 GbE) per inter-server hop, rather than a full request/response round trip.
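The per-hop decision between local I/O and a baton pass can be sketched as below. The callback signatures (`read_node`, `owner_of`, `pq_dist`, `send_baton`, `reply`) and the beam-pruning cap are assumptions, and the sequential reads stand in for the concurrent io_uring reads of the real system.

```python
import math

def process_baton(baton, local_ids, read_node, owner_of, pq_dist, send_baton, reply):
    """One server's turn holding a query baton (schematic sketch).

    read_node(v) -> (full_vector, neighbor_ids)   one local SSD read
    owner_of(v)  -> server id storing node v      from the partition map
    pq_dist(v)   -> approximate distance of node v to the baton's query
    send_baton(server, baton) forwards the serialized state;
    reply(results) returns the final top-k list to the client.
    """
    def exact_dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, baton.query)))

    while True:
        frontier = sorted((d, v) for v, d in baton.beam if v not in baton.explored)
        if not frontier:                  # beam fully explored: answer the query
            reply(sorted(baton.results, key=lambda r: r[1])[:baton.k])
            return
        batch = [v for _, v in frontier[:baton.beam_width]]
        local = [v for v in batch if v in local_ids]

        if local:
            # Expand every local candidate; the real system issues these SSD
            # reads concurrently via io_uring, here they run one after another.
            for v in local:
                vec, nbrs = read_node(v)
                baton.explored.add(v)
                baton.results.append((v, exact_dist(vec)))   # full-precision rerank
                baton.beam.extend((u, pq_dist(u)) for u in nbrs
                                  if u not in baton.explored)
            # Keep only the best entries so the baton stays a few KB (the cap
            # of 4x the beam width is an arbitrary illustrative choice).
            baton.beam = sorted(set(baton.beam), key=lambda e: e[1])[:4 * baton.beam_width]
        else:
            # All unexplored candidates are remote: hand the entire query state
            # to the server owning the closest one; no reply comes back here.
            send_baton(owner_of(batch[0]), baton)
            return
```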
4. Complexity, Efficiency, and Optimizations
4.1 Computational Cost
On a single server holding n points, greedy beam search converges in O(log n) steps. Retaining a single global graph ensures the total hop count stays O(log n) regardless of the number of servers m. Each step issues up to beam-width concurrent SSD reads, and with high-random-IOPS NVMe SSDs (on the order of 300K IOPS), widening the beam has negligible impact on per-step read cost.
4.2 Throughput Scaling
Because per-query disk I/O and distance computation are nearly constant irrespective of the number of servers m, distributing the same workload across m servers enables near-linear scalability: throughput can be modeled as roughly m times single-server throughput, discounted by a small term that accounts for inter-server hop overhead.
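The exact scaling model is not reproduced in this summary; one plausible form, with ε standing in (as an assumption) for the inter-server overhead term, is:

```latex
% Illustrative form only; \epsilon is an assumed stand-in for the
% inter-server hop overhead, not a constant taken from the paper.
\mathrm{QPS}(m) \;\approx\; \frac{m \cdot \mathrm{QPS}(1)}{1 + \epsilon},
\qquad 0 < \epsilon \ll 1
```

With only 20–25% of hops leaving the server and multi-KB batons over 25 GbE, the overhead term stays small, which is consistent with the near-linear scaling reported below.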
4.3 Locality and Algorithmic Heuristics
Graph partitioning (Gottesbüren et al., 4 Mar 2024) keeps 75–88% of hops local, limiting baton passes. A locality heuristic additionally processes all local candidates before considering a baton pass, further reducing inter-server communication.
4.4 Batching and Caching
Each search thread processes a fixed batch of 8 queries in a pipelined, interleaved manner to overlap SSD I/O with computation, yielding 20–30% higher single-server throughput. The query embedding is cached after its initial transmission and reused until the final result is acknowledged, avoiding duplicate transfers.
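A toy illustration of the interleaving idea using coroutines; the real system drives io_uring completions from worker threads, so the asyncio structure, hop count, and delays here are only stand-ins for that pipeline.

```python
import asyncio

async def run_query(qid, hops=6, io_delay=0.001):
    """Stand-in for one query's beam search: alternate SSD waits and compute."""
    for _ in range(hops):
        await asyncio.sleep(io_delay)            # models a 4 KB SSD read
        _ = sum(i * i for i in range(2_000))     # models PQ distance computations
    return qid

async def worker(batch):
    """One search thread keeps a fixed batch of queries in flight, so the CPU
    computes for one query while the SSD reads of the others are pending."""
    return await asyncio.gather(*(run_query(q) for q in batch))

# Mirrors the fixed per-thread batch of 8 interleaved queries.
print(asyncio.run(worker(range(8))))
```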
5. Experimental Evaluation
5.1 Datasets and Experimental Setup
The system is evaluated on 10-node CloudLab c6620 clusters (28-core Intel Xeon Gold, 128 GB DRAM, 25 GbE) using standard Big-ANN benchmarks:
- BIGANN: 100M and 1B SIFT 128-dim (uint8), L2 distance,
- MSSPACEV: 100M and 1B SpaceV 100-dim (int8),
- DEEP: 100M 96-dim float conv-net features.
5.2 Baselines
- ScatterGather: Uses the identical graph construction and partitioning, but each shard’s subgraph is queried independently and the results are merged (no single global graph traversal).
- Single-server baselines: DiskANN, PipeANN, CoroSearch (for microbenchmarking).
Indexes are built using Vamana (with the PipeANN builder for the 100M datasets and ParlayANN for 1B).
5.3 Performance Highlights
At Recall@10 = 0.95:
- For 100M points on 10 servers, BatANN achieves a 6.21×–6.49× QPS improvement over ScatterGather (e.g., 6.49× on DEEP).
- For 1B points on 10 servers, BatANN achieves 5.10× (BIGANN) and 2.5× (MSSPACEV).
- BatANN’s total disk I/O and distance calculations remain within 1% of single-server levels; in contrast, ScatterGather’s I/O and compute scale proportionally with the number of servers m.
- QPS scales near-linearly with the number of servers at fixed recall up to 10 servers; ScatterGather saturates after a few shards.
| Dataset | Size | Servers | QPS Speedup (BatANN vs ScatterGather) | Mean Latency (ms) |
|---|---|---|---|---|
| BIGANN | 100M, 1B | 10 | 6.21–6.49× (100M), 5.10× (1B) | <6 |
| MSSPACEV | 100M, 1B | 10 | 6.21–6.49× (100M), 2.5× (1B) | <6 |
| DEEP | 100M | 10 | 6.49× | <6 |
At saturating QPS rates, BatANN maintains mean latency below 6 ms (rising only 13% from 5 to 10 servers), whereas ScatterGather’s tail latency degrades rapidly above 6K QPS.
5.4 Beam Width Ablation
The beam-width ablation shows that widening the beam reduces total and inter-server hops by approximately 4×, yielding over 2× higher QPS and roughly half the latency at 0.95 recall@10, with no extra CPU or I/O cost.
6. Limitations and Future Directions
- Message size: Each baton currently carries the entire result list; streaming only the beam state and piggybacking partial results is a proposed future optimization.
- Disk layout: Incorporating locality-aware disk placement (as in Starling’s 4 KB-aligned strategy) and dynamic pipeline width (as in PipeANN) could further minimize I/O and latency.
- Dynamic updates: Efficient distributed point insert/delete without full repartitioning remains unresolved; multi-node in-place update mechanisms have yet to be generalized.
- Fault tolerance: There is no current replication or fail-over; integration with state machine replication systems such as Derecho is a suggested extension.
- Network optimizations: Although commodity TCP suffices for multi-KB baton messages, the architecture would readily enable RDMA or CXL deployment, whose benefits are yet to be quantified.
- Multitenancy: Resource sharing and scheduling for multiple distinct vector search databases per cluster remain open research issues.
A plausible implication is that BatANN’s methodology may serve as a blueprint for distributed sublinear-time search over massive graph-structured data, given continued progress on distributed dynamic indexing and system-level robustness (Dang et al., 10 Dec 2025).