Distributed Disk-Based Vector Search

Updated 11 December 2025
  • Distributed Disk-Based Vector Search is a technique that leverages SSD-resident, graph-based indices and distributed query execution for high-dimensional similarity search.
  • It combines in-memory caching with SSD and distributed storage to achieve low latency and cost-effective scalability through smart partitioning and beam search.
  • Empirical results show sub-20 ms latency and high recall, demonstrating the practical viability of hybrid architectures at billion scale and beyond.

Distributed disk-based vector search encompasses algorithmic and systems methodologies for performing approximate nearest neighbor search (ANNS) over large collections of high-dimensional vectors, where the index and dataset reside predominantly on SSDs or distributed storage, and queries are executed in a parallel or distributed compute environment. Recent research demonstrates that serving semantic search and retrieval-augmented generation (RAG) workloads over billion-scale vector corpora is feasible on cost-efficient infrastructure, preserving low latency, high recall, and operational scalability by means of hybrid in-memory and disk-resident graph indices, smart partitioning, and coupled distributed query execution protocols (Upreti et al., 9 May 2025, Adams et al., 7 Sep 2025, Dang et al., 10 Dec 2025, Yu et al., 20 Oct 2025).

1. Architectural Dimensions and Representative Systems

Distributed disk-based vector search architectures are characterized by the following dimensions:

  • Index organization: Modern systems employ graph-based ANNS indices (e.g., DiskANN, Vamana, HNSW derivatives) in which both vectors and neighborhood connectivity data are stored on persistent, high-throughput media (SSD, DFS, object stores), and portions of the index (e.g., quantized codes, head-indices) reside in memory for fast access and search initialization (Dang et al., 10 Dec 2025).
  • Partitioning strategies: Approaches range from per-partition (“sharded”) independent ANN graphs (Upreti et al., 9 May 2025) to a single global graph distributed across machines, either by key (random or hash placement) or by balanced graph partitioning that minimizes cut neighbor edges (Adams et al., 7 Sep 2025, Dang et al., 10 Dec 2025).
  • Distributed storage backends: Storage is provided by scalable backends, including cloud DRAM+SSD (Azure Cosmos DB (Upreti et al., 9 May 2025)), distributed key-value stores (Dynamo-style (Adams et al., 7 Sep 2025)), or object/DFS systems (S3, Pangu, HDFS (Yu et al., 20 Oct 2025)).
  • Query execution: Distributed query execution coordinates search across multiple partitions or servers, typically via orchestrated beam or greedy search traversals that overlap network, computation, and disk I/O (Adams et al., 7 Sep 2025, Dang et al., 10 Dec 2025).

A structured comparison of representative systems is given below:

| System | Partitioning | Index Structure | Storage Layer | Coordination/Execution |
|---|---|---|---|---|
| Cosmos DB+DiskANN (Upreti et al., 9 May 2025) | Key-based, multi-partition | Graph per partition (DiskANN) | Bw-Tree on SSD, DRAM cache | Parallel fanned queries |
| DistributedANN (Adams et al., 7 Sep 2025) | Hash/random, single global graph | Distributed DiskANN, in-RAM head-index | Distributed key-value w/ SSD | Orchestrated, near-data scoring |
| BatANN (Dang et al., 10 Dec 2025) | Balanced graph, single global graph | Distributed Vamana | SSD, DRAM PQ head-index | Baton-passing beam search |
| DSANN (Yu et al., 20 Oct 2025) | Graph-cluster hybrid | Point Aggregation Graph (PAG) | DFS/object store, DRAM | Asynchronous greedy + I/O |
2. Index Data Structures and Partitioning Methodologies

The dominant indexing schema is the graph-based proximity structure, most commonly realized via DiskANN (Upreti et al., 9 May 2025), Vamana (Dang et al., 10 Dec 2025), or HNSW-like small-world networks.

DiskANN Index Integration

In Azure Cosmos DB, DiskANN is tightly integrated by introducing two term types into the log-structured Bw-Tree: quantized vector entries storing compressed embeddings per document ID and adjacency-list entries storing neighbors for each node. This approach permits the index to be fully SSD-resident, with only 5 GB of hot quantized codes and neighbor lists cached in memory for 10M vectors. Full-precision vectors are used strictly for a small candidate set re-ranking (Upreti et al., 9 May 2025).
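
The two term types can be pictured as ordinary key-value records keyed by document ID. The sketch below is a hypothetical Python model (a plain dict stands in for the Bw-Tree; the key prefixes, field names, and byte-level encoding are assumptions, not the Cosmos DB schema):

```python
# Minimal sketch (not Cosmos DB internals): the two index term types as
# key-value records, with a plain dict standing in for the Bw-Tree page
# store. Key prefixes and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class QuantizedVectorTerm:
    doc_id: str
    codes: bytes          # product-quantized embedding (e.g., 1 byte/dim)

@dataclass
class AdjacencyTerm:
    doc_id: str
    neighbors: List[str]  # out-edges of this node in the proximity graph

class VectorIndexStore:
    """Stand-in for a Bw-Tree holding both term types side by side."""
    def __init__(self) -> None:
        self._kv: Dict[str, object] = {}

    def put_vector(self, term: QuantizedVectorTerm) -> None:
        self._kv[f"qv/{term.doc_id}"] = term      # quantized-vector term

    def put_adjacency(self, term: AdjacencyTerm) -> None:
        self._kv[f"adj/{term.doc_id}"] = term     # adjacency-list term

    def get_codes(self, doc_id: str) -> bytes:
        return self._kv[f"qv/{doc_id}"].codes

    def get_neighbors(self, doc_id: str) -> List[str]:
        return self._kv[f"adj/{doc_id}"].neighbors
```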

Partitioned vs. Global Graphs

Clustered partitioning, where each partition has an independent index, offers isolation and scalability but sacrifices sublinear search: every partition must be probed, giving a per-query complexity of $O(P \cdot \log(|X|/P))$ for $|X|$ vectors spread over $P$ partitions (Adams et al., 7 Sep 2025). Recent systems—DistributedANN (Adams et al., 7 Sep 2025), BatANN (Dang et al., 10 Dec 2025)—reject this in favor of a single global graph sharded across machines, preserving $O(\log N)$ search hops and increasing recall for a fixed I/O and latency budget.
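
The gap can be made concrete with a back-of-the-envelope calculation. The sketch below assumes hop counts proportional to $\log_2$ of the searched graph size, an idealization of the complexity bounds above:

```python
# Idealized hop-count comparison: P independent per-partition graphs
# versus one global graph, assuming ~log2(n) greedy hops per graph.
import math

N = 1_000_000_000   # total vectors
P = 100             # partitions

hops_sharded = P * math.log2(N / P)   # every partition is searched
hops_global  = math.log2(N)           # single global graph

print(f"sharded: ~{hops_sharded:.0f} hop-equivalents per query")
print(f"global : ~{hops_global:.0f} hops per query")
# sharded: ~2325 hop-equivalents per query
# global : ~30 hops per query
```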

DSANN (Yu et al., 20 Oct 2025) employs a Point Aggregation Graph, building a proximity graph on a memory-resident subset and aggregating residual vectors into partition lists stored on distributed storage. This hybrid structure allows for rapid query-time in-memory graph walks and asynchronous I/O for partition lists.
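
A schematic of the aggregation step is sketched below, under the assumption that aggregation points are a random sample and residual vectors are assigned by nearest aggregation point; the published construction and the on-disk layout differ in detail:

```python
# Schematic point aggregation (illustrative, not the published PAG build):
# keep a sampled subset in memory as aggregation points and assign every
# remaining ("residual") vector to the partition list of its nearest point.
# In a real deployment the partition lists would be written to DFS/object
# storage and a proximity graph (e.g., Vamana) built over the points.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)

n_agg = 256
agg_ids = rng.choice(len(vectors), size=n_agg, replace=False)
agg_set = {int(i) for i in agg_ids}
agg_points = vectors[agg_ids]                       # memory-resident subset

partition_lists = {int(i): [] for i in agg_ids}     # residual ids per point
for vid, v in enumerate(vectors):
    if vid in agg_set:
        continue
    dists = np.linalg.norm(agg_points - v, axis=1)  # distance to every point
    nearest = int(agg_ids[int(np.argmin(dists))])
    partition_lists[nearest].append(vid)
```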

Partition Wiring and Neighbor Locality

Graph partitioning methods (e.g., Gottesbüren et al. (Dang et al., 10 Dec 2025)) are used to co-locate as many neighbor pairs as possible, minimizing the fraction $p$ of graph hops that require cross-machine transfer in search, thus reducing network-induced latency.
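
The payoff of a low cut fraction can be expressed as a simple expectation (illustrative numbers, not reported measurements): if a query performs roughly $H \approx \log_2 N$ hops and a fraction $p$ of them crosses machines, the expected number of network transfers per query is about $p \cdot H$. For $N = 10^9$ ($H \approx 30$), reducing $p$ from 0.5 to 0.05 cuts the expected cross-machine hops from about 15 to about 1.5.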

3. Distributed and Query-Time Algorithms

Beam and Greedy Search over Disk-Based Graphs

All advanced systems use forms of beam or greedy search over the disk-resident proximity graph. In BatANN, a beam of candidate nodes is repeatedly expanded using PQ distances; full SSD fetches occur only for the top beam elements, each of which packs a node’s embedding and neighbor list into a single 4 KB sector. Once local candidates are exhausted, BatANN migrates the entire beam state (“baton passing”) to the appropriate remote server rather than incurring multiple synchronous RPCs per off-server neighbor, which minimizes network overhead (Dang et al., 10 Dec 2025).
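
A minimal single-process sketch of the baton-passing idea follows; it is an illustration under simplifying assumptions, not the BatANN implementation (the `owner()` placement rule, parameter values, and use of full vectors instead of PQ codes are assumptions):

```python
# Single-process sketch of baton-passing beam search over a sharded graph.
# "Servers" are plain dicts mapping node id -> (vector, neighbor list);
# PQ codes, 4 KB sector layout, and RPC serialization are elided.
import heapq
import numpy as np

def owner(node_id: int, n_servers: int) -> int:
    return node_id % n_servers          # assumed hash placement rule

def baton_search(query, servers, entry, beam_width=32, top_k=10):
    n_servers = len(servers)
    entry_vec, _ = servers[owner(entry, n_servers)][entry]
    visited = set()
    # the beam of (distance, node id) pairs is the state that migrates
    beam = [(float(np.linalg.norm(entry_vec - query)), entry)]
    current = owner(entry, n_servers)
    while True:
        local = [(d, n) for d, n in beam
                 if n not in visited and owner(n, n_servers) == current]
        if not local:
            remote = [(d, n) for d, n in beam if n not in visited]
            if not remote:
                break               # every beam entry expanded: done
            # baton pass: ship the whole beam to the server owning the
            # closest unexpanded candidate, not one RPC per neighbor
            current = owner(min(remote)[1], n_servers)
            continue
        dist, node = min(local)
        visited.add(node)
        _, neighbors = servers[current][node]   # one "SSD read": vector + adjacency
        for nb in neighbors:
            if nb not in visited:
                nb_vec, _ = servers[owner(nb, n_servers)][nb]
                # real systems score with in-memory PQ codes; full vectors here
                beam.append((float(np.linalg.norm(nb_vec - query)), nb))
        beam = heapq.nsmallest(beam_width, set(beam))
    return heapq.nsmallest(top_k, set(beam))
```

A driver would shard the per-node dictionaries across processes and serialize the beam for transport; everything lives in one process here for clarity.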

In DistributedANN, the NodeScoring service computes distances and candidate heaps near the storage device, returning compact (id, score) pairs, further reducing network I/O (Adams et al., 7 Sep 2025).
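
A minimal sketch of the near-data scoring idea follows; the function name, signature, and top-k cutoff are assumptions for illustration, not the DistributedANN API:

```python
# Sketch of near-data scoring: the storage-side service reads full
# vectors locally, computes distances, and returns only compact
# (id, score) pairs, so high-dimensional vectors never cross the network.
from typing import Dict, List, Tuple
import numpy as np

def node_scoring(query: np.ndarray,
                 node_ids: List[int],
                 local_vectors: Dict[int, np.ndarray],
                 top: int = 32) -> List[Tuple[int, float]]:
    scored = [(nid, float(np.linalg.norm(local_vectors[nid] - query)))
              for nid in node_ids if nid in local_vectors]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top]            # a few bytes per candidate over the wire
```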

Asynchronous I/O Scheduling

Query execution in DSANN is orchestrated as a stateless greedy walk on the in-memory aggregation graph, with non-blocking requests for partition lists as nodes are explored (Yu et al., 20 Oct 2025). This approach overlaps CPU computation and DFS/network I/O, amortizing long-latency disk operations across multiple query steps and partitions.
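
The overlap can be sketched with asyncio, where the graph walk keeps issuing partition-list reads without waiting for them; the function names, fixed step budget, and simulated latency are illustrative assumptions rather than DSANN's actual scheduler:

```python
# Sketch of overlapping an in-memory graph walk with partition-list I/O.
# fetch_partition_list stands in for a DFS/object-store read.
import asyncio
import numpy as np

async def fetch_partition_list(storage: dict, agg_id: int) -> list:
    await asyncio.sleep(0.005)      # simulated DFS/network latency
    return storage[agg_id]          # residual vector ids for this point

async def search(query, graph, agg_vectors, storage, entry, steps=16):
    pending, node, visited = [], entry, set()
    for _ in range(steps):
        visited.add(node)
        # kick off the partition-list read without blocking the walk
        pending.append(asyncio.create_task(fetch_partition_list(storage, node)))
        cand = [n for n in graph[node] if n not in visited]
        if not cand:
            break
        node = min(cand, key=lambda n: float(np.linalg.norm(agg_vectors[n] - query)))
    lists = await asyncio.gather(*pending)   # I/O completed while the walk ran
    return [vid for lst in lists for vid in lst]   # candidates for re-ranking
```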

Update Operations and Index Consistency

Cosmos DB+DiskANN incorporates an in-place insert/delete protocol based on local rewiring and robust pruning. Blind patches (for insertions) and two-hop rewires (for deletions) allow immediate index updates without full graph merges or rebuilds, keeping recall within 1–3% of the baseline under high churn (Upreti et al., 9 May 2025). However, BatANN and DSANN in their current published forms assume a static snapshot of the dataset; support for online updates with strong consistency remains an open problem (Dang et al., 10 Dec 2025, Yu et al., 20 Oct 2025).
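
A much-simplified rendering of the two primitives as adjacency-list edits is shown below; the real protocol additionally performs robust pruning, maintains quantized codes, and handles concurrency, none of which is modeled here:

```python
# Simplified update primitives over an adjacency map (node id -> neighbors).
# Illustrative only; degree bounds and candidate selection are assumptions.

def blind_patch_insert(adj, new_id, candidate_neighbors, max_degree=32):
    """Insert: wire the new node, then append it to each neighbor's list
    without re-reading or re-pruning that neighbor (the 'blind' patch)."""
    adj[new_id] = list(candidate_neighbors)[:max_degree]
    for nb in adj[new_id]:
        if new_id not in adj[nb]:
            adj[nb].append(new_id)

def two_hop_rewire_delete(adj, victim, max_degree=32):
    """Delete: reconnect each in-neighbor of the victim to the victim's
    out-neighbors (two-hop edges), then drop the victim."""
    out = adj.pop(victim, [])
    for node, neighbors in adj.items():
        if victim in neighbors:
            neighbors.remove(victim)
            for cand in out:
                if cand != node and cand not in neighbors and len(neighbors) < max_degree:
                    neighbors.append(cand)
```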

4. Performance, Cost Analyses, and Empirical Benchmarks

Distributed disk-based vector search systems demonstrate significant throughput and cost advantages at scale relative to both monolithic in-RAM systems and naive partitioned architectures.

Empirical Performance

  • Cosmos DB+DiskANN achieves <20 ms median query latency and Recall@10 ≈ 90.8% on a single partition of 10M 768-D vectors, with p95 ≈ 27 ms (Upreti et al., 9 May 2025).
  • BatANN sustains under 6 ms mean end-to-end latency for 100M to 1B points at Recall@10 = 0.95, with 10 servers providing 6.49× (100M) and up to 5.1× (1B) the throughput of scatter–gather baselines (Dang et al., 10 Dec 2025).
  • DistributedANN demonstrates throughput >100k QPS on 50B vectors across ~1,000 machines, with Recall@5 = 90.8% and median latency of 26 ms, outperforming clustered partitioning by 6× in throughput (Adams et al., 7 Sep 2025).
  • DSANN achieves near-linear QPS scale-out with compute nodes and 4–7× the QPS of single-node DiskANN baselines on billion-scale datasets stored in DFS, with p99.9 latency below 6 ms on SIFT (Yu et al., 20 Oct 2025).

Cost Models

The per-query cost model for DiskANN-based systems is $C_q = O(d \log N_p + \mathrm{I/O}_p)$, where $N_p$ is the number of vectors per partition, $d$ is the dimensionality, and $\mathrm{I/O}_p$ is the number of full-vector reads. Cost scales sublinearly with partition size and is dominated by SSD I/O for large $N$.
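
As a rough, illustrative instantiation of the model (the numbers are chosen for convenience and are not reported measurements): with $d = 768$ and $N_p = 10^7$, the distance term is on the order of $d \log_2 N_p \approx 768 \times 23 \approx 1.8 \times 10^4$ scalar operations, which a modern CPU executes in well under a millisecond, whereas 30–50 full-vector reads at roughly 100 µs per random 4 KB SSD access contribute 3–5 ms, so end-to-end latency is dominated by the $\mathrm{I/O}_p$ term.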

Cosmos DB demonstrates a per-million-query cost an order of magnitude lower than specialized vector database services (Zilliz, Pinecone) for comparable recall (Upreti et al., 9 May 2025).

5. Fault Tolerance, Resource Management, and Design Trade-Offs

Fault Tolerance and Availability

Distributed disk-based search exploits underlying storage replication and stateless compute to provide high availability. DSANN is robust to compute server loss, as compute nodes do not hold unique index shards and failover incurs no index reload (Yu et al., 20 Oct 2025). DistributedANN and BatANN similarly benefit from data and query statelessness, facilitating operational resilience (Adams et al., 7 Sep 2025, Dang et al., 10 Dec 2025). BatANN currently assumes a static, non-fault-tolerant index but plans to integrate RDMA-based or Derecho-style replication layers (Dang et al., 10 Dec 2025).

Resource Control and Elasticity

Cosmos DB’s resource-governing Request Units and automatic partitioning promote fine-grained cost control and multi-tenant sharding, building on existing cloud DB elasticity (Upreti et al., 9 May 2025). DSANN and BatANN both achieve near-linear QPS scaling with number of servers or compute nodes.

Design Trade-Offs

Deep DiskANN integration with the Bw-Tree offers a 10×–50× lower memory footprint than bespoke in-memory solutions at the expense of peak throughput, but with greater stability and lower infrastructure cost (Upreti et al., 9 May 2025). DistributedANN trades a moderate increase in latency (26 ms vs. 16 ms) and a larger disk+RAM footprint for higher throughput and simplified cluster management (no partition routing or replica tracking) (Adams et al., 7 Sep 2025). BatANN demonstrates, for the first time, $O(\log N)$ search complexity at near-linear distributed throughput on a single global graph, with only minimal per-query network cost using commodity TCP networking (Dang et al., 10 Dec 2025).

6. Open Problems and Future Directions

Several limitations and research frontiers remain:

  • Dynamic updates and online index construction: BatANN and DSANN currently lack support for online index mutations with consistency (Dang et al., 10 Dec 2025, Yu et al., 20 Oct 2025).
  • Disk and network layout: Optimizations such as sector reordering or dynamic beam-widths (cf. Starling, PipeANN) may further reduce I/O (Dang et al., 10 Dec 2025).
  • Caching and load adaptive strategies: Adaptive partition caching and hot-shard load balancing are proposed to handle skew and reduce tail latency (Yu et al., 20 Oct 2025).
  • Parameter tuning: Automatic trade-off exploration (e.g., for aggregation size $p$, beam width $L$) would enhance deployability (Yu et al., 20 Oct 2025).
  • Hardware accelerators and RDMA: While commodity TCP suffices for BatANN, RDMA-based approaches might provide incremental improvement under low-latency constraints (Dang et al., 10 Dec 2025).
  • Benchmarking against high-availability and geo-distributed settings: DSANN’s design is intended for deployment in variable-latency networks, a path for empirical exploration (Yu et al., 20 Oct 2025).

Advancing distributed disk-based vector search requires continued innovation in distributed index construction, asynchronous I/O overlap, network-aware beam traversal, and robust in-place mutation, to sustain fast, accurate, and scalable search over trillion-scale vector collections in heterogeneous, multi-tenant cloud environments.
