Disk-Based Graph Indices: Advances & Applications

Updated 24 August 2025
  • Disk-based graph indices are advanced storage systems that partition and organize massive graphs to minimize non-sequential I/O and support scalable analytics.
  • They employ techniques like bi-sharding, LSM-tree compaction, and pipelined processing to balance high update throughput with low query latency.
  • These indices power applications in social networks, bioinformatics, mapping, and AI vector search through innovative disk layouts and proximity graph designs.

Disk-based graph indices are specialized data structures and storage formats enabling efficient query and analytic processing of graphs that are far larger than available main memory. They are engineered to minimize costly non-sequential I/O, strike a balance between update throughput and query latency, and support both traditional graph algorithms and modern vector search at the terabyte and billion-edge scale. Approaches in this area range from analytical systems and general-purpose graph databases to specialized vector AI indices, drawing on principles from computational geometry, kernel methods, and storage system design for SSDs and distributed clusters.

1. Foundational Principles and Early Systems

The foundational challenge for disk-based graph indices is enabling locality and minimizing non-sequential disk I/O when accessing arbitrary subgraphs, neighbors, or attribute-rich edge lists. Early systems such as GraphChi and its successors moved beyond classical adjacency-list or edge-list formats by partitioning graphs so that only small subgraphs must reside in memory at a time.

  • Shard-based models: The BiShard Parallel Processor (BPP) (Najeebullah et al., 2014) extended GraphChi’s “parallel sliding windows” scheme by storing two copies of each edge (an in-shard and an out-shard per interval), eliminating race conditions in the update phase and reducing random I/O to exactly two non-sequential shard reads per interval. This allowed full CPU parallelism and, through bi-sharding, cut I/O overhead by up to half compared to GraphChi (a minimal sketch of the interval-processing loop follows this list).
  • Partitioned Adjacency Lists: The PAL structure underlying GraphChi-DB (Kyrola et al., 2014) partitions the vertex space and stores edges in a way that enables efficient in- and out-edge queries without duplication, leveraging compressed pointer arrays and per-partition linked lists together with an LSM-tree for insertions.
  • Buffered and pipelined analytics: BigSparse (Jun et al., 2017) further externalized processing by storing both edges and vertices on disk, orchestrating analytics via a pipelined architecture that logs updates and applies multi-level sort-reduce so that the bulk of disk traffic is sequential and append-only.
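
The bi-sharded layout can be illustrated with a minimal, purely in-memory Python sketch: every edge is stored both in the in-shard of its destination’s interval and in the out-shard of its source’s interval, so processing one interval touches exactly two shards. The interval count and the degree-counting update rule are illustrative assumptions, not BPP’s actual vertex program.

```python
# Minimal in-memory sketch of bi-sharded interval processing in the spirit of
# BPP: every edge is stored twice (in-shard of its destination's interval,
# out-shard of its source's interval), so processing an interval touches
# exactly two shards. Interval count and the update rule (degree counting)
# are illustrative assumptions.
from collections import defaultdict

NUM_INTERVALS = 4

def interval_of(v, num_vertices):
    """Map a vertex id to its interval (contiguous ranges of the id space)."""
    size = (num_vertices + NUM_INTERVALS - 1) // NUM_INTERVALS
    return v // size

def build_bi_shards(edges, num_vertices):
    in_shards = defaultdict(list)   # interval -> edges ending in that interval
    out_shards = defaultdict(list)  # interval -> edges starting in that interval
    for src, dst in edges:
        in_shards[interval_of(dst, num_vertices)].append((src, dst))
        out_shards[interval_of(src, num_vertices)].append((src, dst))
    return in_shards, out_shards

def process(edges, num_vertices):
    in_shards, out_shards = build_bi_shards(edges, num_vertices)
    in_deg, out_deg = [0] * num_vertices, [0] * num_vertices
    for itv in range(NUM_INTERVALS):
        # On disk this would be exactly two shard reads for this interval;
        # vertices inside the interval can then be updated in parallel because
        # no other interval's shards touch them.
        for src, dst in in_shards.get(itv, []):
            in_deg[dst] += 1
        for src, dst in out_shards.get(itv, []):
            out_deg[src] += 1
    return in_deg, out_deg

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(process(edges, num_vertices=4))
```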

These early solutions transformed the practicality of single-server, out-of-core analytics by supporting billion-edge graphs with modest RAM, leveraging disk locality, careful partitioning, and auxiliary indices for neighbor lookups.

2. Data Layouts, Index Structures, and Performance-Oriented Design

A core advancement in disk-based graph indices is the tailoring of disk layouts and index structures to optimize both analytical and transactional workloads:

Compaction- and Update-Friendly Storage:

  • LSMGraph (Yu et al., 10 Nov 2024) merges an LSM-tree architecture with a multi-level disk-resident CSR format. Updates are staged in a memory cache (MemGraph), flushed in bulk as a CSR at the lowest level, and merged upward in the style of LSM compaction. A multi-level per-vertex index enables efficient discovery of the relevant CSR blocks on disk. Vertex-level version control ensures snapshot-consistent query execution even during ongoing compactions.
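
A toy Python sketch of this flow is shown below: inserts land in an in-memory buffer, full buffers are flushed as immutable CSR-like runs, and reads merge the buffer with all flushed runs. The class names, flush threshold, and merge-on-read behavior are illustrative assumptions; level-wise compaction, per-vertex block indices, and version control are omitted.

```python
# Toy sketch of an LSM-style graph store: an in-memory delta ("MemGraph"),
# immutable CSR-like runs produced by flushes, and reads that merge both.
# The tiny flush threshold is an assumption chosen for the demo.
from collections import defaultdict

FLUSH_THRESHOLD = 4  # edges buffered before a flush

class CSRRun:
    """Immutable, sorted run standing in for one on-disk CSR level."""
    def __init__(self, adjacency):
        self.offsets, self.neighbors = {}, []
        for v in sorted(adjacency):
            self.offsets[v] = (len(self.neighbors), len(adjacency[v]))
            self.neighbors.extend(sorted(adjacency[v]))

    def neighbors_of(self, v):
        if v not in self.offsets:
            return []
        start, count = self.offsets[v]
        return self.neighbors[start:start + count]

class ToyLSMGraph:
    def __init__(self):
        self.memgraph = defaultdict(set)  # mutable in-memory delta
        self.buffered = 0
        self.runs = []                    # flushed runs, oldest first

    def add_edge(self, src, dst):
        self.memgraph[src].add(dst)
        self.buffered += 1
        if self.buffered >= FLUSH_THRESHOLD:
            self.runs.append(CSRRun(self.memgraph))
            self.memgraph, self.buffered = defaultdict(set), 0

    def neighbors(self, v):
        # A real system consults a per-vertex index to touch only the runs
        # containing v; here every run is merged for clarity.
        result = set(self.memgraph.get(v, ()))
        for run in self.runs:
            result.update(run.neighbors_of(v))
        return sorted(result)

if __name__ == "__main__":
    g = ToyLSMGraph()
    for e in [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3)]:
        g.add_edge(*e)
    print(g.neighbors(0))  # [1, 2, 3]
```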

Attribute- and Query-Aware Block Organization:

  • The railway layout (Soulé et al., 2014) adapts disk block storage by partitioning each block into attribute-specific sub-blocks, each replicating the graph structure. With partitioning optimized via integer programming or scalability-focused heuristics, this approach minimizes I/O for historical or attribute-centric queries on interaction graphs, at the expense of controlled storage overhead.
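
The toy sketch below illustrates the underlying trade-off: attributes that are queried together are grouped into the same sub-block, so a query touches fewer sub-blocks. The greedy co-occurrence grouping and the I/O proxy are assumed stand-ins for the paper’s integer-programming and heuristic partitioners.

```python
# Toy illustration of attribute sub-blocking: group attributes that co-occur
# in the query workload so each query reads fewer sub-blocks. The greedy
# grouping below is an assumption, not the railway layout's actual optimizer.
from itertools import combinations

def cooccurrence(workload):
    """Count how often attribute pairs appear together in workload queries."""
    counts = {}
    for attrs in workload:
        for a, b in combinations(sorted(attrs), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def greedy_partition(attributes, workload):
    """Pair up frequently co-occurring attributes; leftovers become singletons."""
    groups, placed = [], set()
    for (a, b), _ in sorted(cooccurrence(workload).items(), key=lambda kv: -kv[1]):
        if a not in placed and b not in placed:
            groups.append({a, b})
            placed.update((a, b))
    for a in attributes:
        if a not in placed:
            groups.append({a})
    return groups

def sub_blocks_read(query_attrs, groups):
    """I/O proxy: number of sub-blocks the query must touch."""
    return sum(1 for g in groups if g & set(query_attrs))

if __name__ == "__main__":
    attributes = ["ts", "weight", "label", "geo"]
    workload = [("ts", "weight"), ("ts", "weight"), ("label",), ("geo", "label")]
    groups = greedy_partition(attributes, workload)
    print(groups, sub_blocks_read(("ts", "weight"), groups))  # 1 sub-block read
```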

Compact and Expressive Indices:

  • FlashGraph (Zheng et al., 2014) features a compact in-memory index of vertex states and edge list offsets, with the full edge set on SSD. Adaptive horizontal and vertical partitioning, selective data loading, and I/O merging enable vertex-centric algorithms to approach in-memory performance, with only active vertices reading their edge lists per iteration.
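
A simplified sketch of this split follows: a compact in-memory index of per-vertex (offset, degree) pairs, edge lists serialized to a file standing in for the SSD, and reads issued only for the active vertices, with adjacent requests merged into one read. The on-disk format and merging rule are assumptions for illustration, not FlashGraph’s actual layout.

```python
# Sketch of FlashGraph's split: compact in-memory (offset, degree) index,
# full edge lists on "SSD" (a temporary file), selective reads for active
# vertices, and coalescing of adjacent read requests.
import struct
import tempfile

def write_edge_lists(adjacency, path):
    """Serialize edge lists; return the in-memory index: vertex -> (offset, degree)."""
    index = {}
    with open(path, "wb") as f:
        for v in sorted(adjacency):
            neighbors = adjacency[v]
            index[v] = (f.tell(), len(neighbors))
            f.write(struct.pack(f"<{len(neighbors)}I", *neighbors))
    return index

def read_active(path, index, active):
    """Read edge lists of the active vertices, merging adjacent byte ranges."""
    merged = []                                   # coalesced (offset, length) reads
    for off, deg in sorted(index[v] for v in active):
        length = deg * 4
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    with open(path, "rb") as f:
        blobs = []
        for off, length in merged:
            f.seek(off)
            blobs.append((off, f.read(length)))
    results = {}
    for v in active:
        off, deg = index[v]
        for base, blob in blobs:
            if base <= off < base + len(blob):
                start = off - base
                results[v] = list(struct.unpack(f"<{deg}I", blob[start:start + deg * 4]))
    return results

if __name__ == "__main__":
    adjacency = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
    with tempfile.NamedTemporaryFile(delete=False, suffix=".edges") as tmp:
        path = tmp.name
    idx = write_edge_lists(adjacency, path)
    print(read_active(path, idx, active=[0, 2]))  # {0: [1, 2], 2: [0, 3]}
```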

Spatial and Windowed Graph Indexing:

  • For visual exploration of giant RDF graphs, graphVizdb (Bikakis et al., 2015) builds a disk-resident R-tree index over node and edge coordinates derived from an offline layout. User window queries translate spatial viewports directly into range scans over the R-tree, allowing scalable interactive visualization independent of graph size (a toy version of this viewport-to-range-scan mapping follows this list).
  • Window analytics on dynamic graphs (Fan et al., 2015) benefit from computation-sharing index structures such as the Dense Block Index (DBIndex) for overlapping k-hop neighborhoods and the Inheritance Index (I-Index) for DAGs, achieving query speedups of several orders of magnitude over naive enumeration.
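
The toy version below maps a viewport rectangle to a range query over node coordinates produced by an offline layout; a uniform grid stands in for graphVizdb’s disk-resident R-tree, and the cell size and node set are illustrative assumptions.

```python
# Toy viewport-to-range-scan mapping: nodes have fixed layout coordinates,
# a uniform grid (standing in for the R-tree) indexes them, and a user's
# window becomes a range query over grid cells.
from collections import defaultdict

CELL = 10.0  # grid cell size (assumed)

def build_grid(node_coords):
    grid = defaultdict(list)  # (cell_x, cell_y) -> [node ids]
    for node, (x, y) in node_coords.items():
        grid[(int(x // CELL), int(y // CELL))].append(node)
    return grid

def window_query(grid, node_coords, xmin, ymin, xmax, ymax):
    """Return nodes whose coordinates fall inside the viewport rectangle."""
    hits = []
    for cx in range(int(xmin // CELL), int(xmax // CELL) + 1):
        for cy in range(int(ymin // CELL), int(ymax // CELL) + 1):
            for node in grid.get((cx, cy), ()):
                x, y = node_coords[node]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(node)
    return hits

if __name__ == "__main__":
    coords = {"a": (3, 4), "b": (25, 14), "c": (11, 9), "d": (70, 70)}
    grid = build_grid(coords)
    print(window_query(grid, coords, 0, 0, 30, 20))  # ['a', 'c', 'b']
```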

3. Advances in Vector Similarity Search and Proximity Graph Indices

Recent progress centers on disk-based proximity graphs as indices for high-dimensional vector search, with implications for AI and recommendation systems where both search and updates must be supported at scale.

Block-Level Locality and Navigation Graphs:

  • Starling (Wang et al., 4 Jan 2024) introduces a two-layer approach: a small in-memory navigation graph provides efficient disk entry points; disk-resident graphs are block-shuffled to maximize the overlap between a vertex’s neighbors and its block (quantified via the overlap ratio), so a single I/O delivers many relevant candidates. This design, combined with a “block search” algorithm and pipelined computation, achieves sub-millisecond latency and >40× throughput improvements compared to prior disk-based indices on 33M-vector datasets.
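
The schematic below captures the block-search pattern: each simulated disk read fetches an entire block and scores every vertex co-located in it, so one I/O yields many candidates. The toy graph, block assignment, and the use of exact distances for frontier ordering (a real system would rank the frontier with in-memory compressed vectors) are assumptions of this sketch.

```python
# Schematic of Starling-style block search: an entry point from an in-memory
# navigation layer, then best-first expansion where each "disk read" loads a
# whole block and scores all vertices stored in it.
import heapq
import numpy as np

def block_search(query, vectors, graph, block_of, blocks, entry, k=3, budget=4):
    """Best-first search over blocks; `budget` caps the number of block reads."""
    dist = lambda v: float(np.linalg.norm(vectors[v] - query))
    visited_blocks, seen = set(), set()
    frontier = [(dist(entry), entry)]          # min-heap of candidate vertices
    results, io_count = [], 0
    while frontier and io_count < budget:
        _, v = heapq.heappop(frontier)
        blk = block_of[v]
        if blk in visited_blocks:
            continue
        visited_blocks.add(blk)
        io_count += 1                          # one block read
        for u in blocks[blk]:                  # score everything in the block
            if u in seen:
                continue
            seen.add(u)
            results.append((dist(u), u))
            for w in graph[u]:                 # queue neighbors for later reads
                if w not in seen:
                    heapq.heappush(frontier, (dist(w), w))
    return [v for _, v in sorted(results)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(8, 4))
    graph = {i: [(i + 1) % 8, (i + 3) % 8] for i in range(8)}
    block_of = {i: i // 2 for i in range(8)}   # two vectors per block
    blocks = {b: [v for v in range(8) if block_of[v] == b] for b in range(4)}
    print(block_search(vectors[5] + 0.01, vectors, graph, block_of, blocks, entry=0))
```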

In-Place Updatable Graphs for Streaming ANNS:

  • IP-DiskANN (Xu et al., 19 Feb 2025) addresses the classic deletion bottleneck in singly-linked proximity graphs by leveraging GreedySearch to approximate in-neighbors on deletions, then restoring graph connectivity with a small number of replacement edges per updated neighbor. This achieves streaming index maintenance without heavy, periodic batch consolidations, yielding stable recall under high update rates.
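
A simplified sketch of the deletion path is given below: a greedy search approximates the deleted point’s in-neighbors, and each of them is patched with a few replacement edges drawn from the deleted point’s out-neighbors. The search routine, replacement rule, and parameters are toy assumptions, not IP-DiskANN’s exact algorithm.

```python
# Simplified in-place deletion: approximate in-neighbors of the deleted point
# via greedy search, then patch each with replacement edges taken from the
# deleted point's out-neighbors.
import numpy as np

def greedy_search(q, vectors, graph, entry, alive, ef=8):
    """Greedy expansion returning up to `ef` alive vertices, closest first."""
    dist = lambda v: float(np.linalg.norm(vectors[v] - q))
    visited, frontier = set(), [entry]
    while frontier and len(visited) < ef:
        v = min(frontier, key=dist)
        frontier.remove(v)
        if v in visited:
            continue
        visited.add(v)
        frontier.extend(u for u in graph[v] if u not in visited and u in alive)
    return sorted(visited, key=dist)

def delete_in_place(p, vectors, graph, entry, alive, max_replacements=2):
    alive.discard(p)
    # Alive points near p are the likely in-neighbors; patch each of them.
    for v in greedy_search(vectors[p], vectors, graph, entry, alive):
        if p in graph[v]:
            graph[v].remove(p)
            for u in graph[p][:max_replacements]:   # replacement edges
                if u != v and u in alive and u not in graph[v]:
                    graph[v].append(u)
    graph[p] = []

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    vectors = rng.normal(size=(6, 3))
    graph = {i: [(i + 1) % 6, (i + 2) % 6] for i in range(6)}
    alive = set(range(6))
    delete_in_place(3, vectors, graph, entry=0, alive=alive)
    print(graph)  # vertex 3 is gone; its former in-neighbors are re-wired
```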

Kernel-Based Graph Construction and Formal Guarantees:

  • The Support Vector Graph (SVG) framework (Tepper et al., 25 Jun 2025) formulates neighbor selection as a kernel regression (equivalently, SVM) problem, enabling the construction of graph indices with provable navigability in both metric and non-metric spaces. The SVG-L0 extension ensures a bounded out-degree via an explicit ℓ0 sparsity constraint, replacing heuristic pruning with nonnegative regression and subspace pursuit, and eliminating the reliance on candidate pools.
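
The hedged sketch below illustrates the core formulation: out-neighbors of x_i are chosen by a nonnegative kernel regression that approximates k(x_i, ·) with a sparse combination of kernels centered at candidate points, and the support of the solution becomes the edge set. The Gaussian kernel, ridge term, and the final top-m truncation (standing in for SVG-L0’s explicit ℓ0 constraint and subspace pursuit) are assumptions of this sketch.

```python
# Sketch of kernel-regression neighbor selection: minimize, over s >= 0,
# ||phi(x_i) - sum_j s_j phi(x_j)||^2, which (up to a constant) equals
# ||L^T s - b||^2 with K = L L^T and L b = c, where c_j = k(x_j, x_i).
# Nonzero coefficients define the out-edges of x_i.
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def gaussian_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svg_neighbors(i, X, max_out_degree=3, gamma=1.0, ridge=1e-6):
    candidates = [j for j in range(len(X)) if j != i]
    Xc = X[candidates]
    K = gaussian_kernel(Xc, Xc, gamma) + ridge * np.eye(len(candidates))
    c = gaussian_kernel(Xc, X[i:i + 1], gamma).ravel()
    L = cholesky(K, lower=True)
    b = solve_triangular(L, c, lower=True)
    s, _ = nnls(L.T, b)                       # nonnegative least squares
    # Keep the largest coefficients to bound the out-degree (ell_0 truncation).
    order = np.argsort(-s)
    return [candidates[j] for j in order[:max_out_degree] if s[j] > 0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    print({i: svg_neighbors(i, X) for i in range(3)})
```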

Predicate-Agnostic kNN in Graph DBMSs:

  • NaviX (Sehgal et al., 29 Jun 2025) integrates HNSW-based indices natively into graph DBMSs, allowing efficient kNN over arbitrary predicate-selected subsets (“predicate-agnostic” queries). Query execution is prefiltered—exploring only nodes in a pre-identified selection—using adaptive per-node local selectivity to choose among one-hop, two-hop, or directed exploration heuristics. Disk-resident storage leverages existing DBMS structures (e.g., CSR).
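
The simplified sketch below shows the prefiltered pattern: the predicate is evaluated into a selected set up front, and each explored node adaptively chooses between one-hop exploration (when enough of its neighbors pass the filter) and two-hop expansion (when few do). The toy graph, selectivity threshold, and exploration budget are illustrative assumptions, not NaviX’s exact heuristics.

```python
# Prefiltered kNN over a proximity graph: only nodes in `selected` may enter
# the result set, and per-node selectivity decides one-hop vs. two-hop
# expansion.
import heapq
import numpy as np

def filtered_knn(query, vectors, graph, selected, entry, k=3, ef=8, threshold=0.5):
    dist = lambda v: float(np.linalg.norm(vectors[v] - query))
    visited = {entry}
    frontier = [(dist(entry), entry)]
    results = [(dist(entry), entry)] if entry in selected else []
    while frontier and len(visited) < ef:
        _, v = heapq.heappop(frontier)
        neighbors = graph[v]
        passing = [u for u in neighbors if u in selected]
        if len(passing) >= threshold * len(neighbors):
            candidates = passing                               # one-hop
        else:                                                  # two-hop
            candidates = [w for u in neighbors for w in graph[u] if w in selected]
        for u in candidates:
            if u in visited:
                continue
            visited.add(u)
            d = dist(u)
            heapq.heappush(frontier, (d, u))
            results.append((d, u))
    return [u for _, u in sorted(results)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    vectors = rng.normal(size=(12, 4))
    graph = {i: [(i + 1) % 12, (i + 5) % 12, (i + 7) % 12] for i in range(12)}
    selected = {v for v in range(12) if v % 2 == 0}  # predicate: even ids only
    print(filtered_knn(vectors[6], vectors, graph, selected, entry=0))
```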

4. Distributed and Semi-External Scalability

Scaling disk-based graph indices across machines and exploiting SSD arrays raise additional engineering and algorithmic challenges:

  • Fully Out-of-Core Distributed Engines: DFOGraph (Yu et al., 2021) orchestrates two-level partitioning (across nodes, and into chunks within each node). Each edge chunk is adaptively stored as either CSR or doubly-compressed CSR (DCSR), and pipelined message dispatch optimizes both disk and network I/O. The push-only model, overlapping of computation and communication, and careful batch sizing produce superlinear speedups and scale to billion-edge graphs (the adaptive CSR/DCSR choice is sketched after this list).
  • Semi-External Memory Libraries: Graphyti (Mhembere et al., 2019) embodies SEM principles—edges on disk, vertices in RAM—supplemented by push-based computation, messaging discipline (e.g., hybrid multicast/point-to-point), and algorithmic pruning to minimize superfluous I/O. Algorithms must be I/O-aware and often require tailored logic for efficient disk accesses.
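
The small sketch below illustrates the adaptive per-chunk choice between CSR and DCSR: sparse chunks, where most vertices have no local edges, store offsets only for the non-empty rows. The size model (array lengths as a byte proxy) and the example chunks are illustrative assumptions.

```python
# Adaptive per-chunk storage: build both CSR and doubly-compressed CSR (DCSR)
# and keep whichever representation is smaller for the chunk.
def to_csr(num_vertices, edges):
    """CSR: offsets of length num_vertices + 1, plus column indices."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    cols, cursor = [0] * len(edges), offsets[:-1].copy()
    for src, dst in sorted(edges):
        cols[cursor[src]] = dst
        cursor[src] += 1
    return {"format": "CSR", "offsets": offsets, "cols": cols}

def to_dcsr(edges):
    """DCSR: row ids and offsets only for vertices that actually have edges."""
    rows, offsets, cols = [], [0], []
    for src, dst in sorted(edges):
        if not rows or rows[-1] != src:
            rows.append(src)
            offsets.append(len(cols))
        cols.append(dst)
        offsets[-1] = len(cols)
    return {"format": "DCSR", "rows": rows, "offsets": offsets, "cols": cols}

def store_chunk(num_vertices, edges):
    csr, dcsr = to_csr(num_vertices, edges), to_dcsr(edges)
    size = lambda rep: sum(len(v) for key, v in rep.items() if key != "format")
    return csr if size(csr) <= size(dcsr) else dcsr

if __name__ == "__main__":
    dense = [(v, (v + 1) % 8) for v in range(8)]
    sparse = [(0, 3), (7, 1)]
    print(store_chunk(8, dense)["format"], store_chunk(8, sparse)["format"])  # CSR DCSR
```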

5. Applications, Use Cases, and Empirical Results

Disk-based graph indices underlie analytics and retrieval in web-scale social networks, e-commerce graphs, bioinformatics, mapping, knowledge bases, and vector AI. Empirical results consistently show:

| System | Highest Scale Demonstrated | Typical Latency/Throughput | Key Strength |
|---|---|---|---|
| GraphChi-DB | 1.5B edges (commodity PC) | 250K insert/sec, fast queries | Efficient bulk loading and online queries |
| FlashGraph | 129B edges, 3.4B nodes | 80% of in-memory speed (PR, BFS) | Selective access, I/O merging, parallelism |
| Starling | 33M vectors (per segment) | <1 ms, 44× throughput improvement | Block overlap, navigation index, pipeline |
| IP-DiskANN | ~50M vectors | High QPS, stable recall | In-place, disruption-free streaming updates |
| LSMGraph | 10⁷–10⁹ edges | 2.85–36× update/scan speedup | Multilevel index, snapshot-isolated queries |

Empirical evaluations compare against specialized and general-purpose systems, highlighting that disk-based indices can routinely achieve order-of-magnitude improvements over conventional adjacency lists, edge lists, or non-locality-aware block storage—provided the index structures exploit access patterns and hardware characteristics.

6. Limitations, Challenges, and Future Directions

Despite significant advances, several challenges and open questions persist:

  • Update Amplification and Versioning: Frequent updates can induce significant write amplification (especially for strictly CSR-based layouts), necessitating hybrid schemes (e.g., LSMGraph) and per-vertex snapshot mechanisms to mitigate concurrent-update anomalies.
  • Index Size and Memory Footprint: Particularly in vector search, ensuring that navigation layers or compact indices fit within constrained RAM remains key. Block shuffling and quantization are two practical mitigation strategies.
  • Query Flexibility vs. Indexing Overhead: Predicate-agnostic and window queries place stress on index structures, especially when local partitioning or spatial layouts do not match evolving workloads.
  • Generality of Navigability Guarantees: Euclidean proximity underpins most geometric graph indices, but new methods (kernel approaches, SVG) extend rigorous search guarantees to general metric and non-metric settings.
  • Distributed Coordination: Systems like DFOGraph and Kinetica-Graph (Karamete et al., 2022) highlight that minimizing cross-partition communication and enabling efficient rebalancing are crucial for large-scale, multi-node deployments.

A plausible implication is that future iterations of disk-based graph indices will increasingly integrate adaptive data layouts, hybrid memory/disk architectures, machine-learned edge selection, and workload-aware partitioning to further reduce I/O, improve update agility, and enable richer analytical and vector search workloads.

7. Conclusion

Disk-based graph indices constitute a diverse, evolving toolkit enabling scalable, efficient, and flexible handling of massive graphs. Their trajectory—from interval-sharded and block-oriented storage through LSM/CSR hybrids, external-memory proximity graphs, and kernel-driven index construction—demonstrates the field’s responsiveness to both hardware shifts (SSDs, distributed storage) and workload complexity (dynamic updates, filtered queries, vector search). Continued cross-pollination among database design, external-memory algorithms, and vector search is expected to drive further advances, especially in areas requiring robust performance guarantees and adaptation to non-Euclidean similarity spaces.