Manifold-Consistent Graph Indexing (MCGI)
- The paper introduces a geometry-aware indexing framework that corrects the Euclidean–Geodesic mismatch in graph-based ANN search.
- MCGI employs Local Intrinsic Dimensionality to adaptively modulate pruning and beam width, improving recall and reducing query latency.
- Empirical results on datasets like GIST1M and SIFT1B show significant throughput gains and lower disk I/O compared to traditional methods.
Manifold-Consistent Graph Indexing (MCGI) is a geometry-aware, disk-resident indexing framework designed for scalable Approximate Nearest Neighbor (ANN) search in high-dimensional vector spaces. The core innovation of MCGI is its explicit correction of the “Euclidean–Geodesic mismatch” that arises when traditional graph-based ANN methods use Euclidean distances to approximate geodesic paths on non-Euclidean manifolds. MCGI leverages Local Intrinsic Dimensionality (LID) to adaptively modulate both pruning during index construction and beam width during querying, improving theoretical guarantees and empirical efficiency, particularly in billion-scale, high-dimensional vector search scenarios (Zhao, 5 Jan 2026).
1. Motivation and Euclidean–Geodesic Mismatch
Graph-based ANN indices such as DiskANN, HNSW, and NSG create sparse proximity graphs using Euclidean edge weights. In low or moderate dimensions, greedy or beam search over such graphs efficiently finds approximate nearest neighbors. However, high-dimensional datasets—such as GIST1M (960-D) or SIFT1B (128-D)—tend to lie on low-dimensional, curved manifolds embedded in an ambient high-D space. On such manifolds, Euclidean straight lines diverge from the manifold’s geodesics, leading searches to dead ends, frequent backtracking, and excessive random I/O, significantly degrading query throughput. This breakdown—termed the “Euclidean–Geodesic mismatch”—is particularly acute in regions of high local geometric complexity.
2. Local Intrinsic Dimensionality as a Geometric Signal
MCGI employs Local Intrinsic Dimensionality (LID) as a pointwise, data-driven measure of local manifold complexity. For a reference point $x$, let $F(r)$ denote the cumulative distribution function of distances from $x$ to random points in the dataset. The LID at $x$ is then defined by:

$$\mathrm{LID}(x) = \lim_{r \to 0^{+}} \frac{r\,F'(r)}{F(r)}.$$

When $F(r) \propto r^{d}$ locally, $\mathrm{LID}(x) = d$. For practical datasets, the Levina–Bickel maximum likelihood estimator is used:

$$\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^{k} \ln \frac{r_i(x)}{r_k(x)}\right)^{-1},$$

where $r_1(x) \le \dots \le r_k(x)$ are the distances from $x$ to its $k$ nearest neighbors. LID estimation enables MCGI to quantify, and spatially adapt to, the varying geometric complexity across the dataset.
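The Levina–Bickel estimator can be sketched directly; this is a minimal version assuming sorted, strictly positive neighbor distances:

```python
import numpy as np

def lid_mle(dists):
    """Levina-Bickel maximum-likelihood LID estimate from the sorted,
    strictly positive distances to a point's k nearest neighbors."""
    r_k = dists[-1]
    # LID(x) = -( (1/k) * sum_i ln(r_i / r_k) )^{-1}
    return -1.0 / float(np.mean(np.log(dists / r_k)))
```

On distances drawn from $F(r) \propto r^d$ the estimate concentrates near $d$; for example, neighbor radii growing like $\sqrt{i}$ mimic a locally two-dimensional sample and yield an estimate near 2.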
3. Adaptive Index Construction and Querying
Index Build
The MCGI index construction extends the DiskANN two-stage pipeline:
- Geometric Calibration: Estimate $\widehat{\mathrm{LID}}(x_i)$ for each node $x_i$, then compute the global mean $\mu$ and standard deviation $\sigma$ of the estimates. Assign each node a pruning parameter $\alpha_i$ using a Z-score and logistic mapping:

$$z_i = \frac{\widehat{\mathrm{LID}}(x_i) - \mu}{\sigma}, \qquad \alpha_i = \alpha_{\min} + \frac{\alpha_{\max} - \alpha_{\min}}{1 + e^{z_i}},$$

with bounds $\alpha_{\min}$ and $\alpha_{\max}$; $\alpha_i$ is monotonically decreasing in LID and clamped to $[\alpha_{\min}, \alpha_{\max}]$ for robustness.
- Manifold-Consistent Refinement: Starting from a random $R$-regular graph, iteratively rewire each node $u$ by performing a beam search of width $W$, collecting the visited candidates, sorting them by distance to $u$, and applying a node-specific adaptive occlusion test: keep edge $(u, v)$ only if, for all already-selected neighbors $w$,

$$d(u, v) < \alpha_u \cdot d(w, v).$$

This mirrors DiskANN's Vamana refinement, but applies per-node pruning thresholds $\alpha_u$ in place of a single global $\alpha$.
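The two build stages above can be sketched as follows. This is a minimal in-memory illustration, not the disk-resident implementation; the bounds `alpha_min=1.0` and `alpha_max=1.4` are illustrative placeholders, and the logistic mapping is one form consistent with the description (Z-scored, decreasing in LID, clamped):

```python
import numpy as np

def pruning_params(lid, alpha_min=1.0, alpha_max=1.4):
    """Geometric calibration: Z-score the per-node LID estimates and map
    them through a logistic so alpha_i decreases monotonically with LID
    and stays clamped to [alpha_min, alpha_max]."""
    z = (lid - lid.mean()) / (lid.std() + 1e-12)
    return alpha_min + (alpha_max - alpha_min) / (1.0 + np.exp(z))

def adaptive_prune(u, candidates, alpha, X, R):
    """Adaptive occlusion test: scan candidates in ascending distance
    from node u; keep v only if d(u, v) < alpha[u] * d(w, v) for every
    already-kept neighbor w, up to the degree bound R."""
    order = sorted(candidates, key=lambda v: np.linalg.norm(X[u] - X[v]))
    kept = []
    for v in order:
        d_uv = np.linalg.norm(X[u] - X[v])
        if all(d_uv < alpha[u] * np.linalg.norm(X[w] - X[v]) for w in kept):
            kept.append(v)
            if len(kept) == R:
                break
    return kept
```

Under the decreasing mapping, high-LID nodes prune more aggressively, discarding the long Euclidean shortcut edges that tend to leave the manifold, while low-LID nodes retain denser neighborhoods.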
Querying
At query time, MCGI first estimates $\widehat{\mathrm{LID}}(q)$ for the query vector $q$ using its (approximate) nearest neighbors, then dynamically sets the beam width:

$$W(q) = \left\lceil \lambda \cdot \widehat{\mathrm{LID}}(q) \cdot \ln(1/\delta) \right\rceil,$$

where $\delta$ is the target failure probability and the constant $\lambda$ is empirically estimated. Standard beam search is then run with width $W(q)$, retrieving graph edges via direct asynchronous I/O and prefetching.
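A sketch of a query-adaptive beam-width rule of this shape, growing with the query's estimated LID and with $\ln(1/\delta)$; the constants `lam`, `w_min`, and `w_max` are illustrative placeholders, not values from the paper:

```python
import math

def beam_width(lid_q, delta=0.05, lam=4.0, w_min=16, w_max=512):
    """Query-adaptive beam width: grows linearly with the query's
    estimated LID and logarithmically with 1/delta (target failure
    probability), then clamps to a practical operating range."""
    w = math.ceil(lam * lid_q * math.log(1.0 / delta))
    return max(w_min, min(w_max, w))
```

Clamping keeps simple queries from paying for geometry they do not have, while complex queries automatically widen the beam.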
4. Theoretical Guarantees and Complexity
Guarantee of Connectivity
MCGI's pruning ensures that each induced edge-lune satisfies the criterion of the Relative Neighborhood Graph (RNG), so the pruned graph $G_{\mathrm{MCGI}}$ contains the RNG, which is itself a supergraph of the Euclidean Minimum Spanning Tree (EMST):

$$\mathrm{EMST} \subseteq \mathrm{RNG} \subseteq G_{\mathrm{MCGI}}.$$

This inclusion chain guarantees that $G_{\mathrm{MCGI}}$ is connected, preserving the ability to reach any target node given a sufficiently large beam.
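The RNG lune criterion that anchors this inclusion can be checked on toy data; below is a brute-force $O(n^3)$ construction, a hypothetical helper for illustration only:

```python
import numpy as np

def rng_edges(X):
    """Relative Neighborhood Graph: edge (u, v) survives iff no third
    point w lies in the lune, i.e. is strictly closer to both u and v
    than u and v are to each other."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            occluded = any(max(D[u, w], D[v, w]) < D[u, v]
                           for w in range(n) if w != u and w != v)
            if not occluded:
                edges.add((u, v))
    return edges
```

For collinear points the RNG collapses to the path graph, which coincides with the EMST: a connected spanning structure, as the inclusion requires.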
Search Reliability
By selecting according to local LID, MCGI ensures a uniform recall guarantee across the manifold, regardless of local complexity.
Computational Complexity
- Index Build: LID calibration requires a single approximate $k$-NN pass over all $n$ points; refinement performs a fixed number of beam-search-and-prune passes per node, so build cost retains DiskANN's scaling in the graph degree $R$.
- Space: Total storage is $O(nR)$ for edge lists plus $O(n)$ for the per-node LID values.
- Query: Per-query LID estimation is $O(k)$ once the initial neighbors are retrieved; beam search performs a number of distance calculations proportional to the beam width times the degree $R$ per hop; disk I/O is dominated by accesses to newly expanded nodes and is further reduced by caching and prefetching.
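A back-of-envelope check of the space bound; the degree `R=64` and 4-byte identifiers are illustrative assumptions, not figures from the paper:

```python
def index_bytes(n, R, id_bytes=4, lid_bytes=4):
    """O(nR) neighbor identifiers plus O(n) per-node LID scalars
    (the raw vectors themselves are excluded)."""
    return n * R * id_bytes + n * lid_bytes

# At billion scale: 10^9 * 64 * 4 B = 256 GB of neighbor ids versus
# 10^9 * 4 B = 4 GB of LID values -- the LID sidecar is a ~1.6% overhead.
```

This is why the document can claim a storage footprint essentially equivalent to DiskANN's: the only addition is one scalar per node.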
5. Empirical Performance and Comparative Analysis
Experiments on a Xeon Platinum 8380 (80 cores) with 256 GiB RAM and a 480 GB NVMe SSD show:
- High-Dimensional Regimes: On GIST1M (960-D), MCGI achieves 5.8× higher QPS at 95% recall (DiskANN: 64.7 QPS; MCGI: 375 QPS) and 55% higher throughput at 97% recall.
- Billion-Scale Search: On SIFT1B ($10^9$ points, 128-D), MCGI achieves 3× lower mean query latency (DiskANN: 49.06 ms; MCGI: 16.20 ms) and 1.32× higher QPS at 90% recall.
- Low-Dimensional Performance: On SIFT1M (128-D) and GloVe-100, MCGI matches DiskANN’s QPS at 98% recall, confirming minimal overhead on simple manifolds.
- Resource Sensitivity: MCGI's geometry-aware pruning yields recall-versus-beam-width curves nearly identical to DiskANN's, with up to a 2× reduction in 99th-percentile tail latency at high recall.
- I/O Efficiency: In high-dimensional scenarios, MCGI reduces total random disk I/O by 30–60% relative to baselines.
6. Comparison to Prior Approaches
Unlike DiskANN's static pruning and beam settings, MCGI introduces node-specific pruning thresholds and a query-adaptive beam width, removing the need for global hyperparameter tuning and achieving uniform performance across heterogeneous local geometries. Compared to SPANN's IVF-centroid routing, MCGI maintains strong graph connectivity and high-recall efficiency. Its storage footprint is equivalent to DiskANN's, but with lower I/O in high-dimensional settings.
7. Practical Considerations, Limitations, and Prospects
MCGI’s effectiveness relies on the validity of the manifold hypothesis and robust LID estimation. In regions of sparse or noisy neighbors, the maximum likelihood LID may misestimate, resulting in suboptimal pruning or routing; this is mitigated with Z-scoring and a logistic “clamp.” The geometric calibration adds one-time index build cost and substantial scratch memory overhead for billion-scale datasets (e.g., 200 GB RAM for SIFT1B), though amortized across many queries.
Future directions include dynamic or streaming datasets with incremental LID recalibration, exploring alternate local geometric statistics beyond LID, and generalization to non-Euclidean or learned metric spaces, such as manifolds induced by deep embedding methods (Zhao, 5 Jan 2026).