
Manifold-Consistent Graph Indexing (MCGI)

Updated 12 January 2026
  • The paper introduces a geometry-aware indexing framework that corrects the Euclidean–Geodesic mismatch in graph-based ANN search.
  • MCGI employs Local Intrinsic Dimensionality to adaptively modulate pruning and beam width, improving recall and reducing query latency.
  • Empirical results on datasets like GIST1M and SIFT1B show significant throughput gains and lower disk I/O compared to traditional methods.

Manifold-Consistent Graph Indexing (MCGI) is a geometry-aware, disk-resident indexing framework designed for scalable Approximate Nearest Neighbor (ANN) search in high-dimensional vector spaces. The core innovation of MCGI is its explicit correction of the “Euclidean–Geodesic mismatch” that arises when traditional graph-based ANN methods use Euclidean distances to approximate geodesic paths on non-Euclidean manifolds. MCGI leverages Local Intrinsic Dimensionality (LID) to adaptively modulate both pruning during index construction and beam width during querying, improving theoretical guarantees and empirical efficiency, particularly in billion-scale, high-dimensional vector search scenarios (Zhao, 5 Jan 2026).

1. Motivation and Euclidean–Geodesic Mismatch

Graph-based ANN indices such as DiskANN, HNSW, and NSG build sparse proximity graphs using Euclidean edge weights. In low or moderate dimensions, greedy or beam search over such graphs efficiently finds approximate nearest neighbors. However, high-dimensional datasets such as GIST1M (960-D) or SIFT1B (128-D) tend to lie on low-dimensional, curved manifolds embedded in a high-dimensional ambient space. On such manifolds, Euclidean straight lines diverge from the manifold's geodesics, leading searches into dead ends, frequent backtracking, and excessive random I/O that significantly degrade query throughput. This breakdown, termed the "Euclidean–Geodesic mismatch," is particularly acute in regions of high local geometric complexity.

2. Local Intrinsic Dimensionality as a Geometric Signal

MCGI employs Local Intrinsic Dimensionality (LID) as a pointwise, data-driven measure of local manifold complexity. For a reference point $x$, if the cumulative distribution function $F_x(r)$ of distances to random points in the dataset is available, then the LID at $x$ is defined by:

\mathrm{LID}(x) \triangleq \lim_{r\to 0}\frac{r\,F_x'(r)}{F_x(r)} = \lim_{r\to 0} \frac{d\ln F_x(r)}{d\ln r}

When $F_x(r) \approx C r^d$ locally, $\mathrm{LID}(x) \approx d$. For practical datasets, the Levina–Bickel maximum likelihood estimator is used:

\widehat{\mathrm{LID}}(x) = -\left( \frac{1}{k} \sum_{i=1}^{k} \ln \frac{r_i}{r_k} \right)^{-1}

where $r_1 \le \cdots \le r_k$ are the $k$ nearest-neighbor distances from $x$, typically with $k \in [32, 100]$. LID estimation enables MCGI to quantify, and spatially adapt to, varying geometric complexity across the dataset.
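
The Levina–Bickel estimator above is straightforward to implement. A minimal NumPy sketch (the function name and the synthetic sanity check are illustrative, not from the paper):

```python
import numpy as np

def lid_mle(dists):
    """Levina-Bickel MLE of Local Intrinsic Dimensionality.

    dists: the k nearest-neighbor distances r_1, ..., r_k from a point x
    (all strictly positive).  Implements
        LID^(x) = -((1/k) * sum_i ln(r_i / r_k))^(-1).
    """
    r = np.sort(np.asarray(dists, dtype=float))
    # The i = k term, ln(r_k / r_k) = 0, contributes nothing to the sum.
    s = np.log(r / r[-1]).sum() / len(r)
    return -1.0 / s

# Sanity check: distances of points drawn uniformly from a d-dimensional
# ball have CDF F(r) = r^d, so the estimate should land near d.
rng = np.random.default_rng(0)
radii = rng.random(100) ** (1.0 / 5.0)   # radii uniform in a 5-D unit ball
est = lid_mle(radii)                     # expect a value near 5
```

In an index build, `dists` would come from an approximate $k$-NN query against the dataset rather than from synthetic radii.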

3. Adaptive Index Construction and Querying

Index Build

The MCGI index construction extends the DiskANN two-stage pipeline:

  • Geometric Calibration: Estimate $\widehat{\mathrm{LID}}(u)$ for each node $u$, then compute the global mean $\mu$ and standard deviation $\sigma$. Assign each node a pruning parameter using a Z-score and logistic mapping:

\alpha(u) = \Phi(\widehat{\mathrm{LID}}(u)) = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) \bigl[1 + \exp(z(u))\bigr]^{-1}, \quad z(u) = \frac{\widehat{\mathrm{LID}}(u) - \mu}{\sigma}

with $\alpha_{\min} = 1.0$ and $\alpha_{\max} = 1.5$; $\alpha(u)$ is monotonically decreasing in LID and clamped for robustness.
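
As a concrete illustration of this calibration step, a small sketch follows; the parameter defaults mirror the values quoted above, while the clip bound on $z$ is an assumed numerical safeguard:

```python
import numpy as np

def pruning_alpha(lid, mu, sigma, alpha_min=1.0, alpha_max=1.5):
    """Z-score + logistic mapping from a node's LID estimate to alpha(u).

    alpha(u) = alpha_min + (alpha_max - alpha_min) / (1 + exp(z)),
    with z = (LID(u) - mu) / sigma.  Monotonically decreasing in LID:
    nodes in geometrically complex (high-LID) regions get alpha near
    alpha_min, i.e. stricter, closer-to-RNG pruning.
    """
    z = (np.asarray(lid, dtype=float) - mu) / sigma
    z = np.clip(z, -10.0, 10.0)          # clamp for numerical robustness
    return alpha_min + (alpha_max - alpha_min) / (1.0 + np.exp(z))

# A node with exactly average LID lands at the midpoint of the range:
a_mid = pruning_alpha(12.0, mu=12.0, sigma=3.0)   # -> 1.25
```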

  • Manifold-Consistent Refinement: Starting from a random $R$-regular graph, iteratively rewire each node by performing a beam search of width $L_{\text{build}}$, collecting candidates, sorting them by distance, and applying a node-specific adaptive occlusion test: keep edge $(u,v)$ only if, for all already-selected neighbors $n$,

\alpha(u)\, d(n,v) > d(u,v)

This mirrors DiskANN’s Vamana refinement, but applies per-node pruning thresholds.
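The occlusion test above can be sketched as follows; this is a simplified in-memory version (the real index operates on disk-resident data), with `R` as the degree bound:

```python
import numpy as np

def adaptive_prune(u, candidates, vectors, alpha_u, R):
    """Vamana-style robust pruning with MCGI's node-specific alpha(u).

    candidates: node ids sorted by ascending distance to u.
    vectors:    mapping node id -> coordinate vector (np.ndarray).
    Keeps edge (u, v) only if alpha(u) * d(n, v) > d(u, v) for every
    previously selected neighbor n, stopping at degree bound R.
    """
    selected = []
    for v in candidates:
        d_uv = np.linalg.norm(vectors[u] - vectors[v])
        if all(alpha_u * np.linalg.norm(vectors[n] - vectors[v]) > d_uv
               for n in selected):
            selected.append(v)
            if len(selected) == R:
                break
    return selected

# Tiny 2-D example: node 2 is occluded by node 1 on the path from node 0
# (d(1,2) = 1 is not greater than d(0,2) = 2), so its edge is pruned.
pts = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]),
       2: np.array([2.0, 0.0]), 3: np.array([0.0, 1.0])}
edges = adaptive_prune(0, candidates=[1, 3, 2], vectors=pts,
                       alpha_u=1.0, R=4)          # -> [1, 3]
```

A larger `alpha_u` shrinks the occlusion ball around each selected neighbor, so more candidate edges survive; `alpha_u = 1.0` reproduces the RNG condition.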

Querying

At query time, MCGI first estimates $\widehat{\mathrm{LID}}(q)$ for the query vector $q$ using its nearest neighbors, then dynamically sets the beam width:

L(q) = C_\delta \exp\bigl(\lambda\, \widehat{\mathrm{LID}}(q)\bigr)

where $C_\delta = -\ln\delta$ for target failure probability $\delta$, and $\lambda$ is estimated empirically. Standard beam search is then used, retrieving graph edges via direct asynchronous I/O with prefetching.
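
A worked sketch of this beam-width rule; only the formula $L(q) = C_\delta \exp(\lambda\,\widehat{\mathrm{LID}}(q))$ comes from the paper, while the $\lambda$ value and the clamp bounds here are illustrative assumptions:

```python
import math

def beam_width(lid_q, delta=0.05, lam=0.35, L_min=10, L_max=512):
    """Query-adaptive beam width L(q) = C_delta * exp(lambda * LID(q)).

    C_delta = -ln(delta) for target failure probability delta; lambda
    would be fit empirically.  Clamped to [L_min, L_max] for practicality.
    """
    C = -math.log(delta)                 # delta = 0.05 -> C ~ 3.0
    L = C * math.exp(lam * lid_q)
    return max(L_min, min(L_max, math.ceil(L)))

# Queries in simple regions get a narrow beam; queries in complex
# (high-LID) regions get an exponentially wider one, up to the clamp.
```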

4. Theoretical Guarantees and Complexity

Guarantee of Connectivity

Because every $\alpha(u) \ge 1$, each edge's occlusion lune under MCGI's pruning is contained within the corresponding RNG lune, so the pruned graph retains every edge of the Relative Neighborhood Graph (RNG), which is itself a supergraph of the Euclidean Minimum Spanning Tree:

E_{\mathrm{EMST}} \subseteq E_{\mathrm{RNG}} \subseteq E_{\mathrm{MCGI}}

This inclusion guarantees that $G_{\mathrm{MCGI}}$ is connected, preserving the ability to reach any target node given a sufficiently large beam.

Search Reliability

By selecting $L(q)$ according to the local LID, MCGI targets a uniform failure probability of at most $\delta$ (equivalently, a recall guarantee of $1-\delta$) across the manifold, regardless of local complexity.

Computational Complexity

  • Index Build: LID calibration requires $O(N\log N)$ time (using approximate $k$-NN); refinement scales as $O(TNR\log L_{\text{build}})$ over $T$ passes, maintaining $O(N)$ scaling for fixed graph degree $R$.
  • Space: Total storage is $O(NR)$ for edges and $O(N)$ for LID values.
  • Query: LID estimation per query is $O(k\log N)$; beam search requires $O(L(q)\log L(q))$ distance calculations; disk I/O is dominated by new-node accesses, typically fewer than $L(q)$ thanks to caching and prefetching.

5. Empirical Performance and Comparative Analysis

Experiments on a Xeon Platinum 8380 server (80 cores, 256 GiB RAM, 480 GB NVMe SSD) show:

  • High-Dimensional Regimes: On GIST1M (960-D), MCGI achieves 5.8× higher QPS at 95% recall (DiskANN: 64.7 QPS; MCGI: 375 QPS) and 55% higher throughput at 97% recall.
  • Billion-Scale Search: On SIFT1B ($10^9$ points, 128-D), MCGI achieves 3× lower mean query latency (DiskANN: 49.06 ms; MCGI: 16.20 ms) and 1.32× higher QPS at approximately 90% recall.
  • Low-Dimensional Performance: On SIFT1M (128-D) and GloVe-100, MCGI matches DiskANN’s QPS at 98% recall, confirming minimal overhead on simple manifolds.
  • Resource Sensitivity: MCGI’s geometry-aware pruning yields nearly identical recall-to-beam curves as DiskANN, with up to 2× lower tail latency (99th percentile) at high recall.
  • I/O Efficiency: In high-dimensional scenarios, MCGI reduces total random disk I/O by 30–60% relative to baselines.

6. Comparison to Prior Approaches

Unlike DiskANN’s static pruning and beam settings, MCGI introduces a node-specific $\alpha(u)$ and a query-adaptive beam width $L(q)$, removing the need for global hyperparameter tuning and achieving uniform performance across heterogeneous local geometries. Compared to SPANN’s IVF-centroid routing, MCGI maintains strong graph connectivity and high-recall efficiency. Its storage footprint is equivalent to DiskANN’s, but with lower I/O in high-dimensional settings.

7. Practical Considerations, Limitations, and Prospects

MCGI’s effectiveness relies on the validity of the manifold hypothesis and on robust LID estimation. In regions with sparse or noisy neighborhoods, the maximum-likelihood LID estimator can misestimate, resulting in suboptimal pruning or routing; the Z-scoring and logistic clamp mitigate this. Geometric calibration adds a one-time index-build cost and substantial scratch-memory overhead for billion-scale datasets (e.g., 200 GB of RAM for SIFT1B), though this cost is amortized across many queries.

Future directions include dynamic or streaming datasets with incremental LID recalibration, exploring alternate local geometric statistics beyond LID, and generalization to non-Euclidean or learned metric spaces, such as manifolds induced by deep embedding methods (Zhao, 5 Jan 2026).

