Milvus HNSW & IVF ANN Indexing
- Milvus HNSW/IVF are advanced ANN indexing solutions that use graph-based and quantization methods to enable efficient dense retrieval in vector databases.
- HNSW leverages a multi-layer proximity graph with tunable parameters (M, efConstruction, efSearch) to optimize recall and latency, while IVF-Flat partitions the vector space into Voronoi cells for memory-efficient scanning.
- Empirical evaluations show HNSW achieves superior recall when using fine-grained chunking and high-dimensional embeddings, whereas IVF-Flat offers a balance between latency and memory usage in large-scale deployments.
Milvus HNSW and IVF-Flat refer to two leading approximate nearest neighbor (ANN) vector indexing backends implemented within the Milvus (v2.x) open-source vector database. Both serve as high-throughput, scalable sublinear-time solutions for dense retrieval in large embedding-based search systems. These indexes support a range of text embedding models and chunking strategies. Their comparative behavior, parameterization, and impact on search quality are documented in "Evaluating Embedding Models and Pipeline Optimization for AI Search Quality" (Zhong et al., 27 Nov 2025).
1. Principles and Data Structures
HNSW (Hierarchical Navigable Small World)
HNSW is a multi-layer proximity graph in which each layer forms a sparse "small-world" graph. Higher layers possess progressively fewer nodes with long-range links, and only the bottom layer contains all data points with their local links. Query-time search begins at the entry point in the top layer and follows a greedy procedure: at each level, the search moves along neighbor links to the node closest to the query, then descends to the layer below. Upon reaching the lowest layer, a best-first search with a priority queue (of tunable size efSearch) is used to gather approximate K-nearest neighbors.
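A minimal sketch of the bottom-layer best-first search follows, assuming an adjacency-list graph, Euclidean distance, and a precomputed entry point (the upper-layer greedy descent is omitted); all names are illustrative, not from the paper:

```python
import heapq
import numpy as np

def best_first_search(query, vectors, neighbors, entry, ef):
    """Approximate nearest-neighbor candidates on one HNSW layer.

    vectors:   (n, d) array of stored embeddings
    neighbors: dict mapping node id -> list of linked node ids
    entry:     entry-point node id (found in real HNSW by descending the upper layers)
    ef:        size of the dynamic candidate list (efSearch)
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]    # min-heap of nodes to expand
    results = [(-dist(entry), entry)]      # max-heap (negated) of current best ef nodes
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:             # closest unexpanded candidate is already worse
            break                          # than the worst kept result: stop
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-d, i) for d, i in results)  # (distance, node id) pairs, nearest first
```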
Key HNSW parameters include:
- M: Maximum number of bi-directional links per node (controls the index sparsity/accuracy trade-off).
- efConstruction: Size of the dynamic candidate list during index build (affects graph quality vs. build cost).
- efSearch: Candidate list size during query (controls recall/latency trade-off).
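In Milvus 2.x these parameters are supplied at index-build and search time. A minimal pymilvus sketch, assuming a deployed Milvus instance and an existing collection with an `embedding` vector field (the collection name, field name, and `query_vector` are illustrative):

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
col = Collection("council_chunks")          # hypothetical collection name

# Build-time parameters: M and efConstruction.
col.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",                # inner product on normalized embeddings ~ cosine
        "params": {"M": 32, "efConstruction": 200},
    },
)
col.load()

# Query-time parameter: ef (Milvus's name for efSearch).
hits = col.search(
    data=[query_vector],                    # list of query embeddings
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 128}},
    limit=5,
)
```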
IVF-Flat (Inverted File with Flat Quantization)
IVF-Flat partitions the vector space into coarse Voronoi cells (clusters) using a quantizer. Each vector is assigned to its nearest centroid and stored verbatim in its respective cell. Querying assigns the query to its nearest centroids and scans all vectors in the top-nprobe cells with exact distance computation to identify the top-K matches.
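A conceptual numpy sketch of that query path (coarse assignment against trained k-means centroids, then an exact flat scan of the probed cells); centroid training and cell construction are omitted, and all names are illustrative:

```python
import numpy as np

def ivf_flat_search(query, centroids, cell_vectors, cell_ids, nprobe, k):
    """cell_vectors[c] / cell_ids[c]: raw vectors and original ids stored in cell c."""
    # 1. Coarse quantization: pick the nprobe closest Voronoi cells.
    cell_dists = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(cell_dists)[:nprobe]

    # 2. Exact (flat) distance computation inside the probed cells only.
    cand_vecs = np.vstack([cell_vectors[c] for c in probed])
    cand_ids = np.concatenate([cell_ids[c] for c in probed])
    dists = np.linalg.norm(cand_vecs - query, axis=1)

    # 3. Return the top-k candidates by exact distance.
    order = np.argsort(dists)[:k]
    return cand_ids[order], dists[order]
```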
Key IVF-Flat parameters:
- nlist: Number of Voronoi partitions.
- nprobe: Number of clusters scanned per query, governing recall and latency.
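Continuing the pymilvus sketch above, the corresponding Milvus configuration for IVF-Flat (again illustrative, reusing the same collection handle and query vector):

```python
# Build-time parameter: nlist (number of Voronoi cells).
col.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "IP",
        "params": {"nlist": 1024},
    },
)
col.load()

# Query-time parameter: nprobe (cells scanned per query).
hits = col.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=5,
)
```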
2. Experimental Environment and Embedding Models
Experiments were conducted on Milvus (v2.x), deployed across multi-CPU, high-memory infrastructure within Cisco's internal compute cluster; embedding inference leveraged GPUs (NVIDIA A100/V100) for large-scale models. The evaluation dataset consists of 11,975 manually validated query-chunk pairs, derived from US City Council transcripts, with questions synthesized by an 8×7B Mistral LLM.
Embedding models and their dimensions included:
- all-mpnet-base-v2 (768-d)
- BGE-base-en-v1.5 (768-d)
- BGE-large-en-v1.5 (1024-d)
- GTE-base-en-v1.5 (768-d)
- GTE-large-en-v1.5 (1024-d)
- Qwen3-Embedding-0.6B (1024-d)
- Qwen3-Embedding-4B (2560-d)
- Qwen3-Embedding-8B (4096-d)
Chunking strategies assessed:
- Fixed 2,000-character length (baseline)
- Fixed 512-character length (fine-grained)
- Semantic chunking aligned with discourse boundaries
In the pipeline, transcripts are preprocessed, embeddings are generated per chunk, and the chunks are then indexed in Milvus for ANN retrieval.
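A hedged end-to-end sketch of that pipeline, assuming fixed 512-character chunking, a sentence-transformers GTE checkpoint, and an existing Milvus collection whose schema has an auto-id primary key plus `embedding` and `text` fields (the schema, file path, and collection name are assumptions, not details from the paper):

```python
from pymilvus import connections, Collection
from sentence_transformers import SentenceTransformer

def chunk_fixed(text, size=512):
    """Fixed-length character chunking (the paper's fine-grained setting)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

connections.connect(host="localhost", port="19530")
col = Collection("council_chunks")                       # hypothetical collection
model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

transcript = open("transcript.txt").read()               # one preprocessed transcript
chunks = chunk_fixed(transcript)
embeddings = model.encode(chunks, normalize_embeddings=True)

# Column-oriented insert: one list per non-auto-id field, in schema order.
col.insert([embeddings.tolist(), chunks])
col.flush()
```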
3. Evaluation Metrics
Search quality and scalability were evaluated using the following metrics:
- Top-K Accuracy: Fraction of queries whose ground-truth chunk appears among the top-K retrieved results (reported at K = 3 and K = 5).
- Normalized Discounted Cumulative Gain (NDCG): Rank-weighted measure that discounts relevant chunks retrieved at lower positions (also reported at K = 3 and K = 5).
- Latency: Average query response time (ms) for top-K results.
- Memory Usage: Total RAM consumed (graph pointers for HNSW; cluster assignments for IVF).
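With a single relevant chunk per query, as in the validated query-chunk pairs, both rank metrics reduce to simple per-query formulas; a minimal sketch (the rank list format is an assumption for illustration):

```python
import math

def topk_accuracy(ranks, k):
    """ranks: 1-based rank of the gold chunk per query (None if it was not retrieved)."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """With one relevant item per query, DCG = 1/log2(rank + 1) and IDCG = 1,
    so NDCG@K averages 1/log2(rank + 1) over queries, counting 0 when rank > K."""
    gains = [1.0 / math.log2(r + 1) if r is not None and r <= k else 0.0 for r in ranks]
    return sum(gains) / len(gains)
```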
4. Comparative Performance Analysis
The quantitative impact of index type, model dimension, and chunking on retrieval accuracy is summarized below for selected scenarios:
| Embedding/Chunking | Index Type | Acc@3 | NDCG@3 | Acc@5 | NDCG@5 |
|---|---|---|---|---|---|
| GTE-large, 2K-char chunk | HNSW | 0.412 | 0.356 | 0.462 | 0.378 |
| GTE-large, 512-char chunk | HNSW | 0.460 | 0.415 | 0.500 | 0.431 |
| GTE-large, 512-char chunk | IVF-Flat | 0.427 | 0.387 | 0.472 | 0.414 |
| Qwen3-8B (4096-d), 2K chunk | HNSW | 0.571 | 0.516 | 0.662 | 0.583 |
Key findings:
- Reducing chunk size from 2,000 to 512 characters increases Acc@3 by approximately 4.8 percentage points (pp) under HNSW.
- On 512-character chunks, HNSW outperforms IVF-Flat by ~3.3pp in Acc@3.
- IVF-Flat typically has slightly lower recall but consumes less memory and exhibits higher latency per candidate scan.
- For Qwen3-8B (4096-d), HNSW achieves the highest observed accuracy (Acc@3 = 0.571); IVF-Flat performance for this setting is unreported but, by analogy, is expected to trail by 3–5pp.
This suggests that for given compute resources and memory budgets, HNSW offers the best recall/accuracy, but IVF-Flat may suit extreme scale or memory-constrained environments.
5. Tuning and Optimization Strategies
Parameter tuning recommendations from empirical results:
HNSW
- M: Increasing M from 32 to 48 provides minor recall gains, with a ~50% increase in graph edge count.
- efConstruction: Set to 128–256 to enhance graph quality at a linear increase in build time.
- efSearch: Moderate values (e.g., 128) already yield high recall; further increases show diminishing returns.
IVF-Flat
- nlist: Scale with corpus size, on the order of √N for N stored vectors (e.g., N = 1M gives nlist on the order of 1,000).
- nprobe: nprobe = 8 reaches 90–95% recall; nprobe = 16–32 recovers most of HNSW recall at roughly linear latency cost.
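Such settings can be validated with a simple sweep against the loaded collection, measuring accuracy and latency per nprobe value; the same loop applies to HNSW's ef. A sketch, assuming query embeddings, gold chunk ids, and an IP metric (all illustrative):

```python
import time

def sweep_nprobe(col, queries, golds, values=(4, 8, 16, 32), k=3):
    """Measure Acc@k and mean per-query latency for each nprobe setting.

    queries: list of query embeddings; golds: gold chunk id per query.
    """
    for nprobe in values:
        param = {"metric_type": "IP", "params": {"nprobe": nprobe}}
        start = time.perf_counter()
        res = col.search(data=queries, anns_field="embedding", param=param, limit=k)
        latency_ms = (time.perf_counter() - start) * 1000 / len(queries)
        acc = sum(1 for hits, gold in zip(res, golds)
                  if gold in [h.id for h in hits]) / len(golds)
        print(f"nprobe={nprobe:3d}  Acc@{k}={acc:.3f}  latency={latency_ms:.1f} ms/query")
```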
Chunking and Embedding Dimension
- Finer-grained fixed chunks (512 characters) increase Top-K accuracy by roughly 5 pp (4.8 pp in Acc@3); semantic chunking achieves similar gains and retains discourse coherence.
- Higher-dimensional embeddings (e.g., Qwen3-8B, 4096-d) deliver better accuracy but incur roughly 4× the storage and at least 2× the query latency of 1024-d models.
Neural Re-ranking
- Two-stage pipelines (GTE-large retrieval, BGE cross-encoder re-ranking) boost Acc@3 from 0.412 to 0.506 for 2,000-character chunks (+9.4pp).
- Reranking bridges much of the accuracy gap for legacy or lower-grade indexes without requiring re-indexing.
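A two-stage sketch of this kind, assuming the HNSW retrieval configured earlier, a `text` field stored alongside each embedding, and a BGE reranker loaded via sentence-transformers (the exact cross-encoder checkpoint is not specified in the paper, so `BAAI/bge-reranker-base` is an illustrative choice):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")        # illustrative checkpoint

def retrieve_and_rerank(col, query_text, query_vec, k_retrieve=20, k_final=3):
    # Stage 1: approximate retrieval from the ANN index (HNSW here).
    hits = col.search(data=[query_vec], anns_field="embedding",
                      param={"metric_type": "IP", "params": {"ef": 128}},
                      limit=k_retrieve, output_fields=["text"])[0]
    passages = [h.entity.get("text") for h in hits]

    # Stage 2: the cross-encoder scores every (query, passage) pair exactly.
    scores = reranker.predict([(query_text, p) for p in passages])
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:k_final]]
```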
6. Best Practices and Use-Case Fit
Principal recommendations, as established by the experimental results:
- Where resources permit, HNSW with M = 32–48, efConstruction ≥ 128, and efSearch = 128–256 is favored.
- For massive-scale or memory-limited deployments, IVF-Flat (nlist scaled to corpus size, nprobe tuned to the latency SLA) is advantageous.
- Fixed 512-character or semantic chunking is preferred to reduce retrieval noise.
- Lightweight cross-encoder reranking maximizes Top-K precision with minimal extra latency.
- A plausible implication is that two-stage retrieval architectures (approximate search plus neural reranking) represent a robust configuration for text-centric vector search pipelines at scale (Zhong et al., 27 Nov 2025).