A Super Fast K-means for Indexing Vector Embeddings

Published 20 Mar 2026 in cs.LG, cs.DB, and cs.IR | (2603.20009v1)

Abstract: We present SuperKMeans: a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans' clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early-terminates k-means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at https://github.com/cwida/SuperKMeans

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces SuperKMeans, a fast k-means algorithm that integrates dimension pruning and GEMM operations for high-dimensional vector embeddings.
It achieves up to 7x speedups on CPU and 4x on GPU by pruning over 97% of centroids and employing recall-driven early termination.
Empirical evaluations show scalability over diverse architectures and datasets, ensuring minimal quality degradation for vector search indexing.

SuperKMeans: Accelerating K-Means for High-Dimensional Vector Embedding Indexing

Introduction and Motivation

The scalability and efficiency of clustering high-dimensional vector embeddings are essential for contemporary vector search systems, particularly those underpinning large-scale information retrieval, recommendation engines, and retrieval-augmented generation (RAG). Traditional $k$ -means implementations, including those found in state-of-the-art libraries such as FAISS and Scikit-Learn, remain limited by their reliance on a vanilla Lloyd's algorithm, with primary optimizations tethered to highly tuned BLAS GEMM routines. However, the need for large $k$ and high $d$ values in partition-based indexes (e.g., IVF) exposes $k$ -means to severe compute bottlenecks.

SuperKMeans addresses these bottlenecks by tightly integrating dimension pruning via adaptive sampling and progressive thresholding, interleaved with GEMM operations and specialized memory layouts. This results in substantial speedups—up to 7x over CPU baselines and 4x over GPU baselines—while maintaining clustering quality and index suitability for vector similarity search (VSS).

Figure 1: Clustering performance for all competitors in OpenAI's 1M vectors of 1536 dimensions; SuperKMeans yields up to 7x (CPU) and 4x (GPU) speedups as $k$ grows.

Algorithmic Contributions

SuperKMeans' acceleration centers on two-phase iteration within the assignment step of Lloyd's algorithm:

Phase 1 (GEMM): Compute partial distances using the leading $d'$ dimensions via GEMM, optimizing cache locality and SIMD utilization.
Phase 2 (PRUNING): Employ Adaptive Sampling (ADSampling) and memory-resident centroid layouts (PDX) to prune over 97% of centroids before full-dimensional computations. Progressive pruning with block decomposition further reduces unnecessary arithmetic.
Figure 2: SuperKMeans iteration phases: first, GEMM for partial distances; second, PRUNING for full assignment via dimension-wise selective evaluation.

Early Termination by Recall (ETR) is introduced, stopping iterations once further improvements in VSS recall plateau, eliminating guesswork for iteration count and reducing redundant computation. ETR can be transferred to any iterative clustering pipeline.

Empirical Evaluation

Clustering Time and Quality

SuperKMeans' superiority is demonstrated across ten modern embedding datasets and multiple CPU microarchitectures—Intel Granite Rapids, AMD Zen variants, AWS Graviton, and Apple M4—using dataset-specific BLAS kernels. With $k = 4\sqrt{N}$ , SuperKMeans achieves 3-4x speedups in high-dimensional settings and matches FAISS and Scikit-Learn in recall and cluster balance.

Figure 3: Clustering time (CPU) for various competitors across multiple datasets; SuperKMeans excels in $d > 384$ settings.

Scalability with k

As $k$ increases to 100,000, SuperKMeans maintains or increases its speedup factor relative to competitors, confirming its suitability for large-scale vector indexes. At low $k$ , performance converges due to diminished benefits of pruning (all centroids fit within cache).

Figure 4: Speedup vs k on CPU; SuperKMeans accelerates increasingly with larger k.

Impact of Sampling and Iterations

SuperKMeans maintains high recall even when clustering with 20–30% of the data, and cluster balance stabilizes at these sampling rates. ETR generally terminates after 5–10 iterations, minimizing computational cost with negligible recall degradation.

Figure 5: ETR triggers fewer iterations with minor impact on VSS recall; cluster balance stabilizes rapidly.

Figure 6: Sampling effects: minimal recall loss, moderate increase in vectors explored at low sample fractions.

Multithreading and Microarchitecture Portability

SuperKMeans scales nearly linearly with thread count and demonstrates consistent speedups across all SIMD instruction sets and CPU architectures, outperforming both FAISS and Scikit-Learn, especially when register sizes are limited (Graviton 4, Apple M4).

Figure 7: SuperKMeans speedups on diverse CPU microarchitectures, demonstrating portability.

Initialization and Ablation Results

Random centroid initialization matches or surpasses $k$ -means++ in speed and index quality for VSS. Progressive pruning, PDXification, and threshold adjustment combine for maximal speedup; omitting these optimizations incurs substantial slowdowns.

Figure 8: $k$ -means++ initialization is orders of magnitude slower and not beneficial for index quality.

GPU Implementation

On NVIDIA H100, SuperKMeans retains its two-phase architecture, utilizing cuBLAS for GEMM and custom CUDA kernels for block-wise pruning. Aggressive parallelization and register-level memory overlapping mitigate occupancy bottlenecks. Speedups against FAISS and cuVS are most pronounced at large k; preprocessing cost on CPU constrains speedup for small k.

Figure 9: SuperKMeans achieves up to 4x GPU speedup as k increases; preprocessing is amortized for large k.

Hierarchical SuperKMeans

Hierarchical $k$ -means, where SuperKMeans is deployed in both meso- and fine-clustering phases, achieves even greater speedups—up to 250x over FAISS—enabling blazing-fast clustering for very large collections (>10M vectors). The trade-off is minor recall degradation (<3%) but improved cluster balance and reduced vectors explored in VSS.

Figure 10: Hierarchical SuperKMeans attains unprecedented 250x clustering speedup vs FAISS.

Theoretical and Practical Implications

SuperKMeans is applicable to any variant requiring fast assignment over large $k$ and $d$ . ETR simplifies hyperparameter tuning and can benefit iterative quantization and refinement pipelines, e.g., Product Quantization and hybrid indexes. The principal insight is that most centroids are functionally irrelevant for assignment in high-dimensional spaces, enabling pragmatic acceleration by design.

Furthermore, the core pruning motif and memory layout advances (PDX) also transfer to VSS, drastically improving query speed and footprint relative to traditional IVF and graph-based (HNSW) approaches.

Conclusion

SuperKMeans decisively advances the practical speed and scalability of $k$ -means clustering for high-dimensional vector embedding collections used in modern vector search engines. By combining interleaved GEMM with dimension-adaptive pruning, and by plugging in recall-driven early termination, SuperKMeans accomplishes superior runtime without quality loss, on both CPUs and GPUs. Its modular design enables integration with hierarchical and minibatch variants, as well as direct application to advanced quantization and indexing methods. The findings indicate that state-of-the-art vector search systems should migrate toward SuperKMeans-based index pipelines to unlock efficient, scalable embedding clustering.

Future work includes integration with dimensionality-reduced embeddings, further GPU kernel optimization (e.g., Flash-KMeans), and expanding SuperKMeans to additional iterative index construction paradigms.

Figure 11: PDX pruning, originally motivated by SuperKMeans, makes IVF competitive with HNSW, reducing index build time and storage footprint.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

A super-fast way to group big vectors for search

What this paper is about

This paper introduces SuperKMeans, a faster way to run k-means clustering on very large, high‑dimensional vectors (long lists of numbers), like the ones used by AI models. The goal is to build vector search indexes more quickly without hurting search quality. Think of it as a faster “organizer” that helps search systems (like recommendation engines or AI assistants) find similar items quickly.

What questions the paper tries to answer

Can we make k-means clustering much faster for large, high‑dimensional data used in vector search?
Can we keep search quality (how well we find the right matches) the same while going faster?
Can we tell when to stop k-means early, as soon as it stops helping search results?
How many k-means iterations and how much data do we really need to get great results?
Do fancy starting choices (like k-means++) actually help here—or hurt?

Key ideas and how the method works (in simple terms)

A quick reminder: what is k-means?

You have lots of points (vectors). You want to group them into k groups (clusters).
Each group has a center (called a centroid).
You repeatedly: 1) assign each point to its closest center, 2) move centers to the middle of their assigned points, 3) repeat until things stop changing much.

This is used in vector search (like IVF indexes) to route a query to the most promising clusters first, so you don’t have to check everything.

The core trick: “peek early, skip most”

SuperKMeans speeds things up by not doing full work for every comparison:

Phase 1: Partial check (fast peek)
- It first compares only a small portion (about 12.5%) of each vector’s dimensions using super‑fast matrix math (GEMM; think “turbo calculator for big matrix multiplications”). This gives a quick, rough idea of which centers are promising.
Phase 2: Pruning (skip the rest)
- Using that rough info, it skips fully comparing most centers that are clearly not the best match. It only finishes the check for a tiny fraction of possible centers.
- It uses a technique called Adaptive Sampling (ADSampling) to decide safely when to stop early. In practice, it skips over 97% of unnecessary checks.

To make Phase 2 fast, SuperKMeans stores centers in a memory layout (PDX) that lets the computer read the right pieces quickly. It also rotates the data once at the start (a standard trick) so useful information is spread across dimensions more evenly.

Stop early when search quality stops improving

SuperKMeans adds Early Termination by Recall (ETR):

It watches the actual search quality (“recall”—how many of the true nearest neighbors we find) after each iteration.
If recall stops improving beyond a tiny amount for two rounds, it stops early.
This saves time and avoids wasting iterations that don’t help.

Small extra improvements

It adjusts how big the “peek” is (the number of early dimensions) so that pruning stays effective (aiming to prune about 95–97% of centers).
It supports sampling: you can cluster on only part of your data to go faster, then still get good search quality.

What did they test and how?

They compared SuperKMeans to popular libraries: FAISS and Scikit‑Learn on CPUs, and cuVS on GPUs.
They used many real vector datasets (text, images, word embeddings) with up to 10 million vectors and hundreds to thousands of dimensions.
They measured:
- Speed (time to build clusters/index),
- Search quality (recall) when those clusters are used in a vector search index (IVF),
- How many vectors need to be checked during search,
- How performance scales with more clusters (larger k) and more threads/cores.

Main results and why they matter

Much faster, same quality:
- Up to about 7× faster than FAISS and Scikit‑Learn on modern CPUs.
- Up to about 4× faster than cuVS on GPUs.
- When combined with hierarchical k‑means, they show a case that is about 250× faster than FAISS for indexing 10M vectors (1024 dimensions).
- Despite the speedups, search quality (recall) is essentially the same as competitors.
Early stop saves time:
- ETR often stops earlier than FAISS’s default and much earlier than Scikit‑Learn’s, with almost no loss in recall.
Big k and high dimensions are where it shines:
- The more clusters you want (big k) and the higher the dimensions, the bigger the speedup.
You don’t need many iterations or all the data:
- Just 5–10 iterations are usually enough to reach top search quality.
- Using only 20–30% of your data to train the clusters often gives you near‑best results, though using more data can balance clusters better (which can reduce search work later).
k-means++ isn’t helpful here:
- For these high‑dimensional embedding datasets, fancy initialization like k‑means++ can actually hurt final index quality. Simple random initialization works well.

Why this matters (impact)

Faster index building:
- Systems like search engines, recommenders, and RAG applications can build or refresh their vector indexes much faster—saving time and money.
Works across hardware:
- It runs well on many CPUs (Intel, AMD, ARM/Apple) and GPUs, making it practical in data centers or the cloud.
Better developer experience:
- ETR removes guesswork about “how many iterations to run,” stopping automatically when improvements stop.
Greener computing:
- Doing far less unnecessary work reduces energy use.
Open source:
- The code is available, so others can use, test, and improve it.

In short, SuperKMeans keeps the good search results you expect, but gets there much faster by doing smart early checks, skipping most of the unnecessary work, and stopping as soon as it stops helping.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, organized by theme to guide future work.

Algorithmic guarantees and failure modes

Lack of formal correctness/quality guarantees for ADSampling-driven pruning within Lloyd’s assignments: quantify worst-case misassignment probability, bound on k-means objective increase per iteration, and conditions under which pruning remains exact.
No theoretical analysis of convergence under approximate assignments: does SuperKMeans converge to a stationary point of the $k$ -means objective, and how do pruning-induced errors affect convergence rate and fixed points?
Unclear failure modes and dataset regimes where pruning becomes harmful (e.g., very high intrinsic dimensionality, nearly isotropic distributions, small $k$ , or highly imbalanced clusters). Provide diagnostic criteria to decide when to disable pruning or revert to full GEMM.
Effect of pruning on centroid drift and bias across iterations is unquantified: does early misassignment create systematic centroid displacement that persists despite later iterations?
Impact of empty-cluster handling (UpdateCentroidsAndSplitEmpty) on stability and balance is not detailed: compare alternative split strategies and quantify their effect on WCSS, recall, and explored vectors.

Hyperparameters and auto-tuning

The target pruning ratio (95–97%) and initial $d'$ ≈ 12.5% of $d$ are empirical; provide principled methods (e.g., online control, bandits, or Bayesian optimization) to learn $d'$ and adjustment step sizes per dataset and iteration.
Sensitivity analysis is missing for key knobs: $d'$ step size (20%), PDX block size (64), chunk sizes for $X$ / $Y$ , and OpenMP chunk size (8). Provide cross-architecture guidelines and automatic tuning policies.
First-iteration policy is fixed to full GEMM; investigate whether partial GEMM + PRUNING is viable in iteration 1 (possibly with a conservative $d'$ ) or if a brief warm-up on a sample can safely skip full GEMM.

Early Termination by Recall (ETR)

ETR uses recall measured on user-supplied or data-sampled queries; risk of overfitting to non-representative queries is unaddressed. Evaluate ETR on held-out query sets or production logs to ensure generalization.
Sensitivity of ETR to search parameters is unexplored: termination may differ across $n_{\text{probe}}$ , top- $K$ , and metric; provide a robust, multi-parameter ETR or a procedure to calibrate ETR for deployment settings.
Tolerance and patience values (e.g., 0.005, patience=2) are empirical; propose criteria or automated selection based on variance of recall estimates and desired confidence.
For large collections (e.g., $N \geq 10^7$ ), computing ground truth for 1000 queries may no longer be “minimal” overhead; evaluate scalable/approximate ground truth strategies and their impact on ETR decisions.

Generality across metrics and data types

The method is primarily evaluated under $L_2$ /angular equivalences; generalization to inner product (MIPS), cosine without normalization, or other metrics (e.g., learned or Mahalanobis distances) is not described.
It is unclear how ADSampling thresholds $\theta(\cdot)$ and pruning decisions adapt to different metrics; provide metric-specific derivations or calibrations.
Sparse or mixed-modal embeddings (e.g., very high-dimensional sparse text, multi-view embeddings) are not evaluated; assess whether PDX and pruning remain effective.

GPU-specific design and limits

GPU implementation details are sparse: kernel fusion strategy, memory access patterns, warp-level load balancing, use of cuBLASLt/CUTLASS, and interaction with TF32/FP16/BF16 precisions need explication.
Memory footprint and tiling strategies on GPU (for large $k$ and $d$ ) are not documented; quantify peak memory, batch sizing, and the trade-off between tiling granularity and kernel-launch overhead.
Portability across GPU architectures (NVIDIA generations, AMD/Intel GPUs) and sensitivity to math modes (TF32 vs FP32) are untested; evaluate numeric stability and performance variance.

Systems scalability and scheduling

Thread-level load imbalance in PRUNING is acknowledged but unresolved for many-core settings (e.g., ≥64 threads): evaluate advanced scheduling (work stealing, two-dimensional partitioning over points and centroids) and NUMA-aware strategies.
Distributed or multi-node indexing is not evaluated: explore how SuperKMeans composes with hierarchical k-means in distributed settings (data partitioning, centroid synchronization, communication costs).
Energy efficiency and cost-performance (perf/Watt) across CPUs and GPUs are not reported; provide analyses to guide deployment choices in cloud vs on-prem environments.

Memory footprint and precision

End-to-end memory usage (distance buffers, norms, PDX-transposed centroids) is not quantified for large $k$ and $d$ ; document peak RAM/VRAM and propose memory-saving modes (streaming, reduced precision).
Effects of lower precision (FP16/BF16/INT8) on GEMM, PRUNING thresholds, and final recall are unexplored; characterize numeric stability and potential speedups with mixed precision.

Evaluation scope and methodology

IVF performance was assessed at a single probe rate (1% of $k$ ); provide full $n_{\text{probe}}$ –recall curves to show robustness across operating points.
Sampling analysis focuses on recall and explored vectors but not on latency distributions during search; report p50/p95/p99 search latency impacts due to cluster imbalance from sampling.
The claim that $k$ -means++ is detrimental to index quality is referenced but not fully substantiated across diverse datasets and $k$ ; provide broader evidence and analyze when/why $k$ -means++ harms IVF quality.
Real-world workloads (e.g., non-stationary query distributions, multi-tenant scenarios) are not simulated; evaluate robustness when query distributions shift after indexing.

Integration with broader indexing pipelines

Interaction with PQ/OPQ training and residual quantization is untested: does SuperKMeans preserve downstream ANN accuracy and compression efficacy compared to vanilla training?
Incremental/online updates (insertions/deletions and periodic re-clustering) are not addressed; design and evaluate incremental SuperKMeans and ETR policies for continuously evolving collections.
The reported “250× faster” hierarchical $k$ -means result lacks methodological details (branching factor, tree depth, sampling per level, quality trade-offs); provide a transparent evaluation and quality benchmarks.

Practical reproducibility and portability

Results rely on highly optimized BLAS backends (IntelMKL, AOCL-BLIS, Apple AMX); report performance with vendor-neutral OpenBLAS and across compiler flags to ensure reproducibility.
Provide a detailed description of the ADSampling threshold function $\theta(\cdot)$ and “likely()” decision rule used in implementation to enable faithful re-implementation by others.
Document integration pathways with common vector databases (e.g., Milvus, FAISS IVF variants) and migration considerations (data layout conversions, memory planners, API compatibility).

Miscellaneous open directions

Explore structured random rotations (e.g., SRHT/FJLT) to reduce the cost of preprocessing versus dense $d\times d$ rotations while preserving pruning effectiveness.
Investigate learned or data-adaptive $d'$ selection (e.g., regressors on simple dataset statistics: kurtosis/skewness per dimension, intrinsic dimensionality) to predict speedups and choose strategies a priori.
Study robustness to outliers and rare clusters, and whether additional regularization or robust $k$ -means variants (trimmed, $k$ -medoids initializations) improve IVF quality under pruning.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

These applications can be deployed with the paper’s open-source implementation and current ecosystems (FAISS, cuVS, Milvus, Weaviate, etc.), leveraging SuperKMeans’ two-phase GEMM+PRUNING, Adaptive Sampling, PDX-friendly layout, and Early Termination by Recall (ETR).

Accelerated IVF index construction in vector databases
- Sectors: software (vector databases, search platforms), e-commerce, media, legal, finance, healthcare
- What: Replace vanilla k-means/coarse quantizer training with SuperKMeans to build IVF/IVF-PQ indexes 3–7x faster on CPUs and up to 4x on GPUs, with matched recall and cluster balance.
- Tools/workflows: Integrate into FAISS- or cuVS-based build pipelines; add SuperKMeans as a clusterer option in Milvus/Weaviate/Qdrant/LanceDB; provide a CLI/SDK “superkmeans index” command for ingestion nodes.
- Assumptions/dependencies: High-dimensional embeddings (typically d ≥ 384); k in the √N to 4√N range; optimized BLAS (MKL/BLIS/Accelerate/OpenBLAS compiled for the target CPU) or cuBLAS; distance metric compatible with ADSampling (L2 or cosine via normalization).
Faster reindexing and continuous refresh for RAG and semantic search
- Sectors: software (LLM/RAG platforms), enterprise search, developer tools
- What: Cut reindexing SLAs from hours to minutes; enable frequent ingestion/retraining of embeddings without degrading recall.
- Tools/workflows: Use ETR to auto-stop when recall plateaus and sample 20–30% of data for centroid training; schedule nightly/rolling re-clusters.
- Assumptions/dependencies: Representative query set (or sampled queries) to drive ETR; ground truth estimation cost is small but nonzero (4–6% runtime in paper’s profiling).
Cost and energy reduction for index building
- Sectors: cloud platforms, FinOps, sustainability
- What: Reduce CPU/GPU-hours and carbon footprint for large-scale indexing jobs (10M+ vectors), especially in multi-tenant/managed services.
- Tools/products: “Green indexing” preset using [email protected] and 20–30% sampling; Graviton/ARM and Apple Silicon builds validated by the paper.
- Assumptions/dependencies: Benefits largest at high d and large k; diminishing returns for very small k (e.g., k < 100).
Default changes in production pipelines: random init, fewer iterations
- Sectors: software (IR/ML infrastructure)
- What: Switch from k-means++ to random initialization; cap iterations at 5–10 with ETR—empirically no loss in retrieval quality.
- Tools/workflows: Update default configs in FAISS-based pipelines (e.g., IVF training iters=10, init=random).
- Assumptions/dependencies: Embedding regimes with fuzzy Voronoi cells (typical for modern LLM/CLIP embeddings); monitor recall to validate.
Aggressive yet safe sampling for faster builds
- Sectors: all using vector search at scale
- What: Train centroids with 20–30% of points while keeping recall stable; balance explored-vectors vs. build-time.
- Tools/workflows: Sampling-aware trainer that targets a fixed “vectors explored” budget; auto-tune sampling until cluster balance stabilizes.
- Assumptions/dependencies: Heavier sampling can increase explored vectors at query time if clusters become imbalanced; validate on workload.
GPU-accelerated clusterer in existing indexing services
- Sectors: cloud AI services, MLOps platforms
- What: Use SuperKMeans GPU mode (with cuVS) for multi-million vector datasets; target 4x speedups.
- Tools/workflows: Containerized “index-builder” jobs with CUDA/cuBLAS/cuVS; auto-switch CPU/GPU by data scale and queue depth.
- Assumptions/dependencies: Sufficient GPU memory; tuned kernel launch configs; consistent normalization pipeline for cosine/L2.
Hierarchical k-means turbo-builds for very large collections
- Sectors: e-commerce catalogs, social media, threat intel, logging/observability
- What: Combine SuperKMeans with hierarchical k-means for 10M+ embeddings (paper reports up to 250x vs FAISS in one setup).
- Tools/workflows: Multi-level coarse quantizer training; staged ingestion where upper levels are pre-trained and reused.
- Assumptions/dependencies: Engineering integration effort for multi-level training; careful memory management and batch scheduling.
On-device/edge indexing for personal search and media
- Sectors: daily life, mobile, edge/IoT
- What: Local semantic search over notes, photos, emails, or code snippets with reduced battery/time cost; Apple M-series and ARM validated.
- Tools/products: Mobile/desktop apps that build small IVF indexes locally; “quick reindex” button using ETR and sampling.
- Assumptions/dependencies: Smaller k reduces pruning benefits; still observe wins at high-d models; need OS-optimized BLAS.
Recommender systems: faster periodic re-clustering
- Sectors: media, retail, fintech
- What: Refresh item/user embedding clusters more frequently to capture drift without missing SLAs.
- Tools/workflows: Nightly/weekly batch jobs with ETR; evaluate recall on representative interaction queries.
- Assumptions/dependencies: Requires a query proxy (historical interactions) that is predictive of online recall.
Academic and benchmarking uses
- Sectors: academia, education
- What: Use open-source SuperKMeans to study high-d clustering, SIMD/GEMM co-design, and recall-aware early stopping.
- Tools/workflows: Course labs on CPU vs GPU performance, ablations (PDXification, initial thresholds), dataset-wide scaling (k from 10² to 10^5).
- Assumptions/dependencies: Reproduction needs tuned BLAS and careful dataset normalization.

Long-Term Applications

These opportunities require additional research, engineering, or scaling beyond the paper’s current implementation.

Streaming and incremental SuperKMeans for dynamic corpora
- Sectors: news feeds, social, security, observability
- What: Online updates to centroids with recall-aware early stopping under concept drift.
- Tools/products: Incremental clusterer with mini-batches; periodic ETR checks on rolling query windows.
- Dependencies: Algorithmic extensions for true online k-means; stability guarantees; background rebalancing for cluster skew.
Hybrid graph–partition indexes with fast re-partitioning
- Sectors: vector DBs, large search platforms
- What: Use SuperKMeans for rapid re-partitioning that improves navigability of graph-based ANN (e.g., HNSW) or SPANN-like hybrids.
- Tools/workflows: Periodic “re-shard and rewire” jobs that minimize downtime; two-phase rollout of new centroids.
- Dependencies: Integration with graph maintenance and consistency; memory headroom for dual-index phases.
Recall-aware, self-tuning vector DBs (closed-loop SLOs)
- Sectors: managed databases, cloud AI services
- What: Systems that auto-tune d′, sampling ratio, k, and ETR tolerance to meet latency/recall SLOs per-collection.
- Tools/products: Control-plane service with telemetry-driven tuning; per-tenant policies (latency budgets, energy caps).
- Dependencies: Real-time recall estimation (ground truth proxies); safe exploration; robust monitoring.
Privacy-preserving and federated indexing
- Sectors: healthcare, finance, gov
- What: Leverage random rotations and pruning-friendly layouts as part of privacy-enhancing pipelines; federated centroid training across silos.
- Tools/workflows: Secure enclaves or MPC for centroid updates; share only rotated summaries.
- Dependencies: Formal privacy guarantees; provenance and auditability; acceptable utility under rotations/sampling.
Specialized hardware co-design and kernels
- Sectors: semiconductors, cloud hardware
- What: PDX-friendly memory layouts and pruning kernels accelerated on NPUs/AMX/TPUs; ISA-level support for progressive pruning.
- Tools/products: Vendor-optimized SuperKMeans libraries; kernel fusion (GEMM + early-exit accumulation).
- Dependencies: Hardware vendor collaboration; compiler/runtime support; sustained software ecosystem adoption.
Distributed, billion-scale hierarchical SuperKMeans
- Sectors: web-scale search, foundation model infra
- What: Multi-node hierarchical training for 100M–1B vectors with recall-aware checkpoints.
- Tools/workflows: Data-parallel coarse quantizer training; elastic, failure-tolerant index builders.
- Dependencies: Partitioning strategies for communication minimization; cross-node BLAS scheduling; fault recovery.
Generalization of GEMM+PRUNING beyond k-means
- Sectors: ML platforms, analytics
- What: Apply Adaptive Sampling + progressive pruning to GMM/EM, k-medoids, RQ/OPQ codebook training, and other distance-heavy iterative methods.
- Tools/workflows: Shared pruning runtime with pluggable distance metrics; ablations across tasks.
- Dependencies: Theoretical guarantees per algorithm; metric-specific pruning thresholds; quality validation.
Standardized recall-aware early stopping across ecosystems
- Sectors: open-source IR/ML tooling
- What: ETR-like APIs in FAISS/cuVS/Scikit-learn and vector DBs, with ground-truth proxies and sampling guidance.
- Tools/products: “Recall meter” modules; benchmarks that report time-to-ETR instead of fixed-iteration timing.
- Dependencies: Agreement on proxy metrics and patience thresholds; dataset-specific calibration.
Energy-aware scheduling and policy guidance
- Sectors: cloud operations, sustainability policy
- What: Scheduling policies that route index builds to ARM/energy-efficient SKUs during off-peak; procurement guidelines favoring recall-aware early stopping.
- Tools/workflows: FinOps dashboards showing energy/time savings from SuperKMeans vs baselines; carbon reporting.
- Dependencies: Provider support for accurate energy metering; organizational policy adoption.

Notes on feasibility and transferability

Where it works best: high-dimensional embeddings (text/image/code) with large k (√N to 4√N), typical of IVF-based systems.
Where benefits narrow: small k (≤100) or low-dimensional datasets; pruning yields less advantage when centroids fit in L1/L2.
Metrics and distance: L2 and cosine (via normalization) are natural fits; other metrics may require adapted pruning thresholds.
ETR robustness: Needs a representative query set; otherwise, risk of premature stopping or overfitting to the proxy.
Sampling: 20–30% is a strong default, but verify cluster balance and “vectors explored” on your workload.
Engineering: Ensure BLAS is properly optimized for target CPUs; use dynamic scheduling for multi-threaded pruning; validate GPU memory limits for very large k.

View Paper Prompt View All Prompts

Glossary

ADSampling (Adaptive Sampling): A pruning technique that estimates distances using a small, adaptively chosen subset of dimensions to early-abort most comparisons. "Efficient pruning kernels are achieved by using Adaptive Sampling (ADSampling)~\cite{adsampling} and by keeping centroids in memory using the PDX layout~\cite{pdx}."
AMX: Apple’s matrix-multiplication coprocessor used to accelerate linear algebra on Apple Silicon. "Apple Accelerate uses AMX, a co-processor in Apple Silicon chips specialized in matrix multiplication operations~\cite{amxpaper, corsix_amx}."
AOCL-BLIS: AMD’s optimized implementation of the BLAS linear algebra API. "Some notable examples of BLAS implementations include IntelMKL for Intel CPUs, AOCL-BLIS for AMD CPUs, Apple Accelerate for Apple Silicon chips, and OpenBLAS, which can be compiled for various CPU microarchitectures."
angular distance: An angle-based distance metric (often derived from cosine similarity) used for measuring similarity between vectors. "Note: Scikit-Learn is not plotted in the datasets that need an angular distance, since this is an unsupported feature."
Apple Accelerate: Apple’s optimized math and signal-processing framework that provides BLAS and uses AMX on Apple Silicon. "Some notable examples of BLAS implementations include IntelMKL for Intel CPUs, AOCL-BLIS for AMD CPUs, Apple Accelerate for Apple Silicon chips, and OpenBLAS, which can be compiled for various CPU microarchitectures."
BLAS: A standardized API for basic linear algebra routines (vector and matrix operations) with highly optimized implementations. "BLAS libraries are accelerators for linear algebra operations tailored for modern CPUs."
cover trees: Hierarchical metric trees that enable nearest-neighbor search in general metric spaces. "Also, we shifted our focus away from methods that use auxiliary data structures for clustering (e.g., seeded search graphs~\cite{usearch, seededsearchgraphs}, SA-trees~\cite{satrees}, cover trees~\cite{covertrees}), as the costs associated with building these structures often exceed those of $k$ -means."
cuVS: A GPU-accelerated library for vector search (part of NVIDIA RAPIDS). "SuperKMeans' clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks."
DBSCAN: A density-based clustering algorithm that groups points with many close neighbors and marks outliers. "we will not cover other clustering algorithms, such as DBSCAN~\cite{dbscan} and HDBSCAN~\cite{hdbscan}, as they are not commonly used for building vector search indexes because they form clusters based on density rather than proximity to centroids."
Early Termination by Recall (ETR): An early-stopping rule for k-means that halts when retrieval recall no longer improves. "Finally, we introduce a novel early termination mechanism named Early Termination by Recall (ETR)."
Elkan's k-means: A k-means variant that uses triangle inequality bounds to prune distance computations. "One example is Elkan's $k$ -means variant, which achieves theoretical speedups through pruning but, in practice, is slower than a complete $X \boldsymbol{\cdot} Y$ GEMM for distance calculations."
FAISS: A library for efficient similarity search and clustering of dense vectors. "Despite this, the implementation of $k$ -means in modern vector systems like FAISS~\cite{faisspaper} has not evolved beyond vanilla $k$ -means (i.e., Lloyd's $k$ -means) with tuned hyperparameters."
GEMM (General Matrix Multiplication): The core matrix-matrix multiplication kernel used to accelerate distance computations via linear algebra. "This is because Lloyd's $k$ -means can leverage highly optimized General Matrix Multiplication (GEMM) routines to accelerate the pairwise distance calculation between data points and cluster centroids---the main bottleneck of $k$ -means."
HDBSCAN: A hierarchical extension of DBSCAN that discovers clusters of varying densities. "we will not cover other clustering algorithms, such as DBSCAN~\cite{dbscan} and HDBSCAN~\cite{hdbscan}, as they are not commonly used for building vector search indexes because they form clusters based on density rather than proximity to centroids."
hierarchical k-means: A recursive clustering strategy that builds a tree of centroids for scalable partitioning. "Furthermore, we combine SuperKMeans' acceleration with hierarchical $k$ -means~\cite{micronn, turbopuffer, kmeanselastic, jin2026curator, spfresh}."
intrinsic dimensionality: The effective dimensionality that governs distance behavior and concentration in data. "A higher intrinsic dimensionality also diminishes the benefits of pruning (e.g., GloVe)."
Inverted-Files (IVF): A partition-based index where vectors are assigned to clusters (lists), searching a subset of lists at query time. "This indexing method, commonly known as Inverted-Files (IVF)~\cite{pqivf, faisspaper}, is one of the most widely used indexes in vector systems \cite{faisspaper, micronn, vectordbs, milvus, quake}."
k-means++: A seeding strategy that initializes centroids to speed up convergence and improve quality. "and iii) that using $k$ -means++ initialization is detrimental for index quality."
Lloyd's algorithm: The standard iterative k-means procedure of assignment and centroid update. "The vanilla version of $k$ -means is known as Lloyd's algorithm~\cite{kmeans}."
OpenBLAS: An open-source, highly optimized BLAS implementation. "Some notable examples of BLAS implementations include IntelMKL for Intel CPUs, AOCL-BLIS for AMD CPUs, Apple Accelerate for Apple Silicon chips, and OpenBLAS, which can be compiled for various CPU microarchitectures."
PDX layout: A block-column-major data layout for centroids that improves cache access and pruning efficiency. "Efficient pruning kernels are achieved by using Adaptive Sampling (ADSampling)~\cite{adsampling} and by keeping centroids in memory using the PDX layout~\cite{pdx}."
PDXification: The transformation step that reorganizes centroids into the PDX layout. " $\tilde{Y} \gets PDXification(\tilde{Y})$ \;"
random rotation matrix: An orthogonal matrix used to rotate data so that information is more evenly spread across dimensions. " $R \gets GenerateRandomRotationMatrix(d)$ "
recall (VSS): The fraction of true nearest neighbors retrieved by an approximate search. "The quality of these retrieved answers is typically measured using the recall metric, which quantifies the proportion of retrieved vectors that are truly the closest to the query."
SA-trees: Spatial approximation trees used for efficient proximity search. "Also, we shifted our focus away from methods that use auxiliary data structures for clustering (e.g., seeded search graphs~\cite{usearch, seededsearchgraphs}, SA-trees~\cite{satrees}, cover trees~\cite{covertrees}), as the costs associated with building these structures often exceed those of $k$ -means."
seeded search graphs: Graph-based structures initialized with selected seeds to support fast approximate nearest-neighbor search. "Also, we shifted our focus away from methods that use auxiliary data structures for clustering (e.g., seeded search graphs~\cite{usearch, seededsearchgraphs}, SA-trees~\cite{satrees}, cover trees~\cite{covertrees}), as the costs associated with building these structures often exceed those of $k$ -means."
sgemm: The single-precision GEMM routine in BLAS (float32 matrix multiply). "For our purposes, we are only interested in the \codeword{sgemm} routine inside BLAS that does a \underline{s}ingle precision (\codeword{float32}) \underline{ge}neral \underline{m}atrix \underline{m}ultiply."
SIMD: Single Instruction, Multiple Data; CPU vectorization for processing multiple data elements per instruction. "efficient vectorization (SIMD), CPU/GPU-friendly algorithms, evaluation across CPU microarchitectures, comprehensive evaluation in high-dimensional embeddings datasets ( $d > 512$ ) using a high number of clusters ( $k > 10000$ ), and comparison against SOTA systems."
triangle inequality: A metric property leveraged to bound and prune distance computations. "ADSampling offers much higher, more effective pruning capabilities than the triangle inequality~\cite{pdx, adsampling, rabitq}."
vector embeddings: High-dimensional numeric representations of data (e.g., text, images) produced by models. "We present SuperKMeans: a $k$ -means variant designed for clustering collections of high-dimensional vector embeddings."
Vector similarity search (VSS): The task of finding the most similar vectors to a query under a chosen metric. "VSS consists of finding the vectors in a collection that are the most similar to a given query vector, based on a distance or similarity metric."
Voronoi cells: Regions of space where every point is closest to the same centroid; used to conceptualize cluster boundaries. "This phenomenon leads to densely populated boundaries of the Voronoi cells."
within-cluster sum of squares (WCSS): The sum of squared distances from points to their assigned centroids (also called inertia). "Note that both algorithms also yield a similar within-cluster sum of squares value (WCSS, also known as inertia)."

View Paper Prompt View All Prompts

Open Problems

We found no open problems mentioned in this paper.

Continue Learning

Collections

GitHub

GitHub - cwida/SuperKMeans: ⚡ Super fast clustering for high-dimensional vectors on CPUs (x86, ARM) and GPUs — for Python and C++. 100x faster clustering of vector embeddings than FAISS · GitHub (27 stars)

A Super Fast K-means for Indexing Vector Embeddings

Summary

SuperKMeans: Accelerating K-Means for High-Dimensional Vector Embedding Indexing

Introduction and Motivation

Algorithmic Contributions

Empirical Evaluation

Clustering Time and Quality

Scalability with k

Impact of Sampling and Iterations

Multithreading and Microarchitecture Portability

Initialization and Ablation Results

GPU Implementation

Hierarchical SuperKMeans

Theoretical and Practical Implications

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

A super-fast way to group big vectors for search

What this paper is about

What questions the paper tries to answer

Key ideas and how the method works (in simple terms)

A quick reminder: what is k-means?

The core trick: “peek early, skip most”

Stop early when search quality stops improving

Small extra improvements

What did they test and how?

Main results and why they matter

Why this matters (impact)

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Algorithmic guarantees and failure modes

Hyperparameters and auto-tuning

Early Termination by Recall (ETR)

Generality across metrics and data types

GPU-specific design and limits

Systems scalability and scheduling

Memory footprint and precision

Evaluation scope and methodology

Integration with broader indexing pipelines

Practical reproducibility and portability

Miscellaneous open directions

Practical Applications

Immediate Applications

Long-Term Applications

Notes on feasibility and transferability

Glossary

Open Problems

Continue Learning

Collections

GitHub

Tweets