Approximate UMAP (aUMAP): Scalable Manifold Embeddings
- Approximate UMAP (aUMAP) is a family of scalable algorithms that extend UMAP by rapidly projecting new data points while preserving topological and geometric fidelity.
- The approach utilizes methods like k-NN weighted interpolation, Count Sketch-based compression, and neural mapping to reduce computation and communication overhead.
- Empirical benchmarks demonstrate sub-millisecond projection times and high preservation of local structure, making it suitable for real-time, streaming, and distributed applications.
Approximate UMAP (aUMAP) is a family of communication- and computation-efficient variants of Uniform Manifold Approximation and Projection (UMAP), designed for scalable, real-time, or distributed dimensionality reduction and data visualization. These algorithms maintain the topological and geometric characteristics achieved by UMAP while dramatically reducing the computational or communication cost during the projection of new data points (“out-of-sample” embeddings), making them suitable for large-scale, streaming, or distributed scenarios (Wassenaar et al., 2024, Wei et al., 2020, Ben-Ari et al., 2025).
1. Motivation and Overview
Standard UMAP constructs low-dimensional embeddings by building a high-dimensional $k$-nearest-neighbors (k-NN) graph, forming a fuzzy simplicial set of probabilistic affinities, and optimizing the low-dimensional representation via stochastic gradient descent (SGD) with a specific cross-entropy loss. While effective for manifold learning on moderately sized data, standard UMAP is computationally intensive both for fitting and, critically, for projecting new points into an existing embedding. Approximate UMAP (aUMAP) denotes algorithms that accelerate this out-of-sample projection by substituting or augmenting UMAP's optimization step with analytical or sketch-based methods, achieving drastically lower projection latency or communication overhead, especially in streaming, distributed, or online visualization settings (Wassenaar et al., 2024, Wei et al., 2020).
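For reference, the optimization step that aUMAP variants bypass or replace is UMAP's fuzzy cross-entropy between the high-dimensional memberships $\mu_{ij}$ and the low-dimensional similarities $\nu_{ij}$; in its standard form,

$$C = \sum_{i \neq j} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + \left(1 - \mu_{ij}\right) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right], \qquad \nu_{ij} = \left(1 + a \lVert y_i - y_j \rVert_2^{2b}\right)^{-1},$$

where $a$ and $b$ are constants fit from UMAP's min_dist parameter.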
2. Core aUMAP Algorithms
Several distinct approaches fall under the aUMAP umbrella. Representative families include:
- k-NN-based Regression (real-time projection): Trains a standard UMAP embedding on a reference set, then fits a $k$-NN model (e.g., KD-tree, Ball-tree) in the original space. Projection of a novel point $x^*$ is performed as a weighted interpolation of the reference embeddings:

$$y^* = \sum_{i=1}^{k} w_i\, y_i, \qquad w_i = \frac{1/d_i}{\sum_{j=1}^{k} 1/d_j},$$

where $d_i$ are the distances from $x^*$ to its $k$ nearest training points and $y_i$ are their low-dimensional coordinates (Wassenaar et al., 2024).
- Sketch and Scale (SnS) Distributed UMAP: Designed for geo-distributed settings, SnS compresses local k-NN graphs using a Count Sketch data structure. Each edge device computes local k-NN graphs, encodes weighted edges into an (s × ℓ) Count Sketch, and transmits the sketch to an aggregator. The server reconstructs “heavy hitter” edges (dominant neighborhood relationships) and feeds the recovered sparse fuzzy graph into a standard UMAP optimizer, yielding a high-fidelity, communication-light global embedding (Wei et al., 2020).
- NUMAP (neural parametric UMAP via SpectralNet/GrEASE): Replaces spectral initialization and/or fine-tuning with scalable, trainable deep networks, supporting generalization and fast out-of-sample embeddings; it is distinct from k-NN-based aUMAP (Ben-Ari et al., 2025).
3. Algorithmic Pipeline and Mathematical Formulation
aUMAP via k-NN-Weighted Interpolation
- Training: Identical to standard UMAP. Fit UMAP on reference data $X = \{x_i\}_{i=1}^{n}$, producing the low-dimensional embedding $Y = \{y_i\}_{i=1}^{n}$, and fit a $k$-NN index on $X$.
- Projection:
- Given a new point $x^*$, query its $k$ nearest reference points and their distances $d_1, \dots, d_k$.
- Compute weights as above (inverse distance, normalized).
- Set $y^* = \sum_{i=1}^{k} w_i y_i$; return $y^*$ as the embedding for $x^*$.
- Complexity: Per-point projection is $O(k \log n)$ for the neighbor query plus $O(k)$ for the interpolation; no SGD is required, giving sub-millisecond latency (Wassenaar et al., 2024).
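A minimal sketch of this projector in Python, using umap-learn and scikit-learn; the data, the helper name `aumap_project`, and the parameter defaults are illustrative assumptions, not code from Wassenaar et al. (2024):

```python
import numpy as np
from sklearn.neighbors import KDTree
import umap  # umap-learn

# Training: fit standard UMAP on a reference set (placeholder data here).
X_ref = np.random.rand(1000, 16)
reducer = umap.UMAP(n_neighbors=15).fit(X_ref)
Y_ref = reducer.embedding_            # low-dimensional coordinates y_i
tree = KDTree(X_ref)                  # k-NN index in the original space

def aumap_project(X_new, k=15, eps=1e-12):
    """Hypothetical helper: inverse-distance-weighted interpolation of Y_ref."""
    d, idx = tree.query(np.atleast_2d(X_new), k=k)  # distances d_i, neighbor ids
    w = 1.0 / (d + eps)                             # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)               # normalize: sum_i w_i = 1
    return np.einsum('nk,nkd->nd', w, Y_ref[idx])   # y* = sum_i w_i y_i

y_star = aumap_project(np.random.rand(5, 16))       # five out-of-sample projections
```

Because projection is a pure k-NN query plus a weighted average, it involves no gradient steps and runs entirely on CPU.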
Sketch and Scale (SnS) aUMAP
- Local (Edge) Node:
- Compute the local k-NN graph and UMAP membership weights $\mu_{ij} = \exp\!\big(-\max(0,\, d(x_i, x_j) - \rho_i)/\sigma_i\big)$.
- Encode each weighted edge $(i, j, \mu_{ij})$ into an $(s \times \ell)$ Count Sketch $S_m$.
- Transmit the sketch to the central server.
- Server:
- Aggregate sketches by linearity: $S = \sum_{m=1}^{M} S_m$.
- Extract heavy edges; reconstruct the sparse fuzzy global graph.
- Feed the graph into the UMAP optimizer.
- Key equations (standard Count Sketch): for edge $e$ with weight $w_e$, each row $r$ performs the update $S[r, h_r(e)] \mathrel{+}= g_r(e)\, w_e$, where $h_r$ hashes edges into $\{1, \dots, \ell\}$ and $g_r(e) \in \{\pm 1\}$ is a sign hash; edge weights are then recovered as $\hat{w}_e = \operatorname{median}_r\, g_r(e)\, S[r, h_r(e)]$.
- Complexity: Edge node: $O(n_m k)$ sketch updates with $O(s\ell)$ memory and communication, independent of the raw edge count. Server: aggregates the $M$ sketches in $O(M s \ell)$ time, followed by heavy-hitter decoding over the sketch (Wei et al., 2020).
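The snippet below is a minimal, illustrative Count Sketch over weighted edges that mirrors the update and median-estimate equations above; the class, the hashing scheme, and the parameter values are assumptions for exposition, not the implementation of Wei et al. (2020):

```python
import numpy as np

class CountSketch:
    """Illustrative Count Sketch for streams of (edge, weight) pairs."""

    def __init__(self, depth=4, width=2**14, seed=0):
        rng = np.random.default_rng(seed)
        self.depth, self.width = depth, width
        self.table = np.zeros((depth, width))
        # One salt per row; Python's hash() stands in for pairwise-independent
        # hashes (not stable across processes, so this is illustration only).
        self.salts = [int(s) for s in rng.integers(1, 2**31 - 1, size=depth)]

    def _hash(self, edge, salt):
        h = hash((edge, salt))
        return abs(h) % self.width, (1 if h % 2 == 0 else -1)

    def update(self, edge, weight):
        for r, salt in enumerate(self.salts):
            col, sign = self._hash(edge, salt)
            self.table[r, col] += sign * weight     # S[r, h_r(e)] += g_r(e) * w_e

    def estimate(self, edge):
        vals = []
        for r, salt in enumerate(self.salts):
            col, sign = self._hash(edge, salt)
            vals.append(sign * self.table[r, col])  # g_r(e) * S[r, h_r(e)]
        return float(np.median(vals))               # median over rows de-biases collisions

    def merge(self, other):
        self.table += other.table                   # linearity: S = sum_m S_m
        return self
```

Each edge node would call `update` for its local weighted edges and transmit only `table` ($s \times \ell$ floats); the server `merge`s all incoming sketches and calls `estimate` on candidate edges to recover the heavy hitters.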
Summary Table: aUMAP Variants
| aUMAP Variant | Projection Mechanism | Main Use Case |
|---|---|---|
| k-NN Interpolation | Weighted neighbor regression | Real-time streams |
| SnS–Count Sketch | L0-compressed global k-NN graph | Geo-distributed data |
| NUMAP | Neural OOSE parametric mapping | Generalization, latency |
4. Theoretical and Practical Performance
Fidelity and Trade-offs
Empirical evaluations indicate that k-NN-based aUMAP matches the geometric “shape” of standard UMAP embeddings, with mean out-of-sample distortion generally much less than one standard deviation ($0.08$–$0.26\sigma$ on common datasets; see Section 6), while dropping per-point projection time from tens of milliseconds to sub-millisecond, without sacrificing training speed or requiring neural infrastructure (Wassenaar et al., 2024). SnS aUMAP exhibits near-identical trustworthiness and cluster separation to UMAP up to 50–100 million points, while reducing communication per node by several orders of magnitude (Wei et al., 2020).
Complexity and Scaling
aUMAP projection (k-NN) costs $O(k \log n)$ per point, enabling use in high-throughput streaming or embedded environments, as opposed to UMAP's iterative per-point SGD refinement. For SnS, linear scaling in the number of nodes and only $O(s\ell)$ communication per edge device allows billion-point, globally partitioned datasets to be embedded centrally.
Limitations
aUMAP may introduce sharper outliers for points lying near class or decision boundaries, as it lacks optimization-based neighborhood correction. In such boundary cases, neural or re-optimized parametric methods (e.g., pUMAP or NUMAP) may offer improved local fidelity, at higher infrastructure or training cost.
5. Implementation Guidelines and Parameter Choices
- k-NN Interpolation: Use the same $k$ as UMAP's n_neighbors (default $15$). The distance metric and feature normalization must match those used during fitting.
- Sketch and Scale:
- Sketch depth $s$: $3$–$5$ rows suffice for low estimator variance.
- Sketch width $\ell$: $5k$–$20k$ (e.g., for $k = 15$, $\ell = 75$–$300$) achieves nearly complete edge recovery.
- Heavy-edge budget $b$: $nk$ or $1.2\,nk$; this ensures all major edges are retained.
- Edge node partitioning: By data locality; communication and runtime are nearly linear in the number of nodes, up to hardware/network bottlenecks.
- Batch query optimization: For large $k$ or high query volume, tune the KD-tree leaf size to balance speed and neighbor accuracy (Wassenaar et al., 2024); see the configuration sketch below.
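A brief configuration sketch tying the k-NN guidelines together (the values follow the defaults above; `X_ref` is placeholder data):

```python
import numpy as np
from sklearn.neighbors import KDTree
import umap  # umap-learn

X_ref = np.random.rand(10_000, 32)   # placeholder reference data

# Keep k, metric, and normalization identical between fitting and projection.
K = 15
reducer = umap.UMAP(n_neighbors=K, metric="euclidean").fit(X_ref)

# Larger leaf_size speeds tree construction and bulk queries at some cost in
# per-query pruning; tune empirically against the target stream rate.
tree = KDTree(X_ref, leaf_size=40, metric="euclidean")
dist, idx = tree.query(X_ref[:500], k=K)   # batched neighbor lookups
```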
6. Empirical Benchmarking and Use Cases
k-NN aUMAP Benchmarks
| Dataset | Projection (500 pts) | Fidelity (mean dist., in $\sigma$) |
|---|---|---|
| Iris | 0.08 s (0.16 ms/pt) | 0.256 |
| Digits | 0.10 s (0.20 ms/pt) | 0.083 |
| Breast Cancer | 0.10 s (0.20 ms/pt) | 0.126 |
SnS-UMAP Distributed Evaluation
- Cancer tissue ($8$M points): trustworthiness within $0.005$ of standard UMAP at the $8$M scale, with communication per node of only $8$ MB.
- SDSS-DR12 ($\sim 100$M points): $2$ hr runtime, per-node communication at the gigabyte scale, and cluster fidelity comparable to small-scale UMAP.
Typical applications: real-time BCI feedback, rapid neural latent-space visualization, geo-distributed scientific data analysis (Wassenaar et al., 2024, Wei et al., 2020).
7. Comparative Landscape and Related Variants
aUMAP should be distinguished from:
- Parametric UMAP (pUMAP): Uses neural networks for out-of-sample extension (OOSE) and matches local structure, but does not retain UMAP's spectral initialization, yielding poorer global consistency (Ben-Ari et al., 2025).
- NUMAP (Neural UMAP via GrEASE): Achieves parametric UMAP with spectral consistency and analytical eigenvector separation; it supports scalable, generalizable embeddings, but with increased implementation complexity and training time (Ben-Ari et al., 2025).
- Standard UMAP: Remains preferred for offline settings where projection speed is not a bottleneck.
Trade-offs: For lowest latency with minimal infrastructure, aUMAP (k-NN) offers the best projection speed on CPU, at the cost of occasional outlier artifacts. For maximal fidelity and a trainable parametric extension, NUMAP is optimal but requires significant additional computation (Wassenaar et al., 2024, Ben-Ari et al., 2025).
Approximate UMAP encompasses a set of scalable, efficient algorithms for accelerating high-dimensional manifold learning and visualization, adapting UMAP to large-scale, streaming, and distributed settings without forfeiting the fidelity of the embedding for most scientific and machine learning applications (Wassenaar et al., 2024, Wei et al., 2020, Ben-Ari et al., 2025).