UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (1802.03426v3)
Abstract: UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
Summary
- The paper introduces UMAP, a novel dimension reduction method leveraging concepts from Riemannian geometry and algebraic topology.
- It constructs fuzzy simplicial sets to model local structures and combines them into a global representation optimized via stochastic gradient descent.
- Experimental results show UMAP outperforms t-SNE with faster computation, improved scalability, and enhanced preservation of global data structure.
This paper introduces UMAP (Uniform Manifold Approximation and Projection), a dimension reduction technique designed for visualizing high-dimensional data and as a general-purpose dimension reduction tool for machine learning. It aims to improve upon existing methods like t-SNE by offering faster computation, better scalability, and potentially superior preservation of global data structure while maintaining high-quality local structure representation.
Theoretical Foundations:
UMAP is built upon concepts from Riemannian geometry and algebraic topology. The core idea is to model the data as being uniformly distributed on a potentially unknown Riemannian manifold.
- Manifold Approximation: It assumes the data lies on a manifold M with a Riemannian metric g. To handle non-uniform data distributions, UMAP locally adapts the metric at each point Xi such that the data appears locally uniform. This is achieved by scaling distances relative to the distance to the kth nearest neighbor (σi) and incorporating a shift based on the nearest neighbor distance (ρi) to ensure local connectivity.
- Fuzzy Topological Representation: These locally varying metric spaces are translated into fuzzy simplicial sets using tools from category theory (specifically, the adjoint functors FinReal and FinSing between categories of fuzzy simplicial sets and metric spaces). This captures the topological structure while retaining local metric information via fuzzy membership strengths.
- Combining Local Structures: The individual fuzzy simplicial sets corresponding to each point's local view are combined using a fuzzy set union (specifically, the probabilistic t-conorm) to form a single fuzzy topological representation of the high-dimensional data.
- Optimization: A similar fuzzy topological representation is constructed for the low-dimensional embedding Y. UMAP then optimizes the positions of points in Y to minimize the fuzzy set cross-entropy between the high-dimensional and low-dimensional topological representations, effectively matching the topological structures.
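To make the objective concrete, the fuzzy set cross-entropy minimized by UMAP can be written as follows (as in the paper, with A the shared edge set; edge-wise, μ(a) plays the role of the high-dimensional weight vij and ν(a) the low-dimensional weight wij):

```latex
C\big((A,\mu),(A,\nu)\big)
  = \sum_{a \in A} \Big[
      \mu(a)\,\log\frac{\mu(a)}{\nu(a)}
      \;+\; \big(1-\mu(a)\big)\,\log\frac{1-\mu(a)}{1-\nu(a)}
    \Big]
```

The first term generates attractive forces between points joined by high-weight edges, while the second term generates repulsive forces between points whose edges have low membership strength.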
Computational Implementation:
The theoretical framework translates into a practical algorithm:
- Graph Construction:
- For each point Xi, find its k nearest neighbors.
- Compute ρi (distance to the nearest neighbor) and σi (a scaling factor ensuring the sum of exponentiated scaled distances equals log2(k)).
- Construct a weighted directed graph where the edge weight from Xi to its neighbor Xij is w((xi, xij)) = exp(−max(0, d(xi, xij) − ρi) / σi). This represents the 1-skeleton of the local fuzzy simplicial set.
- Symmetrize the graph using the probabilistic t-conorm: B=A+A⊤−A∘A⊤, where A is the adjacency matrix and ∘ is element-wise multiplication. This results in the final weighted undirected graph G.
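A minimal sketch of this graph construction, assuming plain NumPy and brute-force k-NN for clarity (the actual implementation uses approximate nearest neighbor descent and a more careful smooth-kNN search):

```python
import numpy as np

def fuzzy_graph(X, k):
    """Build the symmetrized fuzzy 1-skeleton graph from data X (n x dim).

    Brute-force k-NN sketch; UMAP itself uses approximate nearest neighbors.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances and the k nearest neighbors of each point.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]          # exclude the point itself
    knn_d = np.take_along_axis(D, knn, axis=1)

    A = np.zeros((n, n))
    for i in range(n):
        rho = knn_d[i, 0]                            # distance to nearest neighbor
        # Binary search for sigma_i so the membership strengths sum to log2(k).
        target = np.log2(k)
        lo, hi = 1e-8, 1e3
        for _ in range(64):
            sigma = 0.5 * (lo + hi)
            s = np.sum(np.exp(-np.maximum(knn_d[i] - rho, 0.0) / sigma))
            if s > target:
                hi = sigma
            else:
                lo = sigma
        # Directed edge weights w(x_i, x_ij) = exp(-max(0, d - rho_i) / sigma_i).
        A[i, knn[i]] = np.exp(-np.maximum(knn_d[i] - rho, 0.0) / sigma)

    # Symmetrize with the probabilistic t-conorm: B = A + A^T - A * A^T.
    B = A + A.T - A * A.T
    return B
```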
- Graph Layout:
- Initialization: Initialize the low-dimensional embedding Y using spectral embedding on the graph Laplacian of G.
- Optimization: Use stochastic gradient descent (SGD) to minimize the cross-entropy between the high-dimensional graph G (weights vij = Bij) and a similar graph H constructed from the low-dimensional points Y (weights wij). The low-dimensional weights wij are defined by the smooth, differentiable function wij = (1 + a‖yi − yj‖₂^(2b))⁻¹, where a and b are fitted from the min_dist hyperparameter.
- Forces: The SGD updates correspond to applying attractive forces along edges (proportional to vij) and repulsive forces between point pairs (proportional to (1 − vij)). Negative sampling is used to approximate the repulsive forces efficiently.
Hyperparameters:
- n_neighbors (n): Controls the size of the local neighborhood, balancing local detail against global structure. Smaller n focuses on very local structure; larger n incorporates more global information.
- min_dist: Controls the minimum distance between points in the low-dimensional embedding, which affects how tightly points pack together. Primarily an aesthetic parameter for visualization.
- d: Target embedding dimension.
- n_epochs: Number of optimization iterations.
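For orientation, these hyperparameters map onto the umap-learn API roughly as follows (parameter names as exposed by that library; d corresponds to n_components):

```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighborhood (n)
    min_dist=0.1,      # minimum spacing of points in the embedding
    n_components=2,    # target embedding dimension d
    n_epochs=200,      # number of optimization iterations
    random_state=42,
)
embedding = reducer.fit_transform(digits.data)   # shape: (n_samples, 2)
```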
Evaluation and Performance:
- Qualitative: UMAP produces embeddings comparable to t-SNE for visualizing local structure (e.g., clusters) but often preserves more global structure, as seen in datasets like MNIST, Fashion-MNIST, and COIL-20.
- Quantitative: Measured by k-NN classifier accuracy on the embeddings, UMAP performs comparably to t-SNE and LargeVis for small k (local structure) and often outperforms them for larger k (more global structure).
- Stability: UMAP embeddings show significantly higher stability under data sub-sampling than t-SNE and LargeVis, measured by Procrustes distance between aligned embeddings (a sketch of such a measurement follows this list).
- Runtime: UMAP is significantly faster than t-SNE (including Barnes-Hut and FIt-SNE variants) and LargeVis across various datasets.
- Scalability:
- Embedding Dimension: UMAP scales much better than t-SNE to higher embedding dimensions (d>2).
- Ambient Dimension: UMAP handles very high ambient data dimensions effectively, often without requiring initial PCA reduction, unlike t-SNE/LargeVis.
- Number of Samples: UMAP scales efficiently to millions of data points, outperforming t-SNE variants in runtime, demonstrated on datasets up to 30 million points.
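One way such a sub-sampling stability measurement can be sketched, using scipy's Procrustes alignment on the points shared by two sub-samples (the paper's exact protocol, e.g. how sub-samples are drawn and paired, may differ):

```python
import numpy as np
from scipy.spatial import procrustes
import umap

def subsample_stability(X, frac=0.5, n_trials=5, seed=0):
    """Average Procrustes disparity between embeddings of overlapping sub-samples.

    Lower values indicate embeddings that are more stable under sub-sampling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    disparities = []
    for _ in range(n_trials):
        idx_a = rng.choice(n, size=int(frac * n), replace=False)
        idx_b = rng.choice(n, size=int(frac * n), replace=False)
        shared = np.intersect1d(idx_a, idx_b)        # points present in both sub-samples
        emb_a = umap.UMAP(random_state=0).fit_transform(X[idx_a])
        emb_b = umap.UMAP(random_state=0).fit_transform(X[idx_b])
        # Align the two embeddings on the shared points and record the disparity.
        pos_a = {i: r for r, i in enumerate(idx_a)}
        pos_b = {i: r for r, i in enumerate(idx_b)}
        _, _, disparity = procrustes(
            emb_a[[pos_a[i] for i in shared]],
            emb_b[[pos_b[i] for i in shared]],
        )
        disparities.append(disparity)
    return float(np.mean(disparities))
```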
Weaknesses:
- Like other non-linear methods, UMAP dimensions lack direct interpretability compared to PCA.
- It can potentially identify structure in random noise ("constellation effect"), especially with small datasets.
- Primarily focuses on local structure; methods like MDS might be better if global distance preservation is paramount.
- Global structure preservation depends significantly on the spectral embedding used for initialization.
- Performance on very small datasets (fewer than roughly 500 samples) can be degraded by the approximations used (approximate nearest neighbor search, negative sampling).
Future Work:
The paper suggests extensions including semi-supervised learning, handling heterogeneous data types, adding new points to existing embeddings, inverse transformations (generative models), metric learning, and improving robustness for small datasets.
Conclusion:
UMAP is presented as a powerful, fast, and scalable dimension reduction algorithm grounded in mathematical theory. It excels at visualizing complex datasets, preserving both local and global structure effectively, and serves as a viable general-purpose dimension reduction tool for various machine learning tasks, demonstrating significant performance advantages over previous state-of-the-art methods like t-SNE.
Related Papers
- Clustering with UMAP: Why and How Connectivity Matters (2021)
- Uniform Manifold Approximation and Projection (UMAP) and its Variants: Tutorial and Survey (2021)
- Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization (2020)
- Parametric UMAP embeddings for representation and semi-supervised learning (2020)
- Stop Misusing t-SNE and UMAP for Visual Analytics (2025)