UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (1802.03426v3)
Abstract: UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.
Summary
- The paper introduces UMAP, a novel dimension reduction method leveraging concepts from Riemannian geometry and algebraic topology.
- It constructs fuzzy simplicial sets to model local structures and combines them into a global representation optimized via stochastic gradient descent.
- Experimental results show UMAP outperforms t-SNE with faster computation, improved scalability, and enhanced preservation of global data structure.
This paper introduces UMAP (Uniform Manifold Approximation and Projection), a dimension reduction technique designed for visualizing high-dimensional data and as a general-purpose dimension reduction tool for machine learning. It aims to improve upon existing methods like t-SNE by offering faster computation, better scalability, and potentially superior preservation of global data structure while maintaining high-quality local structure representation.
Theoretical Foundations:
UMAP is built upon concepts from Riemannian geometry and algebraic topology. The core idea is to model the data as being uniformly distributed on a potentially unknown Riemannian manifold.
- Manifold Approximation: It assumes the data lies on a manifold M with a Riemannian metric g. To handle non-uniform data distributions, UMAP locally adapts the metric at each point Xi such that the data appears locally uniform. This is achieved by scaling distances relative to the distance to the kth nearest neighbor (σi) and incorporating a shift based on the nearest neighbor distance (ρi) to ensure local connectivity.
- Fuzzy Topological Representation: These locally varying metric spaces are translated into fuzzy simplicial sets using tools from category theory (specifically, the adjoint functors FinReal and FinSing between categories of fuzzy simplicial sets and metric spaces). This captures the topological structure while retaining local metric information via fuzzy membership strengths.
- Combining Local Structures: The individual fuzzy simplicial sets corresponding to each point's local view are combined using a fuzzy set union (specifically, the probabilistic t-conorm) to form a single fuzzy topological representation of the high-dimensional data.
- Optimization: A similar fuzzy topological representation is constructed for the low-dimensional embedding Y. UMAP then optimizes the positions of points in Y to minimize the fuzzy set cross-entropy between the high-dimensional and low-dimensional topological representations, effectively matching the topological structures.
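To make the objective concrete, the fuzzy set cross-entropy minimized by UMAP can be written as follows (as in the paper, with A the shared edge set; edge-wise, μ(a) plays the role of the high-dimensional weight vij and ν(a) the low-dimensional weight wij):

```latex
C\big((A,\mu),(A,\nu)\big)
  = \sum_{a \in A} \Big[
      \mu(a)\,\log\frac{\mu(a)}{\nu(a)}
      \;+\; \big(1-\mu(a)\big)\,\log\frac{1-\mu(a)}{1-\nu(a)}
    \Big]
```

The first term generates attractive forces between points joined by high-weight edges, while the second term generates repulsive forces between points whose edges have low membership strength.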
Computational Implementation:
The theoretical framework translates into a practical algorithm:
- Graph Construction:
- For each point Xi, find its k nearest neighbors.
- Compute ρi (distance to the nearest neighbor) and σi (a scaling factor ensuring the sum of exponentiated scaled distances equals log2(k)).
- Construct a weighted directed graph where the edge weight from Xi to its neighbor Xij is w((xi, xij)) = exp(−max(0, d(xi, xij) − ρi) / σi). This represents the 1-skeleton of the local fuzzy simplicial set.
- Symmetrize the graph using the probabilistic t-conorm: B=A+A⊤−A∘A⊤, where A is the adjacency matrix and ∘ is element-wise multiplication. This results in the final weighted undirected graph G.
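A minimal sketch of this graph construction, assuming plain NumPy and brute-force k-NN for clarity (the actual implementation uses approximate nearest neighbor descent and a more careful smooth-kNN search):

```python
import numpy as np

def fuzzy_graph(X, k):
    """Build the symmetrized fuzzy 1-skeleton graph from data X (n x dim).

    Brute-force k-NN sketch; UMAP itself uses approximate nearest neighbors.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances and the k nearest neighbors of each point.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]          # exclude the point itself
    knn_d = np.take_along_axis(D, knn, axis=1)

    A = np.zeros((n, n))
    for i in range(n):
        rho = knn_d[i, 0]                            # distance to nearest neighbor
        # Binary search for sigma_i so the membership strengths sum to log2(k).
        target = np.log2(k)
        lo, hi = 1e-8, 1e3
        for _ in range(64):
            sigma = 0.5 * (lo + hi)
            s = np.sum(np.exp(-np.maximum(knn_d[i] - rho, 0.0) / sigma))
            if s > target:
                hi = sigma
            else:
                lo = sigma
        # Directed edge weights w(x_i, x_ij) = exp(-max(0, d - rho_i) / sigma_i).
        A[i, knn[i]] = np.exp(-np.maximum(knn_d[i] - rho, 0.0) / sigma)

    # Symmetrize with the probabilistic t-conorm: B = A + A^T - A * A^T.
    B = A + A.T - A * A.T
    return B
```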
- Graph Layout:
- Initialization: Initialize the low-dimensional embedding Y using spectral embedding on the graph Laplacian of G.
- Optimization: Use stochastic gradient descent (SGD) to minimize the cross-entropy between the high-dimensional graph G (weights vij = Bij) and a similar graph H constructed from the low-dimensional points Y (weights wij). The low-dimensional weights wij are defined by the smooth, differentiable function wij = (1 + a‖yi − yj‖₂^(2b))⁻¹, where a and b are fitted from the min_dist hyperparameter.
- Forces: The SGD updates correspond to applying attractive forces along edges (proportional to vij) and repulsive forces between point pairs (proportional to (1 − vij)). Negative sampling is used to approximate the repulsive forces efficiently.
Hyperparameters:
- n_neighbors (n): Controls the size of the local neighborhood, balancing local detail against global structure. Smaller n focuses on very local structure; larger n incorporates more global information.
- min_dist: Controls the minimum distance between points in the low-dimensional embedding, which affects how tightly points pack together. Primarily an aesthetic parameter for visualization.
- d: Target embedding dimension.
- n_epochs: Number of optimization iterations.
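For orientation, these hyperparameters map onto the umap-learn API roughly as follows (parameter names as exposed by that library; d corresponds to n_components):

```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighborhood (n)
    min_dist=0.1,      # minimum spacing of points in the embedding
    n_components=2,    # target embedding dimension d
    n_epochs=200,      # number of optimization iterations
    random_state=42,
)
embedding = reducer.fit_transform(digits.data)   # shape: (n_samples, 2)
```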
Evaluation and Performance:
- Qualitative: UMAP produces embeddings comparable to t-SNE for visualizing local structure (e.g., clusters) but often preserves more global structure, as seen in datasets like MNIST, Fashion-MNIST, and COIL-20.
- Quantitative: Measured by k-NN classifier accuracy on the embeddings, UMAP performs comparably to t-SNE and LargeVis for small k (local structure) and often outperforms them for larger k (more global structure).
- Stability: UMAP embeddings show significantly higher stability under data sub-sampling than t-SNE and LargeVis, measured by Procrustes distance between aligned embeddings (a sketch of such a measurement follows this list).
- Runtime: UMAP is significantly faster than t-SNE (including Barnes-Hut and FIt-SNE variants) and LargeVis across various datasets.
- Scalability:
- Embedding Dimension: UMAP scales much better than t-SNE to higher embedding dimensions (d>2).
- Ambient Dimension: UMAP handles very high ambient data dimensions effectively, often without requiring initial PCA reduction, unlike t-SNE/LargeVis.
- Number of Samples: UMAP scales efficiently to millions of data points, outperforming t-SNE variants in runtime, demonstrated on datasets up to 30 million points.
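One way such a sub-sampling stability measurement can be sketched, using scipy's Procrustes alignment on the points shared by two sub-samples (the paper's exact protocol, e.g. how sub-samples are drawn and paired, may differ):

```python
import numpy as np
from scipy.spatial import procrustes
import umap

def subsample_stability(X, frac=0.5, n_trials=5, seed=0):
    """Average Procrustes disparity between embeddings of overlapping sub-samples.

    Lower values indicate embeddings that are more stable under sub-sampling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    disparities = []
    for _ in range(n_trials):
        idx_a = rng.choice(n, size=int(frac * n), replace=False)
        idx_b = rng.choice(n, size=int(frac * n), replace=False)
        shared = np.intersect1d(idx_a, idx_b)        # points present in both sub-samples
        emb_a = umap.UMAP(random_state=0).fit_transform(X[idx_a])
        emb_b = umap.UMAP(random_state=0).fit_transform(X[idx_b])
        # Align the two embeddings on the shared points and record the disparity.
        pos_a = {i: r for r, i in enumerate(idx_a)}
        pos_b = {i: r for r, i in enumerate(idx_b)}
        _, _, disparity = procrustes(
            emb_a[[pos_a[i] for i in shared]],
            emb_b[[pos_b[i] for i in shared]],
        )
        disparities.append(disparity)
    return float(np.mean(disparities))
```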
Weaknesses:
- Like other non-linear methods, UMAP dimensions lack direct interpretability compared to PCA.
- It can potentially identify structure in random noise ("constellation effect"), especially with small datasets.
- Primarily focuses on local structure; methods like MDS might be better if global distance preservation is paramount.
- Global structure preservation depends significantly on the spectral embedding used for initialization.
- Performance on very small datasets (fewer than roughly 500 samples) can be degraded by the approximations used (approximate nearest neighbor search, negative sampling).
Future Work:
The paper suggests extensions including semi-supervised learning, handling heterogeneous data types, adding new points to existing embeddings, inverse transformations (generative models), metric learning, and improving robustness for small datasets.
Conclusion:
UMAP is presented as a powerful, fast, and scalable dimension reduction algorithm grounded in mathematical theory. It excels at visualizing complex datasets, preserving both local and global structure effectively, and serves as a viable general-purpose dimension reduction tool for various machine learning tasks, demonstrating significant performance advantages over previous state-of-the-art methods like t-SNE.
Related Papers
- Clustering with UMAP: Why and How Connectivity Matters (2021)
- Uniform Manifold Approximation and Projection (UMAP) and its Variants: Tutorial and Survey (2021)
- Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization (2020)
- Parametric UMAP embeddings for representation and semi-supervised learning (2020)
- Stop Misusing t-SNE and UMAP for Visual Analytics (2025)