Distributed UMAP via Sketch and Scale
- The paper introduces a distributed UMAP framework that leverages local Count Sketch summarization to reduce raw data transfer and computational costs.
- Count Sketch-based summarization aggregates local sketches to efficiently extract heavy hitters while preserving theoretical error bounds and embedding fidelity.
- The framework achieves a 5–10× runtime speedup and significantly lowers memory and communication demands, enabling scalable analytics on vast datasets.
Sketch and Scale (SnS) Distributed UMAP is a distributed framework designed to enable scalable, privacy-preserving, and communication-efficient dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) across geographically distributed, high-dimensional datasets. The core mechanism leverages the Count Sketch data structure at edge nodes to generate summary statistics of local data without transferring raw points. These sketches are merged centrally to extract a compact set of representative samples, termed the "summary", on which standard UMAP is executed to produce a global low-dimensional embedding. The approach achieves linear time complexity in data size, logarithmic memory and communication requirements, and allows the embedding of datasets with hundreds of millions of points distributed over heterogeneous data centers (Wei et al., 2020).
1. System Architecture and Workflow
SnS orchestrates distributed dimensionality reduction through a two-layer architecture: $m$ edge nodes, each holding a local dataset $X_i$ of $n_i$ points, and a single master node. The raw high-dimensional data never leaves the edge nodes. The workflow comprises the following stages:
- Local Sketching: Each edge computes a Count Sketch $C_i$ of its data in $O(n_i t)$ time, where $n_i$ is the number of local data points and $t$ the sketch depth.
- Communication: The compact sketch (size $O(w t)$, with $w$ the sketch width) is sent to the master; the communication cost is independent of $n_i$.
- Aggregation: The master computes the global sketch $C = \sum_i C_i$ by summing the received sketches.
- Summarization: The top-$s$ heavy-hitter bins (those with the largest estimated frequencies) are extracted from $C$. For each, at least one original data point mapped to the bin is retrieved, forming the "summary" set $S$ with $|S| \approx s$ elements.
- UMAP Execution: Standard UMAP is run on $S$ to yield the final embedding.
This structure ensures that the communication and computation bottlenecks typically associated with distributed high-dimensional analytics are circumvented by sketch-based summarization.
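The staged workflow above can be sketched end to end in Python. This is a simplified illustration, not the paper's implementation: exact per-bin counts and a coarse grid quantizer stand in for the Count Sketch and the paper's binning scheme, and the helper names (`bin_id`, `edge_summarize`, `master_merge`) are hypothetical:

```python
from collections import Counter
import numpy as np

def bin_id(x, cell=1.0):
    # Hypothetical quantizer mapping a point to a discrete bin id; the
    # paper's actual binning (e.g., hash-based) may differ.
    return tuple(np.floor(np.asarray(x) / cell).astype(int))

def edge_summarize(points):
    # Stand-in for local sketching: exact bin counts plus one prototype
    # point per bin (the real system keeps only a compact Count Sketch).
    counts, protos = Counter(), {}
    for x in points:
        b = bin_id(x)
        counts[b] += 1
        protos.setdefault(b, x)
    return counts, protos

def master_merge(edge_results, s):
    # Merge per-edge results (sketch merging is just addition) and keep
    # one prototype for each of the top-s heavy-hitter bins.
    total, protos = Counter(), {}
    for counts, p in edge_results:
        total.update(counts)
        for b, x in p.items():
            protos.setdefault(b, x)
    return [protos[b] for b, _ in total.most_common(s)]

rng = np.random.default_rng(0)
edges = [rng.normal(loc=3 * i, scale=0.3, size=(500, 2)) for i in range(3)]
summary = master_merge([edge_summarize(e) for e in edges], s=10)
# `summary` would then feed a standard UMAP run on the master node.
```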
2. Count Sketch Data Structure and Theoretical Guarantees
The Count Sketch is a randomized linear hash-based structure that maintains frequency estimates for items in a large universe $U$. For an input stream with frequency vector $f \in \mathbb{R}^{|U|}$, the sketch is a $t \times w$ array $C$ defined by width $w$ and depth $t$, with $t$ independent hash pairs $(h_j, s_j)$, where $h_j : U \to [w]$ and $s_j : U \to \{-1, +1\}$. Updates increment $C[j, h_j(x)]$ by $s_j(x)$ for each item $x$.
Querying the estimate $\hat f_x$ for count $f_x$ is done via the median of $s_j(x)\,C[j, h_j(x)]$ over $j \in [t]$. With probability at least $1 - \delta$ (for $w = O(1/\varepsilon^2)$ and $t = O(\log(1/\delta))$), the error satisfies
$$|\hat f_x - f_x| \le \varepsilon \|f\|_2.$$
This error bound (Charikar, Chen, and Farach-Colton, 2002) is preserved in the merged global sketch since the Count Sketch is linear.
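A minimal Count Sketch matching the update and query rules above might look as follows. The `(a*x + b) mod p` hash family is an illustrative choice for integer item ids; a production implementation would use a stronger hash:

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch: t rows of width w, with independent
    bucket and sign hashes per row."""
    def __init__(self, w, t, seed=0):
        self.w, self.t = w, t
        self.p = (1 << 31) - 1                    # Mersenne prime modulus
        rng = np.random.default_rng(seed)
        self.ha = rng.integers(1, self.p, size=t)  # bucket hash h_j coeffs
        self.hb = rng.integers(0, self.p, size=t)
        self.sa = rng.integers(1, self.p, size=t)  # sign hash s_j coeffs
        self.sb = rng.integers(0, self.p, size=t)
        self.C = np.zeros((t, w), dtype=np.int64)
        self.rows = np.arange(t)

    def _h(self, x):                               # h_j(x) in [0, w)
        return (self.ha * x + self.hb) % self.p % self.w

    def _s(self, x):                               # s_j(x) in {-1, +1}
        return 1 - 2 * ((self.sa * x + self.sb) % self.p % 2)

    def update(self, x, c=1):                      # C[j, h_j(x)] += c * s_j(x)
        self.C[self.rows, self._h(x)] += c * self._s(x)

    def query(self, x):                            # median_j s_j(x) C[j, h_j(x)]
        return int(np.median(self._s(x) * self.C[self.rows, self._h(x)]))

cs = CountSketch(w=256, t=5, seed=1)
for _ in range(1000):
    cs.update(7)                 # one heavy item
for item in range(100, 300):
    cs.update(item)              # 200 light items
est = cs.query(7)                # should be close to 1000
```

The median over rows is what converts the per-row unbiased estimates into the high-probability error bound quoted above.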
3. Sketch Merging and Accuracy Preservation
The linearity property enables direct addition of sketches from individual edges: if each edge constructs $C_i$ for its local frequency vector $f_i$ (using shared hash seeds), the master computes $C = \sum_i C_i$, which is exactly the sketch of the global $f = \sum_i f_i$. All guarantees, including the per-item estimate error, are preserved under this summation. This facilitates scalability and compatibility with uncoordinated, asynchronous edge nodes.
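Linearity is easy to verify by viewing the sketch as an explicit linear map $C = Sf$ applied to the frequency vector. The dense-matrix construction below is purely expository (a real sketch never materializes this matrix):

```python
import numpy as np

def sketch_matrix(universe, w, t, seed=0):
    # Each row block j hashes item u to bucket h_j(u) with sign s_j(u);
    # stacking the t row blocks gives a (t*w) x |U| linear operator.
    rng = np.random.default_rng(seed)
    S = np.zeros((t * w, universe))
    for j in range(t):
        h = rng.integers(0, w, size=universe)
        sgn = rng.choice([-1, 1], size=universe)
        S[j * w + h, np.arange(universe)] = sgn
    return S

U, w, t = 1000, 64, 5
S = sketch_matrix(U, w, t, seed=42)
rng = np.random.default_rng(1)
f1, f2 = rng.poisson(3, U), rng.poisson(3, U)  # two edges' frequency vectors
merged = S @ f1 + S @ f2                        # sum of local sketches
direct = S @ (f1 + f2)                          # sketch of global frequencies
```

Because the map is linear, `merged` and `direct` are identical, which is exactly why uncoordinated edges can sketch independently.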
4. Heavy Hitter Extraction and Summary Formation
To select a representative subset, the master computes estimates $\hat f_b$ for each bin $b$ and selects those with $\hat f_b \ge \tau$, where the threshold $\tau$ is chosen so that approximately $s$ bins survive. Within each surviving bin $b$, at least one original point $x$ that hashed to $b$ is retrieved to serve as a prototype, yielding the summary $S$ of size $|S| \approx s$.
The Count Sketch error bound ensures that, except for a small set of false negatives controlled by $\varepsilon$ and $\delta$, the densest regions of the dataset are faithfully retained in the summary. The algorithmic steps are:
| Step | Description | Purpose |
|---|---|---|
| 1 | Compute $\hat f_b$ for all bins $b$ | Estimate bin frequencies |
| 2 | Retain bins with $\hat f_b \ge \tau$ | Identify heavy hitters |
| 3 | For each surviving $b$, retrieve an original $x$ mapped to $b$ | Construct summary $S$ |
This summarization reduces the subsequent computation and communication for UMAP to depend only on $s$, not on the global data size $N$.
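A toy version of the thresholding and prototype-retrieval steps, with exact counts standing in for the sketch estimates $\hat f_b$ (the skewed bin distribution and parameter values are illustrative):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
bins = rng.zipf(2.0, size=10_000).tolist()        # skewed bin occupancy
est = Counter(bins)                               # stand-in for \hat f_b

s = 20
tau = sorted(est.values(), reverse=True)[s - 1]   # s-th largest count
heavy = {b for b, c in est.items() if c >= tau}   # ties may admit extras

# Prototype retrieval: the first point observed in each surviving bin.
protos = {}
for i, b in enumerate(bins):
    if b in heavy:
        protos.setdefault(b, i)
summary_ids = list(protos.values())
```

Picking $\tau$ as the $s$-th largest estimate is one simple way to make "approximately $s$ bins survive"; ties can push the count slightly above $s$.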
5. UMAP Execution on the Summary
UMAP is applied to $S$ as a self-contained dataset. The core steps, matching the standard algorithm (McInnes, Healy, and Melville, 2018), are:
- Build a weighted $k$-nearest-neighbor graph on $S$, with per-point neighborhood scaling based on a user-specified neighbor count.
- Compute asymmetric fuzzy memberships and symmetrize them into a fuzzy simplicial set with weights $w_{ij}$.
- Randomly initialize the $d'$-dimensional embedding $\{y_i\}$ (typically $d' = 2$ or $3$).
- Optimize the cross-entropy-based objective
$$\mathcal{L} = \sum_{i<j} \left[\, w_{ij} \log \frac{w_{ij}}{q_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - q_{ij}} \,\right],$$
where $q_{ij} = \bigl(1 + a\,\|y_i - y_j\|_2^{2b}\bigr)^{-1}$, via stochastic gradient descent until convergence.
This stage induces low-dimensional representations for the summary points, mapping the densest regions of the original distributed dataset.
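The objective can be written out directly in NumPy for a small summary. The values of $a$ and $b$ below are illustrative (UMAP normally fits them from `min_dist`), and this dense all-pairs version is for exposition only, not the sampled SGD used in practice:

```python
import numpy as np

def umap_q(Y, a=1.577, b=0.895):
    # Low-dimensional similarity q_ij = (1 + a ||y_i - y_j||^{2b})^{-1}.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

def cross_entropy(W, Y, eps=1e-12):
    # Fuzzy cross-entropy between high-dim weights w_ij and low-dim
    # similarities q_ij, summed over distinct pairs i < j.
    Q = umap_q(Y)
    iu = np.triu_indices(len(Y), k=1)
    w, q = W[iu], Q[iu]
    return float(np.sum(w * np.log((w + eps) / (q + eps))
                        + (1 - w) * np.log((1 - w + eps) / (1 - q + eps))))

rng = np.random.default_rng(0)
n = 30
W = rng.uniform(0.05, 0.95, size=(n, n)); W = (W + W.T) / 2  # toy weights
Y = rng.normal(size=(n, 2))                                   # toy embedding
loss = cross_entropy(W, Y)
```

Each pair contributes a KL-style divergence between $w_{ij}$ and $q_{ij}$, so the loss is driven toward zero exactly when the low-dimensional similarities match the fuzzy graph weights.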
6. Computational and Communication Complexity
Let $N$ denote the global point count, $D$ the feature dimension, $m$ the number of edge nodes, and $s$ the summary size. The per-node computational cost for Count Sketch is $O(n_i t)$ for $n_i$ local points (with an extra $O(n_i D)$ if a random projection is applied first). Memory and communication per edge is $O(w t)$, independent of $n_i$. The master merges sketches in $O(m w t)$ time, scans at most $w t$ sketch cells for summary extraction, and UMAP on the $s$ summary points runs in time depending only on $s$.
Consequently, per-edge time scales linearly in the local data size, while per-edge communication scales only with the sketch size $O(w t)$; neither depends on the global count $N$. This ensures scalability to datasets of arbitrary size and distribution.
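A back-of-envelope comparison illustrates the communication savings. The per-edge data sizes and sketch parameters below are assumptions chosen for illustration, not the paper's settings:

```python
# Hypothetical per-edge shard and sketch configuration.
n_i, D = 25_000_000, 500            # points per edge, feature dimension
w, t = 2**16, 5                     # sketch width and depth

raw_bytes = n_i * D * 4             # shipping raw float32 features
sketch_bytes = w * t * 8            # shipping int64 sketch counters

ratio = raw_bytes / sketch_bytes    # compression factor from sketching
```

Under these assumptions the raw shard is tens of gigabytes while the sketch is a few megabytes, consistent with the orders of magnitude reported in the evaluation below.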
7. Empirical Evaluation and Outcomes
SnS was validated on two large-scale real-world collections: a cancer RNA-seq dataset with 75 million samples at 500 dimensions, and a Sloan Digital Sky Survey (SDSS) catalog with 100 million entries at 100 dimensions. Reported outcomes include:
- 5–10× speedup in end-to-end runtime compared to single-node and naïve distributed UMAP benchmarks.
- Memory usage per node drops from tens of gigabytes (GB) to a few hundred megabytes (MB).
- Communication per node remains under 100 MB, scaling logarithmically in $N$.
- Embedding quality, measured via KNN-preservation and trustworthiness, is within 2–5% of single-node UMAP operating on the full dataset.
These results indicate that SnS enables scalable, efficient, and distributed UMAP-based analytics across geo-distributed datasets, while preserving embedding fidelity and drastically reducing system requirements (Wei et al., 2020).