
Distributed UMAP via Sketch and Scale

Updated 11 January 2026
  • The paper introduces a distributed UMAP framework that leverages local Count Sketch summarization to reduce raw data transfer and computational costs.
  • Count Sketch-based summarization aggregates local sketches to efficiently extract heavy hitters while preserving theoretical error bounds and embedding fidelity.
  • The framework achieves a 5–10× runtime speedup and significantly lowers memory and communication demands, enabling scalable analytics on vast datasets.

Sketch and Scale (SnS) Distributed UMAP is a distributed framework designed to enable scalable, privacy-preserving, and communication-efficient dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) across geographically distributed, high-dimensional datasets. The core mechanism leverages the Count Sketch data structure at edge nodes to generate summary statistics of local data without transferring raw points. These sketches are merged centrally to extract a compact set of representative samples, termed the "summary", on which standard UMAP is executed to produce a global low-dimensional embedding. The approach achieves linear time complexity in data size, logarithmic memory and communication requirements, and allows the embedding of datasets with hundreds of millions of points distributed over heterogeneous data centers (Wei et al., 2020).

1. System Architecture and Workflow

SnS orchestrates distributed dimensionality reduction through a two-layer architecture: multiple edge nodes, each holding a local dataset $\mathcal{D}_i = \{x_{i,1}, \ldots, x_{i,n_i}\} \subset \mathbb{R}^d$, and a single master node. The raw high-dimensional data never leaves the edge nodes. The workflow comprises the following stages:

  1. Local Sketching: Each edge computes a Count Sketch of its data in $O(n_i t)$ time, where $n_i$ is the number of local data points and $t$ the sketch depth.
  2. Communication: The compact sketch $S_i$ (size $O(1/\varepsilon^2 \log 1/\delta)$) is sent to the master; the communication cost is independent of $n_i$.
  3. Aggregation: The master computes the global sketch $S = \sum_i S_i$ by summing the received sketches.
  4. Summarization: The top-$m$ heavy-hitter bins (those with the largest estimated frequencies) are extracted from $S$. For each, at least one original data point mapped to the bin is retrieved, forming the "summary" set $\mathcal{S}$ with $m \ll \sum_i n_i$ elements.
  5. UMAP Execution: Standard UMAP is run on $\mathcal{S}$ to yield the final embedding.

This structure ensures that the communication and computation bottlenecks typically associated with distributed high-dimensional analytics are circumvented by sketch-based summarization.
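The five stages above can be sketched end to end. This is a minimal illustration, not the paper's implementation: an exact `Counter` stands in for the Count Sketch to keep the example short, and the grid-quantization binning (`quantize`) is a hypothetical stand-in for the hashing scheme.

```python
# Hedged sketch of the SnS workflow (stages 1-5), with an exact Counter
# standing in for the Count Sketch so the example stays short. Points are
# quantized to grid cells; the cell id plays the role of the hash bucket.
from collections import Counter

def quantize(x, cell=1.0):
    """Map a point to a discrete bin id (hypothetical binning scheme)."""
    return tuple(int(v // cell) for v in x)

def local_sketch(points):
    """Stage 1: each edge summarizes its data; raw points stay local."""
    return Counter(quantize(x) for x in points)

def merge(sketches):
    """Stages 2-3: the master sums the (linear) sketches."""
    total = Counter()
    for s in sketches:
        total += s
    return total

def summarize(global_sketch, edges, m):
    """Stage 4: keep the top-m heavy bins, fetch one prototype point each."""
    heavy = [b for b, _ in global_sketch.most_common(m)]
    summary = []
    for b in heavy:
        for points in edges:  # ask edges for one original point per bin
            hit = next((x for x in points if quantize(x) == b), None)
            if hit is not None:
                summary.append(hit)
                break
    return summary

edge_a = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1)]
edge_b = [(0.4, 0.4), (5.3, 5.4), (9.9, 9.9)]
sketches = [local_sketch(edge_a), local_sketch(edge_b)]
summary = summarize(merge(sketches), [edge_a, edge_b], m=2)
# Stage 5 would run standard UMAP on `summary` instead of all points.
```

The two dense cells (around the origin and around (5, 5)) survive into the summary; the isolated point at (9.9, 9.9) is dropped, mirroring how heavy-hitter extraction favors dense regions.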

2. Count Sketch Data Structure and Theoretical Guarantees

The Count Sketch is a randomized linear hash-based structure that maintains frequency estimates for items in a large universe $\{1, \ldots, N\}$. For input stream frequencies $f \in \mathbb{R}^N$, the sketch $C \in \mathbb{R}^{t \times w}$ is defined by width $w = \lceil c / \varepsilon^2 \rceil$ and depth $t = \lceil \ln(1/\delta) \rceil$, with $t$ independent hash pairs $(h_r, g_r)$. Each update for item $j$ increments $C_{r, h_r(j)}$ by $g_r(j)$ in every row $r$.

The estimate $\widehat{f}_j$ of a count $f_j$ is obtained as the median of $g_r(j) \, C_{r, h_r(j)}$ over the rows $r$. With probability at least $1 - \delta$, the error satisfies

$$|\widehat{f}_j - f_j| \leq \varepsilon \|f\|_2.$$

This error bound (Charikar–Chen–Farach-Colton, 2002) is preserved in the merged global sketch since Count Sketch is linear.
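The structure above can be implemented in a few dozen lines. This is an illustrative sketch, not the paper's code: the keyed `blake2b` hashing is an assumption chosen so that every node deterministically builds the same $(h_r, g_r)$ pairs, and the width constant $c = e$ is one common choice.

```python
# Minimal Count Sketch (Charikar et al., 2002): t independent (h_r, g_r)
# hash pairs over a table of width w; a query returns the median of the
# signed counter reads. Deterministic blake2b hashing is an illustrative
# choice; a production sketch would use pairwise-independent hash families.
import hashlib
import math
from statistics import median

def _hash64(seed, salt, item):
    """Deterministic 64-bit hash so all edges share the same hash pairs."""
    digest = hashlib.blake2b(f"{seed}:{salt}:{item}".encode(), digest_size=8)
    return int.from_bytes(digest.digest(), "big")

class CountSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps ** 2)    # width c/eps^2, with c = e
        self.t = math.ceil(math.log(1 / delta))  # depth ln(1/delta)
        self.seed = seed
        self.C = [[0] * self.w for _ in range(self.t)]

    def _bucket(self, item, r):                  # h_r(item) in {0, ..., w-1}
        return _hash64(self.seed + r, "h", item) % self.w

    def _sign(self, item, r):                    # g_r(item) in {-1, +1}
        return 1 if _hash64(self.seed + r, "g", item) % 2 else -1

    def update(self, item, count=1):
        for r in range(self.t):
            self.C[r][self._bucket(item, r)] += self._sign(item, r) * count

    def query(self, item):
        return median(self._sign(item, r) * self.C[r][self._bucket(item, r)]
                      for r in range(self.t))

cs = CountSketch(eps=0.1, delta=0.01)
for item, freq in [("a", 1000), ("b", 500), ("c", 3)]:
    cs.update(item, freq)
```

Querying `cs.query("a")` then returns an estimate within roughly $\varepsilon \|f\|_2$ of the true count 1000, while the table itself holds only $w \cdot t$ counters regardless of stream length.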

3. Sketch Merging and Accuracy Preservation

The linearity property enables direct addition of sketches from individual edges: if each edge constructs $C^{(i)}$ for its local frequency vector $f^{(i)}$, the master computes $C = \sum_i C^{(i)}$, which is exactly the sketch of the global vector $f = \sum_i f^{(i)}$. All $(\varepsilon, \delta)$ guarantees, including the per-item estimation error, are preserved under this summation. This facilitates scalability and compatibility with uncoordinated, asynchronous edge nodes.
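Linearity can be checked directly: sketching each edge's stream and summing the tables yields the same table as sketching the concatenated stream. The single-row sketch and `blake2b`-based hash functions below are illustrative simplifications; the fixed hash functions model the shared seed that SnS requires across edges.

```python
# Linearity check: sum of per-edge sketch tables == sketch of the union.
# A single-row Count Sketch for brevity; bucket/sign functions are fixed,
# modeling the shared hash seed that all edges must agree on.
import hashlib

W = 64  # table width

def _h64(salt, item):
    d = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8)
    return int.from_bytes(d.digest(), "big")

def bucket(item):            # h(item) in {0, ..., W-1}
    return _h64("h", item) % W

def sign(item):              # g(item) in {-1, +1}
    return 1 if _h64("g", item) % 2 else -1

def sketch(stream):
    C = [0] * W
    for item in stream:
        C[bucket(item)] += sign(item)
    return C

edge1 = ["a", "a", "b"]
edge2 = ["b", "c", "a"]
# Element-wise addition of the two tables...
merged = [x + y for x, y in zip(sketch(edge1), sketch(edge2))]
# ...equals sketching the concatenated stream: C = sum_i C^(i).
```

Because the sketch is a linear map of the frequency vector, this holds for any number of edges and any interleaving of updates, which is what permits asynchronous, uncoordinated merging at the master.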

4. Heavy Hitter Extraction and Summary Formation

To select a representative subset, the master computes estimates $\widehat{f}_j$ for each bin and selects those with $\widehat{f}_j \geq \tau$, where $\tau$ is chosen so that approximately $m$ bins survive. Within each surviving bin $j$, at least one original point $x \in \mathcal{D}$ with $h(x) = j$ is retrieved to serve as a prototype, yielding the summary $\mathcal{S}$ of size $m$.

The Count Sketch error bound ensures that, except for a small set of false negatives controlled by $\varepsilon$, the densest regions of the dataset are faithfully retained in the summary. The algorithmic steps are:

| Step | Description | Purpose |
|------|-------------|---------|
| 1 | Compute $\widehat{f}_j$ for all $j = 1, \ldots, N$ | Estimate bin frequencies |
| 2 | Retain bins with $\widehat{f}_j \geq \tau$ | Identify heavy hitters |
| 3 | For each retained bin, extract an original $x \in \mathcal{D}$ | Construct summary $\mathcal{S}$ |

This summarization reduces the cost of all subsequent UMAP computation and communication to depend only on $m \ll n$.
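One simple way to pick $\tau$ so that roughly $m$ bins survive is to set it to the $m$-th largest estimated frequency. The bin estimates below are hypothetical values for illustration.

```python
# Choosing tau as the m-th largest estimated frequency so that roughly
# m bins survive the threshold (hypothetical bin estimates).
import heapq

est = {0: 120.0, 1: 3.0, 2: 98.0, 3: 1.5, 4: 45.0, 5: 0.2, 6: 77.0}

def heavy_hitters(estimates, m):
    """Return the bins with the m largest estimates, and the threshold used."""
    tau = heapq.nlargest(m, estimates.values())[-1]  # m-th largest value
    return {j for j, f in estimates.items() if f >= tau}, tau

bins, tau = heavy_hitters(est, m=3)
# bins == {0, 2, 6}, tau == 77.0
```

With exact estimates this keeps exactly the top $m$ bins; with sketched estimates, ties and estimation error near $\tau$ can let slightly more or fewer bins through, which is the $\varepsilon$-controlled false-negative set mentioned above.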

5. UMAP Execution on the Summary

UMAP is applied to $\mathcal{S}$ as a self-contained dataset. The core steps, matching the standard algorithm (McInnes, Healy & Melville, 2018), are:

  • Build a weighted $k$-nearest-neighbor graph on $\mathcal{S}$, with per-point neighborhood scale $\sigma_i$ set from a user-specified neighbor count.
  • Compute asymmetric fuzzy memberships and symmetrize them into a fuzzy simplicial set with weights $w_{ij}$.
  • Randomly initialize the $p$-dimensional embedding $Y$.
  • Minimize the cross-entropy-based objective

$$L = -\sum_{(i,j) \in E} \left[ w_{ij} \log \sigma(\|y_i - y_j\|) + (1 - w_{ij}) \log\bigl(1 - \sigma(\|y_i - y_j\|)\bigr) \right],$$

where $\sigma(t) = 1 / (1 + a t^{2b})$, via stochastic gradient descent until convergence.

This stage induces low-dimensional representations for the summary points, mapping the densest regions of the original distributed dataset.
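A single SGD update on one edge of the graph illustrates the optimization above. This is a hedged sketch, not the umap-learn implementation: it fixes $a = b = 1$ so the similarity is $\sigma(d) = 1/(1 + d^2)$, and the learning rate and `eps` regularizer are illustrative choices.

```python
# One SGD step on the cross-entropy objective for a single pair (yi, yj),
# with a = b = 1 so sigma(d) = 1/(1 + d^2). The attract/repel coefficients
# are the gradients of w*log(sigma) and (1-w)*log(1-sigma) w.r.t. yi.
def sgd_pair_step(yi, yj, w, lr=0.1, eps=1e-6):
    """Move yi toward yj with weight w, away with weight (1 - w)."""
    d2 = sum((a - b) ** 2 for a, b in zip(yi, yj))
    attract = -2.0 * w / (1.0 + d2)                       # from w * log(sigma)
    repel = 2.0 * (1.0 - w) / ((d2 + eps) * (1.0 + d2))   # from (1-w) * log(1-sigma)
    coeff = lr * (attract + repel)
    return [a + coeff * (a - b) for a, b in zip(yi, yj)]

# A strong edge (w = 1) pulls yi toward yj; w = 0 pushes it away.
closer = sgd_pair_step([0.0, 0.0], [1.0, 0.0], w=1.0)
apart = sgd_pair_step([0.0, 0.0], [1.0, 0.0], w=0.0)
```

In the full algorithm this update is applied over many epochs, with repulsive updates drawn by negative sampling rather than over all non-edges.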

6. Computational and Communication Complexity

Let $n = \sum_i n_i$ denote the global point count, $d$ the feature dimension, $k$ the number of edge nodes, and $m$ the summary size. The per-node computational cost for Count Sketch is $O(n_i t)$ (with an extra $O(n_i d)$ if random projection is used). Memory and communication per edge is $O(1/\varepsilon^2 \log 1/\delta)$, independent of $n_i$. The master merges the $k$ sketches in $O(kwt)$ time, scans up to $O(n)$ bins for summary extraction, and UMAP on $\mathcal{S}$ completes in $O(m \log m + \text{epochs} \cdot m K)$, where $K$ is the UMAP neighbor count.

Consequently, both time and communication per edge scale as $O(n_i + 1/\varepsilon^2 \log 1/\delta)$, with no dependence on the global size $n$. This ensures scalability to datasets of arbitrary size and distribution.
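A back-of-the-envelope calculation makes these bounds concrete. The parameter values below ($n_i$, $d$, $\varepsilon$, $\delta$, and the unit leading constant) are illustrative assumptions, not figures from the paper.

```python
# Illustrative sizes for the bounds above: an edge holding 10 million
# 500-dimensional float32 points versus its sketch at eps = 0.01,
# delta = 1e-3 (leading constants taken as 1 for simplicity).
import math

n_i, d = 10_000_000, 500
raw_bytes = n_i * d * 4                     # raw data: 20 GB, grows with n_i
w = math.ceil(1 / 0.01 ** 2)                # width 1/eps^2 = 10,000
t = math.ceil(math.log(1 / 1e-3))           # depth ln(1/delta) = 7
sketch_bytes = w * t * 4                    # 280 KB, independent of n_i
```

The sketch stays a fixed few hundred kilobytes however large $n_i$ grows, which is why per-edge communication is flat while raw-data transfer would scale linearly.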

7. Empirical Evaluation and Outcomes

SnS was validated on two large-scale real-world collections: a cancer RNA-seq dataset with approximately 75 million samples at 500 dimensions, and a Sloan Digital Sky Survey (SDSS) catalog with approximately 100 million entries at 100 dimensions. Reported outcomes include:

  • 5–10× speedup in end-to-end runtime compared to single-node and naïve distributed UMAP baselines.
  • Memory usage per node drops from tens of gigabytes to a few hundred megabytes.
  • Communication per node remains under 100 MB, scaling logarithmically in $n$.
  • Embedding quality, measured via KNN preservation and trustworthiness, is within 2–5% of single-node UMAP run on the full dataset.

These results indicate that SnS enables scalable, efficient, and distributed UMAP-based analytics across geo-distributed datasets, while preserving embedding fidelity and drastically reducing system requirements (Wei et al., 2020).
