Distributed UMAP via Sketch and Scale
- The paper introduces a distributed UMAP framework that leverages local Count Sketch summarization to reduce raw data transfer and computational costs.
- Count Sketch-based summarization aggregates local sketches to efficiently extract heavy hitters while preserving theoretical error bounds and embedding fidelity.
- The framework achieves a 5–10× runtime speedup and significantly lowers memory and communication demands, enabling scalable analytics on vast datasets.
Sketch and Scale (SnS) Distributed UMAP is a distributed framework designed to enable scalable, privacy-preserving, and communication-efficient dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP) across geographically distributed, high-dimensional datasets. The core mechanism leverages the Count Sketch data structure at edge nodes to generate summary statistics of local data without transferring raw points. These sketches are merged centrally to extract a compact set of representative samples, termed the "summary", on which standard UMAP is executed to produce a global low-dimensional embedding. The approach achieves linear time complexity in data size, logarithmic memory and communication requirements, and allows the embedding of datasets with hundreds of millions of points distributed over heterogeneous data centers (Wei et al., 2020).
1. System Architecture and Workflow
SnS orchestrates distributed dimensionality reduction through a two-layer architecture: $m$ edge nodes, each holding a local dataset $X_i$ of $n_i$ points, and a single master node. The raw high-dimensional data never leaves the edge nodes. The workflow comprises the following stages:
- Local Sketching: Each edge computes a Count Sketch $C_i$ of its data in $O(n_i t)$ time, where $n_i$ is the number of local data points and $t$ the sketch depth.
- Communication: The compact sketch (size $O(w t)$, with $w$ the sketch width) is sent to the master; the communication cost is independent of $n_i$.
- Aggregation: The master computes the global sketch $C = \sum_i C_i$ by summing the received sketches.
- Summarization: The top-$s$ heavy-hitter bins (those with the largest estimated frequencies) are extracted from $C$. For each, at least one original data point mapped to the bin is retrieved, forming the "summary" set $S$ with $|S| \approx s$ elements.
- UMAP Execution: Standard UMAP is run on $S$ to yield the final embedding.
This structure ensures that the communication and computation bottlenecks typically associated with distributed high-dimensional analytics are circumvented by sketch-based summarization.
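The staged workflow above can be sketched end to end in Python. This is a simplified illustration, not the paper's implementation: exact per-bin counts and a coarse grid quantizer stand in for the Count Sketch and the paper's binning scheme, and the helper names (`bin_id`, `edge_summarize`, `master_merge`) are hypothetical:

```python
from collections import Counter
import numpy as np

def bin_id(x, cell=1.0):
    # Hypothetical quantizer mapping a point to a discrete bin id; the
    # paper's actual binning (e.g., hash-based) may differ.
    return tuple(np.floor(np.asarray(x) / cell).astype(int))

def edge_summarize(points):
    # Stand-in for local sketching: exact bin counts plus one prototype
    # point per bin (the real system keeps only a compact Count Sketch).
    counts, protos = Counter(), {}
    for x in points:
        b = bin_id(x)
        counts[b] += 1
        protos.setdefault(b, x)
    return counts, protos

def master_merge(edge_results, s):
    # Merge per-edge results (sketch merging is just addition) and keep
    # one prototype for each of the top-s heavy-hitter bins.
    total, protos = Counter(), {}
    for counts, p in edge_results:
        total.update(counts)
        for b, x in p.items():
            protos.setdefault(b, x)
    return [protos[b] for b, _ in total.most_common(s)]

rng = np.random.default_rng(0)
edges = [rng.normal(loc=3 * i, scale=0.3, size=(500, 2)) for i in range(3)]
summary = master_merge([edge_summarize(e) for e in edges], s=10)
# `summary` would then feed a standard UMAP run on the master node.
```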
2. Count Sketch Data Structure and Theoretical Guarantees
The Count Sketch is a randomized linear hash-based structure that maintains frequency estimates for items in a large universe $U$. For an input stream with frequency vector $f \in \mathbb{R}^{|U|}$, the sketch is a $t \times w$ array $C$ defined by width $w$ and depth $t$, with $t$ independent hash pairs $(h_j, s_j)$, where $h_j : U \to [w]$ and $s_j : U \to \{-1, +1\}$. Updates increment $C[j, h_j(x)]$ by $s_j(x)$ for each item $x$.
Querying the estimate $\hat f_x$ for count $f_x$ is done via the median of $s_j(x)\,C[j, h_j(x)]$ over $j \in [t]$. With probability at least $1 - \delta$ (for $w = O(1/\varepsilon^2)$ and $t = O(\log(1/\delta))$), the error satisfies
$$|\hat f_x - f_x| \le \varepsilon \|f\|_2.$$
This error bound (Charikar, Chen, and Farach-Colton, 2002) is preserved in the merged global sketch since the Count Sketch is linear.
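A minimal Count Sketch matching the update and query rules above might look as follows. The `(a*x + b) mod p` hash family is an illustrative choice for integer item ids; a production implementation would use a stronger hash:

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch: t rows of width w, with independent
    bucket and sign hashes per row."""
    def __init__(self, w, t, seed=0):
        self.w, self.t = w, t
        self.p = (1 << 31) - 1                    # Mersenne prime modulus
        rng = np.random.default_rng(seed)
        self.ha = rng.integers(1, self.p, size=t)  # bucket hash h_j coeffs
        self.hb = rng.integers(0, self.p, size=t)
        self.sa = rng.integers(1, self.p, size=t)  # sign hash s_j coeffs
        self.sb = rng.integers(0, self.p, size=t)
        self.C = np.zeros((t, w), dtype=np.int64)
        self.rows = np.arange(t)

    def _h(self, x):                               # h_j(x) in [0, w)
        return (self.ha * x + self.hb) % self.p % self.w

    def _s(self, x):                               # s_j(x) in {-1, +1}
        return 1 - 2 * ((self.sa * x + self.sb) % self.p % 2)

    def update(self, x, c=1):                      # C[j, h_j(x)] += c * s_j(x)
        self.C[self.rows, self._h(x)] += c * self._s(x)

    def query(self, x):                            # median_j s_j(x) C[j, h_j(x)]
        return int(np.median(self._s(x) * self.C[self.rows, self._h(x)]))

cs = CountSketch(w=256, t=5, seed=1)
for _ in range(1000):
    cs.update(7)                 # one heavy item
for item in range(100, 300):
    cs.update(item)              # 200 light items
est = cs.query(7)                # should be close to 1000
```

The median over rows is what converts the per-row unbiased estimates into the high-probability error bound quoted above.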
3. Sketch Merging and Accuracy Preservation
The linearity property enables direct addition of sketches from individual edges: if each edge constructs $C_i$ for its local frequency vector $f_i$ (using shared hash seeds), the master computes $C = \sum_i C_i$, which is exactly the sketch of the global $f = \sum_i f_i$. All guarantees, including the per-item estimate error, are preserved under this summation. This facilitates scalability and compatibility with uncoordinated, asynchronous edge nodes.
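Linearity is easy to verify by viewing the sketch as an explicit linear map $C = Sf$ applied to the frequency vector. The dense-matrix construction below is purely expository (a real sketch never materializes this matrix):

```python
import numpy as np

def sketch_matrix(universe, w, t, seed=0):
    # Each row block j hashes item u to bucket h_j(u) with sign s_j(u);
    # stacking the t row blocks gives a (t*w) x |U| linear operator.
    rng = np.random.default_rng(seed)
    S = np.zeros((t * w, universe))
    for j in range(t):
        h = rng.integers(0, w, size=universe)
        sgn = rng.choice([-1, 1], size=universe)
        S[j * w + h, np.arange(universe)] = sgn
    return S

U, w, t = 1000, 64, 5
S = sketch_matrix(U, w, t, seed=42)
rng = np.random.default_rng(1)
f1, f2 = rng.poisson(3, U), rng.poisson(3, U)  # two edges' frequency vectors
merged = S @ f1 + S @ f2                        # sum of local sketches
direct = S @ (f1 + f2)                          # sketch of global frequencies
```

Because the map is linear, `merged` and `direct` are identical, which is exactly why uncoordinated edges can sketch independently.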
4. Heavy Hitter Extraction and Summary Formation
To select a representative subset, the master computes estimates $\hat f_b$ for each bin $b$ and selects those with $\hat f_b \ge \tau$, where the threshold $\tau$ is chosen so that approximately $s$ bins survive. Within each surviving bin $b$, at least one original point $x$ that hashed to $b$ is retrieved to serve as a prototype, yielding the summary $S$ of size $|S| \approx s$.
The Count Sketch error bound ensures that, except for a small set of false negatives controlled by $\varepsilon$ and $\delta$, the densest regions of the dataset are faithfully retained in the summary. The algorithmic steps are:
| Step | Description | Purpose |
|---|---|---|
| 1 | Compute $\hat f_b$ for all bins $b$ | Estimate bin frequencies |
| 2 | Retain bins with $\hat f_b \ge \tau$ | Identify heavy hitters |
| 3 | For each surviving $b$, retrieve an original $x$ mapped to $b$ | Construct summary $S$ |
This summarization reduces the subsequent computation and communication for UMAP to depend only on $s$, not on the global data size $N$.
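A toy version of the thresholding and prototype-retrieval steps, with exact counts standing in for the sketch estimates $\hat f_b$ (the skewed bin distribution and parameter values are illustrative):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
bins = rng.zipf(2.0, size=10_000).tolist()        # skewed bin occupancy
est = Counter(bins)                               # stand-in for \hat f_b

s = 20
tau = sorted(est.values(), reverse=True)[s - 1]   # s-th largest count
heavy = {b for b, c in est.items() if c >= tau}   # ties may admit extras

# Prototype retrieval: the first point observed in each surviving bin.
protos = {}
for i, b in enumerate(bins):
    if b in heavy:
        protos.setdefault(b, i)
summary_ids = list(protos.values())
```

Picking $\tau$ as the $s$-th largest estimate is one simple way to make "approximately $s$ bins survive"; ties can push the count slightly above $s$.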
5. UMAP Execution on the Summary
UMAP is applied to $S$ as a self-contained dataset. The core steps, matching the standard algorithm (McInnes, Healy, and Melville, 2018), are:
- Build a weighted $k$-nearest-neighbor graph on $S$, with per-point neighborhood scaling based on a user-specified neighbor count.
- Compute asymmetric fuzzy memberships and symmetrize them into a fuzzy simplicial set with weights $w_{ij}$.
- Randomly initialize the $d'$-dimensional embedding $\{y_i\}$ (typically $d' = 2$ or $3$).
- Optimize the cross-entropy-based objective
$$\mathcal{L} = \sum_{i<j} \left[\, w_{ij} \log \frac{w_{ij}}{q_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - q_{ij}} \,\right],$$
where $q_{ij} = \bigl(1 + a\,\|y_i - y_j\|_2^{2b}\bigr)^{-1}$, via stochastic gradient descent until convergence.
This stage induces low-dimensional representations for the summary points, mapping the densest regions of the original distributed dataset.
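The objective can be written out directly in NumPy for a small summary. The values of $a$ and $b$ below are illustrative (UMAP normally fits them from `min_dist`), and this dense all-pairs version is for exposition only, not the sampled SGD used in practice:

```python
import numpy as np

def umap_q(Y, a=1.577, b=0.895):
    # Low-dimensional similarity q_ij = (1 + a ||y_i - y_j||^{2b})^{-1}.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * d2 ** b)

def cross_entropy(W, Y, eps=1e-12):
    # Fuzzy cross-entropy between high-dim weights w_ij and low-dim
    # similarities q_ij, summed over distinct pairs i < j.
    Q = umap_q(Y)
    iu = np.triu_indices(len(Y), k=1)
    w, q = W[iu], Q[iu]
    return float(np.sum(w * np.log((w + eps) / (q + eps))
                        + (1 - w) * np.log((1 - w + eps) / (1 - q + eps))))

rng = np.random.default_rng(0)
n = 30
W = rng.uniform(0.05, 0.95, size=(n, n)); W = (W + W.T) / 2  # toy weights
Y = rng.normal(size=(n, 2))                                   # toy embedding
loss = cross_entropy(W, Y)
```

Each pair contributes a KL-style divergence between $w_{ij}$ and $q_{ij}$, so the loss is driven toward zero exactly when the low-dimensional similarities match the fuzzy graph weights.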
6. Computational and Communication Complexity
Let $N$ denote the global point count, $D$ the feature dimension, $m$ the number of edge nodes, and $s$ the summary size. The per-node computational cost for Count Sketch is $O(n_i t)$ for $n_i$ local points (with an extra $O(n_i D)$ if a random projection is applied first). Memory and communication per edge is $O(w t)$, independent of $n_i$. The master merges sketches in $O(m w t)$ time, scans at most $w t$ sketch cells for summary extraction, and UMAP on the $s$ summary points runs in time depending only on $s$.
Consequently, per-edge time scales linearly in the local data size, while per-edge communication scales only with the sketch size $O(w t)$; neither depends on the global count $N$. This ensures scalability to datasets of arbitrary size and distribution.
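A back-of-envelope comparison illustrates the communication savings. The per-edge data sizes and sketch parameters below are assumptions chosen for illustration, not the paper's settings:

```python
# Hypothetical per-edge shard and sketch configuration.
n_i, D = 25_000_000, 500            # points per edge, feature dimension
w, t = 2**16, 5                     # sketch width and depth

raw_bytes = n_i * D * 4             # shipping raw float32 features
sketch_bytes = w * t * 8            # shipping int64 sketch counters

ratio = raw_bytes / sketch_bytes    # compression factor from sketching
```

Under these assumptions the raw shard is tens of gigabytes while the sketch is a few megabytes, consistent with the orders of magnitude reported in the evaluation below.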
7. Empirical Evaluation and Outcomes
SnS was validated on two large-scale real-world collections: a cancer RNA-seq dataset with 75 million samples at 500 dimensions, and a Sloan Digital Sky Survey (SDSS) catalog with 100 million entries at 100 dimensions. Reported outcomes include:
- 5–10× speedup in end-to-end runtime compared to single-node and naïve distributed UMAP benchmarks.
- Memory usage per node drops from tens of gigabytes (GB) to a few hundred megabytes (MB).
- Communication per node remains under 100 MB, scaling logarithmically in $N$.
- Embedding quality, measured via KNN-preservation and trustworthiness, is within 2–5% of single-node UMAP operating on the full dataset.
These results indicate that SnS enables scalable, efficient, and distributed UMAP-based analytics across geo-distributed datasets, while preserving embedding fidelity and drastically reducing system requirements (Wei et al., 2020).