Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
110 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Coresets for Clustering with Missing Values (2106.16112v3)

Published 30 Jun 2021 in cs.DS

Abstract: We provide the first coreset for clustering points in $\mathbb{R}d$ that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like $k$-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an $\epsilon$-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that $(1+\epsilon)$-approximates the clustering objective for every possible center set. Our coresets for $k$-Means and $k$-Median clustering have size $(jk){O(\min(j,k))} (\epsilon{-1} d \log n)2$, where $n$ is the number of data points, $d$ is the dimension and $j$ is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for $k$-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyd's-style heuristic for $k$-Means with missing values.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Vladimir Braverman (99 papers)
  2. Shaofeng H. -C. Jiang (31 papers)
  3. Robert Krauthgamer (87 papers)
  4. Xuan Wu (59 papers)
Citations (16)

Summary

We haven't generated a summary for this paper yet.