
Moderate Dimension Reduction for $k$-Center Clustering (2312.01391v5)

Published 3 Dec 2023 in cs.DS

Abstract: The Johnson-Lindenstrauss (JL) Lemma introduced the concept of dimension reduction via a random linear map, which has become a fundamental technique in many computational settings. For a set of $n$ points in $\mathbb{R}^d$ and any fixed $\epsilon>0$, it reduces the dimension $d$ to $O(\log n)$ while preserving, with high probability, all the pairwise Euclidean distances within factor $1+\epsilon$. Perhaps surprisingly, the target dimension can be lower if one only wishes to preserve the optimal value of a certain problem on the pointset, e.g., Euclidean max-cut or $k$-means. However, for some notorious problems, like diameter (aka furthest pair), dimension reduction via the JL map to below $O(\log n)$ does not preserve the optimal value within factor $1+\epsilon$. We propose to focus on another regime, of \emph{moderate dimension reduction}, where a problem's value is preserved within factor $\alpha>1$ using target dimension $\tfrac{\log n}{\mathrm{poly}(\alpha)}$. We establish the viability of this approach and show that the famous $k$-center problem is $\alpha$-approximated when reducing to dimension $O(\tfrac{\log n}{\alpha^2}+\log k)$. Along the way, we address the diameter problem via the special case $k=1$. Our result extends to several important variants of $k$-center (with outliers, capacities, or fairness constraints), and the bound improves further with the input's doubling dimension. While our $\mathrm{poly}(\alpha)$-factor improvement in the dimension may seem small, it actually has significant implications for streaming algorithms, and easily yields an algorithm for $k$-center in dynamic geometric streams that achieves $O(\alpha)$-approximation using space $\mathrm{poly}(k\, d\, n^{1/\alpha^2})$. This is the first algorithm to beat $O(n)$ space in high dimension $d$, as all previous algorithms require space at least $\exp(d)$. Furthermore, it extends to the $k$-center variants mentioned above.
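The special case $k=1$ (diameter, i.e., the furthest pair) gives a concrete feel for the setup: project the points through a random Gaussian map, the standard JL construction, and check how much the furthest-pair distance distorts. The sketch below is illustrative only and is not the paper's algorithm or its parameter choices; all function names are hypothetical.

```python
import math
import random

def jl_project(points, target_dim, seed=0):
    """Map points from R^d to R^target_dim via a random Gaussian matrix
    with entries scaled by 1/sqrt(target_dim) (the standard JL map)."""
    rng = random.Random(seed)
    d = len(points[0])
    G = [[rng.gauss(0, 1) / math.sqrt(target_dim) for _ in range(d)]
         for _ in range(target_dim)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in G]
            for p in points]

def diameter(points):
    """Largest pairwise Euclidean distance (the k=1 case of k-center)."""
    return max(math.dist(p, q)
               for i, p in enumerate(points) for q in points[i + 1:])

# Toy example: 50 random points in R^100, projected down to 20 dimensions.
random.seed(1)
pts = [[random.gauss(0, 1) for _ in range(100)] for _ in range(50)]
proj = jl_project(pts, target_dim=20)
ratio = diameter(proj) / diameter(pts)
print(f"diameter distortion: {ratio:.2f}")  # typically close to 1
```

The point of the moderate regime is that even when `target_dim` is pushed well below the $O(\log n / \epsilon^2)$ needed for $1+\epsilon$ distortion, the diameter (and more generally the $k$-center value) is still preserved up to a factor $\alpha$ governed by the trade-off above.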

