
Differentially Private Clustering in Data Streams (2307.07449v2)

Published 14 Jul 2023 in cs.DS, cs.CR, and cs.LG

Abstract: The streaming model is an abstraction of computing over massive data streams, which is a popular way of dealing with large-scale modern data analysis. In this model, data points arrive in a stream, one after the other. A streaming algorithm is allowed only one pass over the data stream, and the goal is to perform some analysis during the stream while using as little space as possible. Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been extensively studied in the past. However, since data privacy has become a central concern in many real-world applications, non-private clustering algorithms are not applicable in many scenarios. In this work, we provide the first differentially private streaming algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$, using $\mathrm{poly}(k,d,\log(T))$ space to achieve a constant multiplicative error and a $\mathrm{poly}(k,d,\log(T))$ additive error. In particular, we present a differentially private streaming clustering framework which only requires an offline DP coreset or clustering algorithm as a black box. By plugging in existing results from DP clustering [Ghazi, Kumar, Manurangsi 2020] and [Kaplan, Stemmer 2018], we achieve (1) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(\mathrm{poly}(k,d,\log(T)))$ space for any $\gamma>0$ and a $\mathrm{poly}(k,d,\log(T))$ additive error, or (2) an $O(1)$-multiplicative approximation with $\tilde{O}(k^{1.5} \cdot \mathrm{poly}(d,\log(T)))$ space and a $\mathrm{poly}(k,d,\log(T))$ additive error. In addition, our algorithmic framework is also differentially private under the continual release setting, i.e., the union of the outputs of our algorithms at every timestamp is always differentially private.
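
To make the error terms in the abstract concrete: under the usual reading, a set of centers $C$ has multiplicative error $\alpha$ and additive error $\beta$ if its clustering cost is at most $\alpha \cdot \mathrm{OPT} + \beta$, where the cost sums, over all points, the distance to the nearest center raised to the power $z$ ($z=2$ for $k$-means, $z=1$ for $k$-median). The sketch below only evaluates this cost and checks the inequality. It is not the paper's algorithm, it is not differentially private, and the names `clustering_cost` and `within_guarantee` are hypothetical helpers introduced for illustration.

```python
import math


def clustering_cost(points, centers, z):
    """Sum over points of (distance to the nearest center) ** z.

    z=2 gives the k-means cost, z=1 the k-median cost.
    """
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)


def within_guarantee(cost, opt_cost, alpha, beta):
    """Check cost <= alpha * opt_cost + beta.

    opt_cost is assumed to be known here purely for illustration;
    in practice the optimal cost is generally not available.
    """
    return cost <= alpha * opt_cost + beta


# Toy data: two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

kmeans_cost = clustering_cost(points, centers, z=2)
kmedian_cost = clustering_cost(points, centers, z=1)
```

In this toy instance every point lies at distance 1 from its nearest center, so both costs coincide; on real data the two objectives generally differ.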

References (44)
  1. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004.
  2. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49, 2020.
  3. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
  4. Local search heuristic for k-median and facility location problems. In STOC, 2001.
  5. Stability yields a ptas for k-median and k-means clustering. In FOCS, 2010.
  6. Differentially private clustering in high-dimensional euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 322–331. PMLR, 2017.
  7. Differentially-private sublinear-time clustering. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 332–337. IEEE, 2021.
  8. Streaming coreset constructions for m-estimators. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  9. Private and continual release of statistics. ACM Trans. Inf. Syst. Secur., 14(3):26:1–26:24, 2011.
  10. Locally private k-means in one round. In International Conference on Machine Learning, pages 1441–1451. PMLR, 2021.
  11. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65, 2002.
  12. Ke Chen. On k-median clustering in high dimensions. In SODA, 2006.
  13. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, 2008.
  14. Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009.
  15. Differentially-private clustering of easy instances. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2049–2059. PMLR, 18–24 Jul 2021.
  16. Scalable differentially private clustering via hierarchically separated trees. In KDD, pages 221–230, New York, NY, USA, 2022. Association for Computing Machinery.
  17. Near-optimal private and scalable $k$-clustering. In Advances in Neural Information Processing Systems, 2022.
  18. Towards optimal lower bounds for k-median and k-means coresets. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1038–1051, 2022.
  19. A new coreset framework for clustering. In STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 169–182. ACM, 2021.
  20. Streaming euclidean k-median and k-means to a (1 + eps)-approximation with $o_{k,eps}(\log n)$ memory words. In FOCS, 2023.
  21. Calibrating noise to sensitivity in private data analysis. J. Priv. Confidentiality, 7, 2016.
  22. Differential privacy under continual observation. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 715–724. ACM, 2010.
  23. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  24. Differentially private continual releases of streaming frequency moment estimations. CoRR, abs/2301.05605, 2023.
  25. Private coresets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 361–370. ACM, 2009.
  26. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578, 2011.
  27. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
  28. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN 2017, Pittsburgh, PA, USA, April 18-21, 2017, 2017.
  29. Differentially private clustering: Tight approximation ratios. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  30. Differentially private combinatorial optimization. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1106–1125. SIAM, 2010.
  31. Smaller coresets for k-median and k-means clustering. In Proceedings of the twenty-first annual symposium on Computational geometry, pages 126–134, 2005.
  32. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300, 2004.
  33. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 291–300. ACM, 2004.
  34. Optimal differentially private algorithms for k-means clustering. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 395–408. ACM, 2018.
  35. Differential privacy for clustering under continual observation, 2023.
  36. Approximating k-median via pseudo-approximation. SIAM Journal on Computing, 45, 2016.
  37. Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30, 2009.
  38. Smooth sensitivity and sampling in private data analysis. In David S. Johnson and Uriel Feige, editors, Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007, pages 75–84. ACM, 2007.
  39. Clustering algorithms for the centralized and local models. In Algorithmic Learning Theory, ALT 2018, 7-9 April 2018, Lanzarote, Canary Islands, Spain, volume 83 of Proceedings of Machine Learning Research, pages 619–653. PMLR, 2018.
  40. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59, 2012.
  41. Near-linear time approximations schemes for clustering in doubling metrics. In David Zuckerman, editor, 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, pages 540–559. IEEE Computer Society, 2019.
  42. Differentially private k-means with constant multiplicative error. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018.
  43. Distributed k-means clustering guaranteeing local differential privacy. Computers & Security, 90:101699, 2020.
  44. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12), 2012.