
Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions (2311.17868v1)

Published 29 Nov 2023 in cs.DS

Abstract: We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of $m$ elements from a universe of size $n$, its profile is a vector $\phi$ whose $i$-th entry $\phi_i$ represents the number of distinct elements that appear in the stream exactly $i$ times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry $\phi_i$ up to an additive error of $\pm \epsilon D$ using $O(1/\epsilon^2 (\log n + \log m))$ bits of space, where $D$ is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector $\phi$ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile $\hat\phi$ with the following guarantees in terms of space and estimation error:

- For any constant $\tau$, with $O(1/\epsilon^2 + \log n)$ bits of space, $\sum_{i=1}^{\tau} |\phi_i - \hat\phi_i| \leq \epsilon D$.
- With $O(1/\epsilon^2 \log(1/\epsilon) + \log n + \log \log m)$ bits of space, $\sum_{i=1}^{m} |\phi_i - \hat\phi_i| \leq \epsilon m$.

In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on $1/\epsilon$ from those that depend on $n$ and $m$. We prove matching lower bounds on space in both regimes. Applying our profile estimation algorithm gives estimates within error $\pm \epsilon D$ of several symmetric functions of frequencies in $O(1/\epsilon^2 + \log n)$ bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics.
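To make the definitions in the abstract concrete, the following minimal Python sketch computes the profile of a small stream exactly and evaluates one symmetric function (a frequency cap statistic) from it. It is not the paper's streaming algorithm: it stores every distinct element and so uses linear space, and the names `exact_profile`, `cap_statistic`, and `l1_error` are hypothetical helpers introduced here only for illustration.

```python
from collections import Counter


def exact_profile(stream):
    """Return {i: phi_i}, where phi_i counts the distinct elements
    that occur exactly i times. Exact and linear-space (illustration only)."""
    frequencies = Counter(stream)             # element -> number of occurrences
    return dict(Counter(frequencies.values()))  # occurrence count -> number of elements


def cap_statistic(profile, cap):
    """Frequency cap statistic: sum over distinct elements of min(frequency, cap).
    It depends on the stream only through its profile, so it is symmetric."""
    return sum(min(i, cap) * count for i, count in profile.items())


def l1_error(profile, estimate, tau):
    """Sum of |phi_i - hat_phi_i| over i = 1..tau, the error measure in the abstract."""
    return sum(abs(profile.get(i, 0) - estimate.get(i, 0)) for i in range(1, tau + 1))


stream = ["a", "b", "a", "c", "b", "a", "d"]
phi = exact_profile(stream)                   # a: 3 times, b: 2 times, c and d: once each
D = sum(phi.values())                         # number of distinct elements
m = sum(i * count for i, count in phi.items())  # stream length recovered from the profile
```

Here `phi` is `{3: 1, 2: 1, 1: 2}`, so `D` is 4 and `m` is 7. An estimate `hat_phi` meeting the first guarantee with constant `tau` would satisfy `l1_error(phi, hat_phi, tau) <= epsilon * D`.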

References (34)
  1. The bethe and sinkhorn permanents of low rank matrices and implications for profile maximum likelihood. In Conference on Learning Theory, pages 93–158. PMLR, 2021.
  2. A unified maximum likelihood approach for optimal distribution property estimation. In International Conference on Machine Learning. PMLR, 2017.
  3. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’96, page 20–29, New York, NY, USA, 1996. Association for Computing Machinery.
  4. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms, 4:17:1–17:25, 2008.
  5. Using data stream algorithms for computing properties of large graphs. In Workshop on Massive Geometric Data Sets (MASSIVE’05), pages 9–14, 2005.
  6. How to catch l2-heavy-hitters on sliding windows. Theoretical Computer Science, 554:82–94, 2014.
  7. Jaroslaw Blasiok. Optimal streaming and tracking distinct elements with high probability. ACM Trans. Algorithms, 16(1):3:1–3:28, 2020.
  8. Sampling Sketches for Concave Sublinear Functions of Frequencies. 2019.
  9. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31–37, 2014.
  10. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, volume 5, pages 25–36, 2005.
  11. Edith Cohen. Stream sampling for frequency cap statistics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 159–168, 2015.
  12. Edith Cohen. Hyperloglog hyperextended: Sketches for concave sublinear frequency statistics. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 105–114, New York, NY, USA, 2017. Association for Computing Machinery.
  13. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, page 780–791, New York, NY, USA, 2019. Association for Computing Machinery.
  14. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.
  15. Charlie Dickens. Personal communication, 2023.
  16. Mayur Datar and S Muthukrishnan. Estimating rarity and similarity over data stream windows. In European Symposium on Algorithms, pages 323–335. Springer, 2002.
  17. Pan-private streaming algorithms. In ICS, pages 66–80, 2010.
  18. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
  19. Exponential time improvement for min-wise based algorithms. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 57–66. SIAM, 2011.
  20. Multiparty reach and frequency histogram: Private, secure, and practical. Proceedings on Privacy Enhancing Technologies, 2022:373–395, 01 2022.
  21. Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73 – 101, 1964.
  22. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.
  23. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202–208, 2005.
  24. The one-way communication complexity of hamming distance. Theory Comput., 4(1):129–135, 2008.
  25. Truly perfect samplers for data streams and sliding windows. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’22, page 29–40, New York, NY, USA, 2022. Association for Computing Machinery.
  26. Detecting malicious network traffic using inverse distributions of packet contents. In Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 165–170, 2005.
  27. On randomized one-round communication complexity. In Frank Thomson Leighton and Allan Borodin, editors, Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, 29 May-1 June 1995, Las Vegas, Nevada, USA, pages 596–605. ACM, 1995.
  28. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 41–52, 2010.
  29. Pseudorandom hashing for space-bounded computation with applications to streaming. In Proceedings of the 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023.
  30. Robert H. Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840–842, 1978.
  31. Optimal bounds for approximate counting. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’22, page 119–127, New York, NY, USA, 2022. Association for Computing Machinery.
  32. William J. J. Rey. Introduction to Robust and Quasi-Robust Statistical Methods. Universitext. Springer, Berlin, Heidelberg, 1983.
  33. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), 64(6):1–41, 2017.
  34. David Paul Woodruff. Efficient and private distance approximation in the communication and streaming models. PhD thesis, Massachusetts Institute of Technology, 2007.