
Count-Min Sketch with Conservative Updates: Worst-Case Analysis (2405.12034v2)

Published 20 May 2024 in cs.DS and cs.PF

Abstract: Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based data structure used to estimate the occurrences of items within a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream. Next, we study CMS-CU in this setting. In the case where $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, contrasting with the vanilla Count-Min Sketch, where the average counter rate is equal to $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error, incorporating a positive integer parameter $g$. Larger values of this parameter improve the accuracy of the bounds. Moreover, the computation of each bound involves examining an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probability matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$, $g=1$, and as $m\to \infty$, we show that the lower and upper bounds coincide. In general, our bounds exhibit high accuracy for small values of $g$, as shown by numerical computation. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
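
To make the conservative update rule concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each item is mapped to $d$ distinct counters out of $m$, and it stands in for the $d$ hash functions with a SHA-256-seeded random sample. The class name `ConservativeCountMin`, the `seed` parameter, and the helper `state_space_size` are hypothetical names introduced here for illustration.

```python
import hashlib
import math
import random


class ConservativeCountMin:
    """Illustrative CMS-CU with m counters and d counters per item (assumed model)."""

    def __init__(self, m, d, seed=0):
        if not 1 <= d <= m:
            raise ValueError("need 1 <= d <= m")
        self.m, self.d, self.seed = m, d, seed
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: a deterministic pseudo-random choice of d distinct
        # counters plays the role of the d hash functions.
        digest = hashlib.sha256(f"{self.seed}:{item}".encode()).digest()
        rng = random.Random(int.from_bytes(digest, "big"))
        return rng.sample(range(self.m), self.d)

    def update(self, item):
        # Conservative update: increment only the counters that currently
        # hold the minimum value among the item's d counters.
        positions = self._positions(item)
        low = min(self.counters[p] for p in positions)
        for p in positions:
            if self.counters[p] == low:
                self.counters[p] += 1

    def estimate(self, item):
        # The estimate is the minimum over the item's d counters.
        return min(self.counters[p] for p in self._positions(item))


def state_space_size(m, d, g):
    # Size of the Markov state space quoted in the abstract: C(m + g - d, g).
    return math.comb(m + g - d, g)


if __name__ == "__main__":
    sketch = ConservativeCountMin(m=50, d=4)
    for i in range(1000):
        sketch.update(f"item-{i}")  # each item appears once
    print(sketch.estimate("item-0"), state_space_size(50, 4, 5))
```

By contrast, a vanilla Count-Min Sketch would increment all $d$ selected counters on every update, so under this model the counter total after $n$ updates is exactly $dn$ there and at most $dn$ here. For the abstract's example ($m=50$, $d=4$, $g=5$), `state_space_size` returns $\binom{51}{5} = 2349060$.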

