Count-Min Sketch with Conservative Updates: Worst-Case Analysis (2405.12034v2)
Abstract: Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient hash-based data structure used to estimate the number of occurrences of items within a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream. Next, we study CMS-CU in this setting. In the case where $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, contrasting with the vanilla Count-Min Sketch, where the average counter rate is equal to $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error, incorporating a positive integer parameter $g$. Larger values of this parameter improve the accuracy of the bounds. Moreover, the computation of each bound involves examining an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probability matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$, $g=1$, and as $m\to \infty$, we show that the lower and upper bounds coincide. In general, our bounds exhibit high accuracy for small values of $g$, as shown by numerical computation. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
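The conservative-update rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes a single array of $m$ counters, with each item mapped to $d$ distinct counters, and it replaces the $d$ hash functions with a PRNG seeded from a SHA-256 digest. The class `CountMinSketch` and the helper `average_counter_rate` are hypothetical names, and "average counter rate" is read here as the mean counter value divided by the number of processed items.

```python
import hashlib
import random


class CountMinSketch:
    """Single array of m counters; each item maps to d distinct counters."""

    def __init__(self, m, d, conservative=True):
        if not 1 <= d <= m:
            raise ValueError("require 1 <= d <= m")
        self.m, self.d = m, d
        self.conservative = conservative
        self.counters = [0] * m

    def _indices(self, item):
        # Assumption: a SHA-256 digest of the item seeds a PRNG that draws
        # d distinct counter indices, standing in for d ideal hash functions.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest, "big"))
        return rng.sample(range(self.m), self.d)

    def update(self, item):
        idx = self._indices(item)
        if self.conservative:
            # Conservative update: increment only the counters that
            # currently hold the minimum among the d selected ones.
            low = min(self.counters[i] for i in idx)
            for i in idx:
                if self.counters[i] == low:
                    self.counters[i] += 1
        else:
            # Vanilla Count-Min Sketch: increment every selected counter.
            for i in idx:
                self.counters[i] += 1

    def query(self, item):
        # The estimate is the minimum of the item's d counters.
        return min(self.counters[i] for i in self._indices(item))


def average_counter_rate(m, d, n_items, conservative=True):
    """Mean counter value per processed item, for a stream of distinct items."""
    sketch = CountMinSketch(m, d, conservative)
    for item in range(n_items):
        sketch.update(item)
    return sum(sketch.counters) / (m * n_items)
```

Under these assumptions, `average_counter_rate(10, 9, 5000, conservative=False)` equals $d/m = 0.9$, because every item increments exactly $d$ counters. The conservative variant with the same parameters should come out close to $\frac{1}{2}$, in line with the $d=m-1$ result stated above.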