
ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale (arXiv:2402.13726v2)

Published 21 Feb 2024 in cs.DS and cs.DB

Abstract: This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
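The properties listed in the abstract can be made concrete with a small sketch. The code below is not ExaLogLog; it is a toy HyperLogLog-style structure, the baseline the abstract compares against, written only to illustrate what commutative, idempotent, and mergeable mean for a distinct-counting sketch. The class name `TinyHLL`, the helper `sketch_of`, the SHA-256-based hash, and the default precision `p=8` are all assumptions made for this illustration.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog-style sketch with 2**p registers (illustration only)."""

    def __init__(self, p=8):
        self.p = p  # assumed precision; the alpha constant below needs p >= 7
        self.registers = [0] * (1 << p)

    def add(self, item):
        # Derive a 64-bit hash from SHA-256 (an assumption; any good 64-bit hash works).
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # 1-based position of the first 1-bit in `rest`; 64 - p + 1 if rest == 0.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Constant-time insert: one register is updated with a max.
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging is an element-wise max, so it is lossless and order-independent.
        assert self.p == other.p
        out = TinyHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)        # standard constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)      # linear counting for small counts
        return raw


def sketch_of(items, p=8):
    """Hypothetical helper: build a sketch from an iterable."""
    s = TinyHLL(p)
    for x in items:
        s.add(x)
    return s


forward = sketch_of(range(1000))
backward = sketch_of(reversed(range(1000)))
twice = sketch_of(list(range(1000)) * 2)
merged = sketch_of(range(0, 600)).merge(sketch_of(range(400, 1000)))

print(forward.registers == backward.registers)  # commutative: order is irrelevant
print(forward.registers == twice.registers)     # idempotent: duplicates change nothing
print(forward.registers == merged.registers)    # mergeable: union of partial sketches
```

Because every update is a per-register maximum, the final state depends only on the set of inserted items, which is why all three comparisons hold. According to the abstract, ExaLogLog keeps these properties while needing 43% less space for the same estimation error; its register layout differs from the one sketched here.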

