ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale (2402.13726v2)
Published 21 Feb 2024 in cs.DS and cs.DB
Abstract: This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space to achieve the same estimation error.
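The properties listed in the abstract (commutative, idempotent, mergeable, constant-time insert) can be illustrated with a minimal HyperLogLog-style register array. This is only a sketch under stated assumptions: it does not implement ExaLogLog's register layout or its estimator, and the precision `P`, the BLAKE2b-based `hash64` helper, and all function names are hypothetical choices made for the example.

```python
import hashlib

P = 8                 # assumed precision: 2**P registers (hypothetical choice)
M = 1 << P            # number of registers
REST_BITS = 64 - P    # hash bits left after the register index is taken


def hash64(item):
    """Map a string to a 64-bit integer (stand-in for a fast 64-bit hash)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def new_sketch():
    """Return an empty sketch: one small integer register per bucket."""
    return [0] * M


def insert(sketch, item):
    """Constant-time insert: one hash, one register comparison."""
    h = hash64(item)
    index = h >> REST_BITS                 # top P bits select the register
    rest = h & ((1 << REST_BITS) - 1)      # remaining bits
    # 1-based position of the leftmost 1-bit in `rest`;
    # REST_BITS + 1 when all remaining bits are zero.
    rank = REST_BITS - rest.bit_length() + 1
    if rank > sketch[index]:
        sketch[index] = rank               # registers only ever grow


def merge(a, b):
    """Merge two sketches of equal size by taking the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def build(items):
    """Convenience helper: insert every item into a fresh sketch."""
    sketch = new_sketch()
    for item in items:
        insert(sketch, item)
    return sketch


left = build(["a", "b", "c"])
right = build(["c", "d", "e"])
union = build(["a", "b", "c", "d", "e"])
```

Because each register keeps only a maximum, the state depends on the set of inserted items, not on their order or multiplicity, and merging two sketches gives the same state as inserting the union of their inputs into one sketch.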