
Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries (2402.15953v3)

Published 25 Feb 2024 in cs.DB

Abstract: With the increasing rate of data generated by critical systems, estimating functions on streaming data has become essential. This demand has driven numerous advancements in algorithms designed to efficiently query and analyze one or more data streams while operating under memory constraints. The primary challenge arises from the rapid influx of new items, requiring algorithms that enable efficient incremental processing of streams in order to keep up. A prominent algorithm in this domain is the AMS sketch. Originally developed to estimate the second frequency moment of a data stream, it can also estimate the cardinality of the equi-join between two relations. Two important advancements have followed: the Count sketch, which significantly improves the sketch update time, and an extension of the AMS sketch to accommodate multi-join queries. However, combining the strengths of these methods to maintain sketches for multi-join queries while ensuring fast update times is a non-trivial task, and has remained an open problem for decades, as highlighted in the existing literature. In this work, we successfully address this problem by introducing a novel sketching method which has fast updates, even for sketches capable of accurately estimating the cardinality of complex multi-join queries. We prove that our estimator is unbiased and has the same error guarantees as the AMS-based method. Our experimental results confirm the significant improvement in update time complexity, resulting in orders of magnitude faster estimates, with equal or better estimation accuracy.
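
To make the background concrete, the following is a minimal sketch of the well-known two-relation case the abstract builds on: a Count sketch per relation, with the equi-join size estimated as the median over rows of the inner product of the two tables. It is not the paper's multi-join method. The class name `CountSketch`, the function `estimate_join_size`, the polynomial hashing over a Mersenne prime, and all parameter values are assumptions chosen for illustration.

```python
import random
from statistics import median

P = (1 << 61) - 1  # Mersenne prime for polynomial hashing (illustrative choice)


class CountSketch:
    """Hypothetical minimal Count sketch over non-negative integer join keys."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # Degree-1 polynomial per row selects the bucket.
        self.bucket_coef = [(rng.randrange(1, P), rng.randrange(P))
                            for _ in range(depth)]
        # Degree-3 polynomial per row selects the sign (assumed 4-wise independent).
        self.sign_coef = [tuple(rng.randrange(P) for _ in range(4))
                          for _ in range(depth)]

    def _bucket(self, row, key):
        a, b = self.bucket_coef[row]
        return ((a * key + b) % P) % self.width

    def _sign(self, row, key):
        acc = 0
        for c in self.sign_coef[row]:  # Horner evaluation modulo P
            acc = (acc * key + c) % P
        return 1 if acc & 1 else -1

    def update(self, key, count=1):
        # One counter per row is touched, so an update costs O(depth),
        # independent of the sketch width.
        for r in range(self.depth):
            self.table[r][self._bucket(r, key)] += self._sign(r, key) * count


def estimate_join_size(a, b):
    """Median over rows of the row-wise inner product of two sketches.

    Both sketches must be built with the same width, depth, and seed so that
    they share hash functions.
    """
    per_row = [sum(x * y for x, y in zip(ra, rb))
               for ra, rb in zip(a.table, b.table)]
    return median(per_row)


if __name__ == "__main__":
    rng = random.Random(1)
    r_keys = [rng.randrange(100) for _ in range(5000)]
    s_keys = [rng.randrange(100) for _ in range(5000)]
    r_sketch = CountSketch(width=1024, depth=5, seed=42)
    s_sketch = CountSketch(width=1024, depth=5, seed=42)
    for k in r_keys:
        r_sketch.update(k)
    for k in s_keys:
        s_sketch.update(k)
    exact = sum(r_keys.count(k) * s_keys.count(k) for k in set(r_keys))
    print("exact join size:", exact)
    print("sketch estimate:", estimate_join_size(r_sketch, s_sketch))
```

Each update touches one counter per row, which is the fast-update property the abstract attributes to the Count sketch. An AMS sketch of the same total size would instead update every counter on each insertion.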
