ByteCard: Enhancing ByteDance's Data Warehouse with Learned Cardinality Estimation (2403.16110v3)

Published 24 Mar 2024 in cs.DB

Abstract: Cardinality estimation is a critical component and a longstanding challenge in modern data warehouses. ByteHouse, ByteDance's cloud-native engine for extensive data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. With the increasing demand for ByteHouse, cardinality estimation has become the bottleneck for efficient query processing. Specifically, the existing query optimizer of ByteHouse uses a traditional Selinger-like cardinality estimator, which can produce substantial estimation errors, resulting in suboptimal query plans. To improve cardinality estimation accuracy while maintaining a practical inference overhead, we develop ByteCard, a framework that enables efficient training and integration of learned cardinality estimators. Furthermore, ByteCard adapts recent advances in cardinality estimation to build models that balance accuracy and practicality (e.g., inference latency, model size, training overhead). We observe significant query processing speed-up in ByteHouse after replacing the existing cardinality estimator with ByteCard for several optimization scenarios. Evaluations on real-world datasets show that integrating ByteCard improves the 99th-percentile latency by up to 30%. Finally, we share our experience in engineering advanced cardinality estimators. This experience can help ByteHouse integrate more learning-based solutions on the critical query execution path in the future.
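
To make concrete why a Selinger-like estimator can err, here is a minimal sketch of the classic attribute-independence assumption on a toy table. It is an illustration only, not ByteHouse's or ByteCard's implementation. The table, the column names, and the helper functions `selinger_estimate`, `true_cardinality`, and `q_error` are all hypothetical. The sketch uses exact per-column selectivities (a simplification, since real optimizers rely on summary statistics) so that the independence assumption is the only source of error.

```python
def selinger_estimate(rows, predicates):
    """Estimate the result size of a conjunction of equality predicates by
    multiplying per-column selectivities (attribute-independence assumption)."""
    n = len(rows)
    selectivity = 1.0
    for column, value in predicates.items():
        matches = sum(1 for row in rows if row[column] == value)
        selectivity *= matches / n
    return n * selectivity


def true_cardinality(rows, predicates):
    """Count the rows that satisfy every predicate."""
    return sum(
        1 for row in rows
        if all(row[column] == value for column, value in predicates.items())
    )


def q_error(estimate, actual):
    """Symmetric ratio error; both values are clamped to at least 1."""
    estimate, actual = max(estimate, 1.0), max(actual, 1.0)
    return max(estimate / actual, actual / estimate)


# Hypothetical toy table in which `city` fully determines `country`.
rows = (
    [{"city": "Paris", "country": "FR"}] * 50
    + [{"city": "Berlin", "country": "DE"}] * 50
)
predicates = {"city": "Paris", "country": "FR"}

estimate = selinger_estimate(rows, predicates)  # 100 * 0.5 * 0.5
actual = true_cardinality(rows, predicates)
print(estimate, actual, q_error(estimate, actual))
```

Because the two columns are perfectly correlated, the independence-based estimate is half the true count. Errors of this kind typically compound as more predicates and joins are added, which is the sort of gap that learned estimators modelling the joint distribution aim to close.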
