ByteCard: Enhancing ByteDance's Data Warehouse with Learned Cardinality Estimation (2403.16110v3)
Abstract: Cardinality estimation is a critical component of modern data warehouses and a longstanding challenge. ByteHouse, ByteDance's cloud-native engine for large-scale data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. As demand for ByteHouse grows, cardinality estimation has become a bottleneck for efficient query processing. Specifically, ByteHouse's existing query optimizer uses a traditional Selinger-like cardinality estimator, which can produce substantial estimation errors and therefore suboptimal query plans. To improve cardinality estimation accuracy while keeping inference overhead practical, we develop ByteCard, a framework that enables efficient training and integration of learned cardinality estimators. ByteCard adapts recent advances in cardinality estimation to build models that balance accuracy against practicality (e.g., inference latency, model size, training overhead). We observe significant query processing speed-ups in ByteHouse after replacing the existing cardinality estimator with ByteCard in several optimization scenarios. Evaluations on real-world datasets show that integrating ByteCard improves 99th-percentile latency by up to 30%. Finally, we share our experience in engineering advanced cardinality estimators. This experience can help ByteHouse integrate more learning-based solutions on the critical query execution path in the future.
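To make the estimation-error claim concrete, the following is a minimal sketch of how a Selinger-style estimator can go wrong. It assumes the classic attribute-independence assumption (per-column selectivities are multiplied); the abstract does not describe ByteHouse's estimator in this detail, so the table, column names, and helper functions here are hypothetical and purely illustrative.

```python
# Illustrative only: a toy Selinger-style estimator that assumes columns are
# independent, compared against the true count on a correlated table.
# All names and data below are hypothetical, not taken from ByteHouse.

def independence_estimate(rows, predicates):
    """Estimate the row count by multiplying per-column selectivities."""
    n = len(rows)
    selectivity = 1.0
    for column, value in predicates.items():
        matches = sum(1 for row in rows if row[column] == value)
        selectivity *= matches / n
    return selectivity * n

def true_cardinality(rows, predicates):
    """Count rows that satisfy every equality predicate."""
    return sum(
        1 for row in rows
        if all(row[column] == value for column, value in predicates.items())
    )

def q_error(estimate, actual):
    """Multiplicative error factor between estimate and truth (>= 1)."""
    estimate, actual = max(estimate, 1.0), max(actual, 1.0)
    return max(estimate / actual, actual / estimate)

# Synthetic table in which `city` fully determines `country`.
CITY_TO_COUNTRY = {"beijing": "cn", "shanghai": "cn", "paris": "fr", "lyon": "fr"}
rows = [
    {"city": city, "country": country}
    for city, country in CITY_TO_COUNTRY.items()
    for _ in range(2500)
]

predicates = {"city": "beijing", "country": "cn"}
estimate = independence_estimate(rows, predicates)
actual = true_cardinality(rows, predicates)
print(estimate, actual, q_error(estimate, actual))
```

On this synthetic table the independence assumption halves the true count, because the second predicate is redundant given the first. With more correlated columns or joins, such errors compound multiplicatively, which is the kind of gap that learned estimators aim to close.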