Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRAD -- Extended Version (2407.15363v1)
Abstract: Modern organizations manage their data with a wide variety of specialized cloud database engines (e.g., Aurora and BigQuery). However, designing and managing such infrastructures is hard. Developers must consider many possible designs with non-obvious performance consequences; moreover, current software abstractions tightly couple applications to specific systems (e.g., with engine-specific clients), making the infrastructure difficult to change after initial deployment. A better solution would virtualize cloud data management, allowing developers to declaratively specify their workload requirements and rely on automated solutions to design and manage the physical realization. In this paper, we present a technique called blueprint planning that achieves this vision. The key idea is to project data infrastructure design decisions into a unified design space (blueprints). We then systematically search over candidate blueprints using cost-based optimization, leveraging learned models to predict the utility of a blueprint on the workload. We use this technique to build BRAD, the first cloud data virtualization system. BRAD users issue queries to a single SQL interface that can be backed by multiple cloud database services. BRAD automatically selects the most suitable engine for each query, provisions and manages resources to minimize costs, and evolves the infrastructure to adapt to workload shifts. Our evaluation shows that BRAD meets user-defined performance targets and improves cost savings by 1.6-13x compared to serverless auto-scaling or HTAP systems.