Query Rewriting via Large Language Models (2403.09060v1)
Abstract: Query rewriting is one of the most effective techniques for coping with poorly written queries before passing them down to the query optimizer. Manual rewriting is not scalable, as it is error-prone and requires deep expertise. Similarly, traditional query rewriting algorithms can only handle a small subset of queries: rule-based techniques do not generalize to new query patterns and synthesis-based techniques cannot handle complex queries. Fortunately, the rise of LLMs, equipped with broad general knowledge and advanced reasoning capabilities, has created hopes for solving some of these previously open problems. In this paper, we present GenRewrite, the first holistic system that leverages LLMs for query rewriting. We introduce the notion of Natural Language Rewrite Rules (NLR2s), and use them as hints to the LLM but also a means for transferring knowledge from rewriting one query to another, and thus becoming smarter and more effective over time. We present a novel counterexample-guided technique that iteratively corrects the syntactic and semantic errors in the rewritten query, significantly reducing the LLM costs and the manual effort required for verification. GenRewrite speeds up 22 out of 99 TPC queries (the most complex public benchmark) by more than 2x, which is 2.5x--3.2x higher coverage than state-of-the-art traditional query rewriting and 2.1x higher than the out-of-the-box LLM baseline.
- dbt Labs — Transform Data in Your Warehouse. https://www.getdbt.com/.
- LeetCode. https://leetcode.com/problemset/database/.
- Looker business intelligence platform embedded analytics. https://cloud.google.com/looker.
- OpenAI Introduction. https://platform.openai.com/docs/introduction/tokens.
- Prompt Engineering for Generative AI. https://developers.google.com/machine-learning/resources/prompt-eng.
- TPC-DS Benchmark. https://www.tpc.org/tpcds/.
- A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793–819.
- Cost-based query transformation in Oracle. In VLDB, Vol. 6. 1026–1036.
- Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221–230.
- Computation reuse via fusion in Amazon Athena. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1610–1620.
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. arXiv:2305.15778 [cs.SE]
- The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226.
- ARe-SQL: An Online, Automatic and Non-Intrusive Approach for Rewriting SQL Queries. Journal of Information and Data Management 5, 1 (2014), 28–28.
- Proving Query Equivalence Using Linear Integer Arithmetic. Proceedings of the ACM on Management of Data 1, 4 (2023), 1–26.
- SlabCity: Whole-Query Optimization Using Program Synthesis. Proceedings of the VLDB Endowment 16, 11 (2023), 3151–3164.
- Sérgio Fernandes and Jorge Bernardino. 2015. What is bigquery?. In Proceedings of the 19th International Database Engineering & Applications Symposium. 202–203.
- CatSQL: Towards Real World Natural Language to SQL Applications. Proceedings of the VLDB Endowment 16, 6 (2023), 1534–1547.
- Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. CoRR abs/2308.15363 (2023). arXiv preprint arXiv:2308.15363 (2023).
- Goetz Graefe. 1987. Rule-based query optimization in extensible database systems. Technical Report. University of Wisconsin-Madison Department of Computer Sciences.
- Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference. 219–224.
- Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–28.
- Towards complex text-to-sql in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205 (2019).
- Selecting subexpressions to materialize at datacenter scale. Proceedings of the VLDB Endowment 11, 7 (2018), 800–812.
- George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A survey on deep learning approaches for text-to-SQL. The VLDB Journal (2023), 1–32.
- Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737–1750.
- Alon Yitzchak Levy and Inderpal Singh Mumick. 1997. Query optimization by predicate move-around. US Patent 5,659,725.
- Table-gpt: Table-tuned gpt for diverse table tasks. arXiv preprint arXiv:2310.09263 (2023).
- Natural language data management and interfaces. Springer.
- Large Language Model for Table Processing: A Survey. arXiv preprint arXiv:2402.05121 (2024).
- M Muralikrishna et al. 1992. Improved unnesting algorithms for join aggregate SQL queries. In VLDB, Vol. 92. Citeseer, 91–102.
- Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022).
- Extensible/rule based query rewrite optimization in Starburst. ACM Sigmod Record 21, 2 (1992), 39–48.
- Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems 36 (2024).
- Cost-based optimization for magic: Algebra and implementation. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data. 435–446.
- Learning contextual representations for semantic parsing with generation-augmented pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13806–13814.
- Annotating columns with pre-trained language models. In Proceedings of the 2022 International Conference on Management of Data. 1493–1503.
- Sql-palm: Improved large language model adaptation for text-to-sql. CoRR, abs/2306.00739, 2023a. doi: 10.48550. arXiv preprint ARXIV.2306.00739 ([n. d.]).
- A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313 (2024).
- Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942 (2019).
- A natural language interface for database: Achieving transfer-learnability using adversarial method for question understanding. In 2020 IEEE 36th International conference on data engineering (ICDE). IEEE, 97–108.
- Wetune: Automatic discovery and verification of query rewrite rules. In Proceedings of the 2022 International Conference on Management of Data. 94–107.
- Guoliang Li Xuanhe Zhou, Zhaoyan Sun. 2023. DB-GPT: Large Language Model Meets Database.
- Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
- A learned query rewrite system using monte carlo tree search. Proceedings of the VLDB Endowment 15, 1 (2021), 46–58.
- D-bot: Database diagnosis system using large language models. arXiv preprint arXiv:2312.01454 (2023).
- Jie Liu (492 papers)
- Barzan Mozafari (14 papers)