
LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency (2404.12872v1)

Published 19 Apr 2024 in cs.DB and cs.CL
Abstract: Query rewrite, which aims to generate more efficient queries by altering a SQL query's structure without changing the query result, has been an important research problem. In order to maintain equivalence between the rewritten query and the original one during rewriting, traditional query rewrite methods always rewrite the queries following certain rewrite rules. However, some problems still remain. Firstly, existing methods of finding the optimal choice or sequence of rewrite rules are still limited, and the process always costs a lot of resources. Methods involving discovering new rewrite rules typically require complicated proofs of structural logic or extensive user interactions. Secondly, current query rewrite methods usually rely highly on DBMS cost estimators, which are often not accurate. In this paper, we address these problems by proposing a novel query rewrite method named LLM-R2, adopting an LLM to propose possible rewrite rules for a database rewrite system. To further improve the inference ability of the LLM in recommending rewrite rules, we train a contrastive model via curriculum learning to learn query representations and select effective query demonstrations for the LLM. Experimental results have shown that our method can significantly improve query execution efficiency and outperform the baseline methods. In addition, our method enjoys high robustness across different datasets.

Enhanced SQL Query Rewriting using LLMs

Introduction to LLM-R² System

The LLM-R² system introduces a transformative approach to SQL query rewriting by integrating LLMs to suggest rewrite rules that can be applied within a database system. Traditional query rewrite systems rely heavily on pre-defined rules, limiting their effectiveness and adaptability. To address these limitations, LLM-R² employs a novel methodology that utilizes the capabilities of LLMs to propose potential rewrite rules, which are then applied using established database platforms. This approach ensures the executability and equivalence of the rewritten queries by relying on validated rewrite rules, while significantly improving query execution efficiency.

System Design and Implementation

General Workflow

The overall architecture of the LLM-R² system is designed to leverage LLMs for enhancing the rule-based query rewrite process. The system processes SQL queries by prompting an LLM with the original query and a set of potential rewrite rules. It then uses the LLM’s suggestions to apply the most effective rules using a regular database rewrite engine.
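The workflow above can be sketched in a few lines. The function and rule names below are illustrative stand-ins (the rule names resemble Apache Calcite's, and the LLM call is stubbed out rather than made against a real API); the key idea is that the LLM only *suggests* rules, and only suggestions matching known, validated rules are ever applied:

```python
# Illustrative sketch of the LLM-R2 workflow. Names are hypothetical;
# a real deployment would call an LLM API and a rewrite engine such as
# Apache Calcite. Here the LLM response is a hard-coded stub.

CANDIDATE_RULES = ["AGGREGATE_PROJECT_MERGE", "FILTER_INTO_JOIN", "SORT_REMOVE"]

def build_prompt(query: str, demonstration: str) -> str:
    """Assemble the prompt: candidate rules, one demonstration, the target query."""
    return (
        "You are a SQL rewrite assistant. Choose rules from: "
        + ", ".join(CANDIDATE_RULES) + "\n"
        + "Demonstration:\n" + demonstration + "\n"
        + "Query:\n" + query + "\n"
        + "Suggested rules:"
    )

def parse_rule_suggestions(llm_output: str) -> list[str]:
    """Keep only suggestions that match known, validated rules.

    Filtering against a fixed rule list is what guarantees the final
    rewrite stays executable and equivalence-preserving."""
    return [r for r in CANDIDATE_RULES if r in llm_output]

# Stubbed LLM response in place of a real API call.
llm_output = "Apply FILTER_INTO_JOIN, then SORT_REMOVE."
rules = parse_rule_suggestions(llm_output)
print(rules)  # ['FILTER_INTO_JOIN', 'SORT_REMOVE']
```

Because hallucinated rule names are simply filtered out, an unreliable LLM suggestion can at worst leave the query unrewritten, never make it incorrect.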

Demonstration Manager

A central component of the LLM-R² system is the Demonstration Manager. This module optimizes the selection of in-context demonstrations, which are crucial for guiding the LLM in generating useful rewrite rules. The manager functions in two main phases:

  • Demonstration Preparation: This stage involves generating a pool of effective rewrite examples using existing methods. It assesses the impact of various rewrite strategies on query performance, ensuring a rich collection of high-quality rewrites for training and application.
  • Demonstration Selection: At this stage, a model is trained to select the most appropriate demonstration for any given input query. This selection is crucial as it influences the LLM’s ability to propose effective rewrite rules.
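As a rough sketch of the selection phase, the snippet below picks the demonstration from the pool that is most similar to the input query. Plain token overlap (Jaccard similarity) stands in for the learned contrastive query representation described in the paper, and the pool contents are invented placeholders:

```python
# Hypothetical sketch of demonstration selection. The paper trains a
# contrastive query encoder via curriculum learning; simple token-set
# overlap substitutes for that learned representation here.

def tokenize(sql: str) -> set[str]:
    """Crude SQL tokenization: uppercase, drop commas, split on whitespace."""
    return set(sql.upper().replace(",", " ").split())

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def select_demonstration(query: str, pool: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (demo_query, demo_rewrite) pair most similar to `query`."""
    q = tokenize(query)
    return max(pool, key=lambda demo: jaccard(q, tokenize(demo[0])))

# Placeholder pool of (original query, effective rewrite) pairs.
pool = [
    ("SELECT * FROM orders WHERE id IN (SELECT id FROM big)", "demo rewrite 1"),
    ("SELECT COUNT(*) FROM lineitem GROUP BY l_orderkey", "demo rewrite 2"),
]
best = select_demonstration("SELECT * FROM orders WHERE price > 10", pool)
print(best[1])  # demo rewrite 1
```

The selected pair is then placed in the LLM prompt as the in-context demonstration, so the quality of this similarity measure directly shapes which rewrite rules the LLM proposes.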

Experimental Evaluation

Setup and Datasets

The LLM-R² system was evaluated using three benchmark datasets: TPC-H, IMDB, and DSB, encompassing a variety of query complexities and data scales. Comparative experiments were conducted against traditional rule-based methods and a baseline LLM-only approach.

Results

The experimental results confirmed that LLM-R² significantly reduces the execution time of SQL queries compared to both the original queries and those rewritten by baseline methods. Notably, the system demonstrated robust performance across all tested datasets, often outperforming traditional methods by a substantial margin.

Theoretical and Practical Implications

The introduction of LLM-R² has several implications for both theory and practice in database query processing:

  • Theoretical: LLM-R² challenges conventional rule-based rewrite systems by introducing a model that combines the theoretical underpinnings of LLMs with the practical application of database management systems. This hybrid approach opens new avenues for research into intelligent query optimization.
  • Practical: For practitioners, LLM-R² offers a more dynamic and effective tool for query rewriting, capable of adapting to a variety of database schemas and query structures without the need for extensive rule redefinition.

Future Directions

Given the promising results obtained with LLM-R², future research could explore several avenues:

  • Model Enhancement: Further refining the model's demonstration selection phase could yield even greater efficiencies in query rewriting.
  • LLM Integration: Exploring the integration of other LLM architectures or custom-trained models specifically optimized for SQL contexts could improve both the efficiency and accuracy of rewrites.
  • Broadened Application: Extending the LLM-R² approach to other areas of database management, such as automatic indexing or query dispatching, could significantly enhance overall system performance.

Conclusion

The LLM-R² system represents a significant step forward in the field of SQL query rewriting. By effectively integrating LLMs into the rule-based rewrite process, it offers substantial improvements in query execution efficiency while maintaining the high standards of executability and equivalence required in database systems. This innovative approach not only enhances current database management practices but also sets the stage for further developments in intelligent database systems.

Authors (5)
  1. Zhaodonghui Li
  2. Haitao Yuan
  3. Huiming Wang
  4. Gao Cong
  5. Lidong Bing