
Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis

Published 22 Jan 2024 in cs.AI, cs.DB, and cs.PL | (2401.12379v1)

Abstract: This study investigates various approaches to using LLMs for Text-to-SQL program synthesis, focusing on the outcomes and insights derived. Employing the popular Text-to-SQL dataset, Spider, the goal was to input a natural language question along with the database schema and output the correct SQL SELECT query. The initial approach was to fine-tune a local, open-source model to generate the SELECT query. After QLoRA fine-tuning of WizardLM's WizardCoder-15B model on the Spider dataset, the execution accuracy for generated queries rose to a high of 61%. With the second approach, using fine-tuned gpt-3.5-turbo-16k (few-shot) plus gpt-4-turbo (zero-shot error correction), the execution accuracy reached a high of 82.1%. Most of the incorrect queries fall into seven categories: selecting the wrong columns or the wrong column order, grouping by the wrong column, predicting the wrong values in conditionals, using different aggregates than the ground truth, extra or missing JOIN clauses, inconsistencies in the Spider dataset itself, and completely incorrect query structure. These categories offer insight into where the faults still lie with LLM program synthesis and where it can be improved.

Citations (3)

Summary

  • The paper demonstrates that closed-source LLMs with few-shot in-context learning outperform open-source models in generating correct SQL queries.
  • It employs QLoRA fine-tuning and error-driven correction techniques, achieving a peak execution accuracy of 82.1% on the Spider dataset.
  • The study identifies seven error categories and reveals evaluation script flaws, emphasizing the need for refined assessment protocols.

Introduction

Text-to-SQL program synthesis has advanced significantly with the introduction of LLMs, transforming the landscape of querying databases using natural language (NL). Studies have previously utilized smaller pre-trained models; however, recent developments leverage large-parameter pre-trained LLMs that have demonstrated superiority over traditional sequence-to-sequence models. With the introduction of the Spider dataset from Yale researchers, new methodologies for Text-to-SQL synthesis have emerged. The paper under discussion compares open-source and closed-source models, focusing on their ability to convert NL questions into correct SQL SELECT queries, thus pushing the boundaries of research in this area.
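The task setup described above, a natural-language question plus the database schema as input, a SQL SELECT query as output, can be sketched as a simple prompt builder. The function name, prompt layout, and example schema below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the Text-to-SQL input format: the model is shown
# the schema and the question, and is expected to complete the SQL section.

def build_prompt(question: str, schema: str) -> str:
    """Combine a database schema and an NL question into one LLM prompt."""
    return (
        "### Database schema:\n"
        f"{schema}\n"
        "### Question:\n"
        f"{question}\n"
        "### SQL:\n"
    )

schema = "CREATE TABLE singer (singer_id INT, name TEXT, age INT);"
question = "What are the names of singers older than 30?"
prompt = build_prompt(question, schema)
```

In few-shot settings, several (schema, question, gold SQL) triples would be prepended to this prompt before the target question.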

Open Source Models

The study initially implemented a fine-tuning paradigm on WizardLM's WizardCoder-15B model, known for its efficiency on programming tasks. QLoRA fine-tuning yielded significant execution-accuracy gains at a modest memory footprint. However, this model encountered limitations with few-shot in-context learning, which is critical for iterative improvement. Additionally, challenges arose with the zero-shot fine-tuning approach, revealing a gap when compared to few-shot-capable models such as gpt-3.5-turbo and its subsequent iterations.
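QLoRA keeps the base model's weights frozen in 4-bit precision and trains only small low-rank adapter matrices, which is what makes fine-tuning a 15B-parameter model memory-affordable. The hyperparameter values and the parameter-count helper below are illustrative assumptions; the paper does not specify its exact settings.

```python
# Hedged sketch: typical QLoRA settings for adapting a code LLM such as
# WizardCoder-15B. These values are common defaults, not the paper's.
qlora_config = {
    "load_in_4bit": True,                 # base weights quantized to 4-bit
    "bnb_4bit_quant_type": "nf4",         # NormalFloat4 quantization
    "lora_r": 16,                         # rank of the adapter matrices
    "lora_alpha": 32,                     # adapter scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters a rank-r adapter adds to one d_in x d_out weight:
    matrix A is d_in x r and matrix B is r x d_out."""
    return d_in * r + r * d_out

# For a hypothetical 6144 x 6144 projection with r=16, the adapter adds
# ~197k trainable parameters instead of the ~37.7M in the full matrix.
added = lora_param_count(6144, 6144, 16)
```

This is why the text notes a "cost-effective memory footprint": only the adapters (and optimizer state for them) live in full precision.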

Closed Source Models

Transitioning to closed-source models, the study adopted a robust approach with OpenAI's gpt-3.5-turbo-16k, later supplemented by gpt-4-turbo for error-driven correction. This phase entailed fine-tuning on the Spider dataset and implementing a novel correction mechanism. Through example-driven correction, the models improved their accuracy by aligning generated query outputs with ground-truth result tables. Going further, error-driven correction with gpt-4-turbo in a zero-shot setting refined the query output, yielding a notable peak execution accuracy of 82.1%.
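The correction loop can be sketched as: execute the candidate query, and if it raises an error or its result table diverges from the ground truth, hand it back to a stronger model for a rewrite. This is a minimal sketch under stated assumptions; `ask_model_to_fix` is a hypothetical stand-in for the gpt-4-turbo zero-shot correction call, and the exact comparison logic is an assumption, not the paper's code.

```python
import sqlite3

def execute(conn, sql):
    """Run a query; return (sorted rows, None) or (None, error message)."""
    try:
        return sorted(conn.execute(sql).fetchall()), None
    except sqlite3.Error as e:
        return None, str(e)

def corrected_query(conn, candidate, gold, ask_model_to_fix):
    """Keep the candidate if it is execution-accurate; otherwise ask a
    stronger model (hypothetical callback) to rewrite it."""
    rows, err = execute(conn, candidate)
    gold_rows, _ = execute(conn, gold)
    if err is None and rows == gold_rows:
        return candidate                      # already execution-accurate
    return ask_model_to_fix(candidate, err)   # zero-shot correction step
```

In practice the feedback given to the correcting model would include the execution error or a mismatch description, so the rewrite is grounded in what actually went wrong.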

Insights

A close examination of the generated queries surfaced seven error categories, ranging from column selection and ordering issues to incorrect aggregate-function use, and from JOIN clause mishaps to entirely incorrect query structures. Furthermore, the study uncovered flaws in the Spider dataset's evaluation script, which inadvertently marked correct LLM-generated queries as incorrect and sometimes asserted inaccurate ground-truth queries. This calls for more sophisticated mechanisms to critically assess the accuracy of LLM-generated SQL queries.
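The evaluation pitfall can be illustrated concretely: two queries can be textually and structurally different yet execution-equivalent, so matching on query text (or structure) can mislabel correct LLM output. This example is an illustration of the general issue, not taken from the paper's evaluation code.

```python
import sqlite3

# Build a tiny in-memory database (illustrative schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ava", 34), ("Ben", 28), ("Cleo", 41)])

gold = "SELECT name FROM singer WHERE age > 30 ORDER BY name"
pred = "SELECT s.name FROM singer AS s WHERE s.age >= 31 ORDER BY s.name"

# Execution comparison says the prediction is correct; an exact-match
# comparison would say it is wrong.
same_results = conn.execute(gold).fetchall() == conn.execute(pred).fetchall()
exact_match = gold == pred
```

Here `same_results` is true while `exact_match` is false, which is precisely why execution accuracy, measured carefully, is the more meaningful metric.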

Conclusions

The comparative study of open-source versus closed-source LLMs for Text-to-SQL synthesis underlines the superiority of closed-source models equipped with few-shot in-context learning abilities. Beyond differences in parameter count and training data, the research emphasizes the necessity of addressing the complexity of incorrect query structures. In conclusion, while LLMs show promise on Text-to-SQL tasks, the complexity of the queries and the evaluation process itself pose significant challenges that warrant further exploration and refinement.
