
DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models (2406.11434v1)

Published 17 Jun 2024 in cs.DB

Abstract: LLMs have become the dominant paradigm for the challenging task of text-to-SQL. LLM-empowered text-to-SQL methods are typically categorized into prompting-based and tuning-based approaches. Compared to prompting-based methods, benchmarking fine-tuned LLMs for text-to-SQL is important yet under-explored, partially attributed to the prohibitively high computational cost. In this paper, we present DB-GPT-Hub, an open benchmark suite for LLM-empowered text-to-SQL, which primarily focuses on tuning LLMs at large scales. The proposed benchmark consists of: 1. a standardized and comprehensive evaluation of text-to-SQL tasks by fine-tuning medium to large-sized open LLMs; 2. a modularized and easy-to-extend codebase supporting mainstream LLMs and experimental scenarios, which prioritizes fine-tuning methods but can be easily extended to prompt-based settings. Our work investigates the potential gains and the performance boundaries of tuning approaches compared to prompting approaches, and explores optimal solutions tailored to specific scenarios. We hope DB-GPT-Hub, along with these findings, enables further research and broad applications that would otherwise be difficult owing to the absence of a dedicated open benchmark. The project code has been released at https://github.com/eosphoros-ai/DB-GPT-Hub.

Overview of DB-GPT-Hub: A Benchmark Suite for Text-to-SQL

The paper "DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by LLMs" introduces an open benchmarking suite designed specifically for fine-tuning LLMs in the challenging domain of text-to-SQL tasks. This suite encompasses various LLMs, datasets, and experimental settings, providing a standardized and comprehensive evaluation landscape for academic and industrial researchers.

Key Contributions

The primary contributions of this paper are as follows:

  1. Standardization: DB-GPT-Hub establishes a standardized pipeline within an open-source codebase, ensuring transparent and consistent comparisons across LLMs after they have been fine-tuned for the text-to-SQL task.
  2. Comprehensiveness: It provides extensive benchmarking, covering medium to large-sized fine-tuned LLMs and comparative performance analysis against prompting methods. This broad application spectrum makes DB-GPT-Hub one of the most pragmatic and expansive benchmark suites available.
  3. Extensibility: Importantly, DB-GPT-Hub facilitates easy integration of novel methods, datasets, modules, models, and evaluation programs. This framework is designed to be compatible with various prompting techniques, which significantly benefits the evolving research area of text-to-SQL.

Background and Methodology

The task of text-to-SQL involves transforming natural language questions into SQL queries, which requires intricate handling of database schemas and natural language semantics. The paper categorizes LLM-empowered text-to-SQL methods into two primary approaches: zero-shot/few-shot prompting and fine-tuning. While prompting-based methods provide initial insights, fine-tuning has shown promising potential in enhancing model performance, albeit with higher computational demands.
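To make the task concrete, the snippet below pairs a toy schema and question with the SQL a model is expected to produce, and shows a typical zero-shot prompt. The schema, column names, and prompt wording are invented for illustration and are not taken from the benchmark datasets.

```python
# Minimal illustration of the text-to-SQL task: a database schema, a natural-language
# question, and the SQL query a model should produce for that pair.
schema = """
CREATE TABLE singer (singer_id INT PRIMARY KEY, name TEXT, country TEXT, age INT);
CREATE TABLE concert (concert_id INT PRIMARY KEY, theme TEXT, year INT);
"""

question = "How many singers are from France?"

# Expected model output for this question and schema:
expected_sql = "SELECT COUNT(*) FROM singer WHERE country = 'France';"

# A typical zero-shot prompt simply concatenates schema and question:
prompt = f"Given the database schema:\n{schema}\nWrite a SQL query answering: {question}"
print(prompt)
```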

To fill the gap left by the absence of an open benchmark for tuning approaches, DB-GPT-Hub leverages Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and QLoRA to adapt medium- and large-sized models efficiently. The suite includes a modularized codebase supporting both fine-tuning and prompting settings, thus facilitating diverse experimental scenarios.
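As a rough sketch of what LoRA-style adaptation looks like in practice, the snippet below attaches adapters to a causal LM with the Hugging Face peft library. The base model name, rank, and target modules are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal LoRA setup sketch with Hugging Face `peft` (illustrative hyperparameters).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "codellama/CodeLlama-7b-hf"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```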

Benchmark Design and Resources

Datasets: Experiments are primarily conducted on two well-known public datasets:

  • Spider: A cross-domain dataset comprising 10,181 natural language queries across 200 databases.
  • BIRD: An extensive dataset with 12,751 unique question-SQL pairs, spanning 95 large databases.

Additionally, the codebase supports various popular datasets like WikiSQL, CoSQL, and Chase, promoting comprehensive and up-to-date evaluations.
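For orientation, a Spider-style training record pairs a question with its gold SQL and the target database. The sketch below shows that structure; the field names follow the public Spider JSON release, and the values are illustrative.

```python
# Sketch of a Spider-style training record (the training split is a JSON list of these).
import json

example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}

print(json.dumps([example], indent=2))
```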

Metrics: The suite uses exact-set-match accuracy (EM) and execution accuracy (EX) to assess model performance; EX is generally the more faithful measure because it compares the results of executing the queries against the database rather than exact token matches.
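A simplified sketch of how EX can be computed is shown below: execute the gold and predicted queries against the same SQLite database and compare result sets. This stands in for the official evaluation scripts and ignores details such as value normalization and partial credit.

```python
# Simplified execution-accuracy (EX) check for a single example.
import sqlite3

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        gold = conn.execute(gold_sql).fetchall()
        try:
            pred = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # predicted SQL fails to execute
        # Order-insensitive comparison of the two result sets
        return sorted(map(tuple, gold)) == sorted(map(tuple, pred))
    finally:
        conn.close()
```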

Base LLMs: DB-GPT-Hub benchmarks several prominent LLM families (GLM, Qwen, Baichuan, Llama) of varying sizes (7B to 70B parameters).

Benchmark Pipeline: A meticulously designed pipeline for dataset construction, training, prediction, and evaluation underpins the standardization and extensibility promised by DB-GPT-Hub.
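The four stages can be pictured as plain functions wired together, as in the toy sketch below. Function and field names are illustrative only and do not mirror the actual DB-GPT-Hub module layout; the "model" here is a dummy stand-in for a fine-tuned LLM.

```python
# Toy end-to-end sketch: dataset construction -> training -> prediction -> evaluation.

def build_examples(records):
    """Turn Spider-style records into prompt/response pairs for supervised fine-tuning."""
    return [
        {
            "prompt": f"-- database: {r['db_id']}\n-- question: {r['question']}\nSQL:",
            "response": r["query"],
        }
        for r in records
    ]

def train(examples):
    """Placeholder for PEFT fine-tuning (see the LoRA sketch above); returns a dummy model."""
    return lambda prompt: "SELECT count(*) FROM singer"

def predict(model, examples):
    """Generate one SQL string per prompt."""
    return [model(e["prompt"]) for e in examples]

def evaluate(predictions, examples):
    """Exact string match on normalized queries; EX would execute both queries instead."""
    hits = sum(p.strip().lower() == e["response"].strip().lower()
               for p, e in zip(predictions, examples))
    return hits / max(len(examples), 1)

records = [{"db_id": "concert_singer",
            "question": "How many singers do we have?",
            "query": "SELECT count(*) FROM singer"}]
examples = build_examples(records)
model = train(examples)
print(evaluate(predict(model, examples), examples))  # 1.0 for this toy example
```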

Experimental Results

The experimental analysis covers various aspects:

  • Model Performance: CodeLlama models, particularly CodeLlama-70B, consistently outperform others, showcasing the impact of training data and model architecture.
  • Tuning Improvements: Significant improvements from PEFT strategies highlight the adaptability of LLMs to high-quality text-to-SQL datasets.
  • Comparative Analysis: Tuning shows notable benefits over prompting methods, especially in simpler tasks and when datasets are less complex, confirming the value of dedicated training for such tasks.

Implications and Future Directions

The open benchmarking suite provided by DB-GPT-Hub addresses a critical need in the text-to-SQL research community by enabling systematic and reproducible evaluations of large-scale models. The modularity and extensibility of the codebase encourage continuous integration of emerging techniques and datasets, fostering a dynamic and robust testing environment.

Theoretical Implications: The extensive benchmarking and fine-tuning insights contribute to a deeper understanding of LLMs' capabilities in handling complex structured data tasks. These findings can inform the development of more sophisticated models and training paradigms.

Practical Implications: For practitioners, DB-GPT-Hub offers a practical toolkit to benchmark and fine-tune models efficiently, thereby propelling advancements in applications where converting natural language to structured queries is pivotal (e.g., data analytics, conversational agents).

Future Developments: The research outlines potential future work, including enhancing model capabilities for more complex dialogue and analytics scenarios and integrating additional training techniques such as continual learning and advanced prompt engineering.

In conclusion, DB-GPT-Hub represents a comprehensive and valuable resource that bridges existing gaps in LLM evaluation for text-to-SQL tasks. Its emphasis on robust benchmarking, combined with an open and flexible setup, sets a solid foundation for future research and practical advancements in this dynamic field.

Authors (11)
  1. Fan Zhou
  2. Siqiao Xue
  3. Danrui Qi
  4. Wenhui Shi
  5. Wang Zhao
  6. Ganglin Wei
  7. Hongyang Zhang
  8. Caigai Jiang
  9. Gangwei Jiang
  10. Zhixuan Chu
  11. Faqiang Chen