DB-GPT-Hub: Modular LLM Text-to-SQL Benchmark

Updated 26 February 2026

DB-GPT-Hub is an open, modular benchmarking platform for LLM-powered text-to-SQL systems that supports reproducible evaluation and scalable fine-tuning.
It integrates techniques like PEFT (LoRA, QLoRA), retrieval-augmented generation, and multi-agent orchestration to optimize performance and ensure privacy-preserving deployments.
Designed for rigorous testing on benchmarks such as Spider and BIRD, the platform enables fair comparisons across models ranging from 7B to 70B parameters.

DB-GPT-Hub is an open, modular benchmarking and deployment platform targeted at LLM-empowered text-to-SQL systems. It is designed to rigorously evaluate, fine-tune, and operationalize LLMs for natural-language-to-database and broader data interaction workflows. The system prioritizes scalability across model sizes (7B–70B), task-specific controllability, extensibility to new data/models, and reproducible evaluation on established and emergent benchmarks. It integrates solutions for privacy-preserving local deployment, retrieval-augmented generation (RAG), multi-agent orchestration, and comprehensive cost–performance analysis, consolidating prior advances in LLM data systems under a transparent, extensible hub architecture (Zhou et al., 2024, Xue et al., 2023, Xue et al., 2024).

1. Motivation and Benchmarking Gaps in Text-to-SQL

Recent advances in text-to-SQL have been dominated by LLMs employing zero- or few-shot prompting strategies, particularly on benchmarks like Spider. However, empirical results reveal that fine-tuning such models on curated text-to-SQL data yields consistent, significant gains in execution accuracy, particularly for lower-complexity queries and scenarios where cost and latency constraints preclude commercial API-based inference (Zhou et al., 2024). Despite the importance of tuning-based approaches, the community has lacked an open, standardized platform for large-scale, reproducible benchmarking and extensibility across model and dataset regimes. DB-GPT-Hub directly addresses this gap, enabling systematic, cost-aware evaluation, enhancing reproducibility, and providing rigorous comparisons between prompting and tuning regimes under controlled, containerized settings (Zhou et al., 2024).

2. System Architecture and Modularity

Pipeline Overview

DB-GPT-Hub comprises a multi-stage, modular pipeline:

Data Ingestion and Prompt Construction: Each input triple (natural-language question $q$, SQL $s$, schema $D$) is transformed into a Text Representation Prompt (TRP) encoding instructions, schema, question, and a response separator (Zhou et al., 2024).
Model Fine-Tuning: The system supports parameter-efficient fine-tuning (PEFT), notably LoRA and QLoRA, across HuggingFace-compatible open LLMs (e.g., ChatGLM3, Qwen, Baichuan2, LLaMA2, CodeLLaMA). All hyperparameters (learning rate, rank, epochs, context and response lengths) are standardized to ensure fair comparison (Zhou et al., 2024).
Inference and Evaluation: Zero-shot, few-shot (1/3/5 examples), and fine-tuned modes are supported. Post-processing extracts SQL from the model outputs, and benchmark metrics including Exact Set Match (EM) and Execution Accuracy (EX) are then computed.
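The prompt-construction and post-processing stages above can be sketched as follows. This is a minimal illustration, not the project's actual template: the section markers, the separator string, and the function names `build_trp` and `extract_sql` are all assumptions introduced here.

```python
from typing import Sequence, Tuple

# Hypothetical separator; the real TRP template may use a different marker.
RESPONSE_SEPARATOR = "### Response:"


def build_trp(question: str, schema: str,
              examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble instruction, schema, optional few-shot examples, question, separator."""
    parts = [
        "### Instruction:",
        "Write a SQL query that answers the question using the schema below.",
        "### Schema:",
        schema.strip(),
    ]
    # Few-shot mode: each (question, SQL) demonstration precedes the real question.
    for ex_question, ex_sql in examples:
        parts += ["### Question:", ex_question, RESPONSE_SEPARATOR, ex_sql]
    parts += ["### Question:", question.strip(), RESPONSE_SEPARATOR]
    return "\n".join(parts)


def extract_sql(generation: str) -> str:
    """Keep the text after the last separator, cut at the first ';', collapse whitespace.

    Splitting on ';' is naive (it ignores semicolons inside string literals).
    """
    tail = generation.rsplit(RESPONSE_SEPARATOR, 1)[-1]
    first_statement = tail.split(";", 1)[0]
    return " ".join(first_statement.split())


prompt = build_trp(
    "How many singers are there?",
    "CREATE TABLE singer (id INT, name TEXT);",
)
sql = extract_sql(prompt + "\nSELECT count(*)\nFROM singer;")
```

Passing one, three, or five `(question, SQL)` pairs through `examples` would correspond to the few-shot settings; an empty sequence gives the zero-shot prompt.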

Extensibility and Integration

The codebase is written in PyTorch and openly published. Key modules cover datasets, model wrappers and PEFT injection, distributed/multi-GPU training, prompt templates, and evaluation harnesses. New datasets can be incorporated via dataset-specific parsers and corresponding TRP templates; LLMs are registered with their tokenizers and configs for seamless invocation in training/evaluation commands. Feature enhancements such as alternative prompting mechanisms (e.g., chain-of-thought) or expanded evaluation metrics can be readily inserted (Zhou et al., 2024).
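One common way to realize this kind of plug-in extensibility is a name-to-parser registry. The sketch below is an assumption about how such a mechanism could look, not the project's actual code; `register_dataset`, `load_examples`, and the record field names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[str, str, str]  # (question, sql, schema)
Parser = Callable[[Iterable[dict]], List[Example]]

# Hypothetical global registry mapping a dataset name to its parser.
DATASET_PARSERS: Dict[str, Parser] = {}


def register_dataset(name: str) -> Callable[[Parser], Parser]:
    """Decorator that records a dataset-specific parser under `name`."""
    def decorator(fn: Parser) -> Parser:
        if name in DATASET_PARSERS:
            raise ValueError(f"dataset {name!r} already registered")
        DATASET_PARSERS[name] = fn
        return fn
    return decorator


@register_dataset("toy_spider_like")
def parse_toy(records: Iterable[dict]) -> List[Example]:
    # Field names here are illustrative, not a real dataset's layout.
    return [(r["question"], r["query"], r["schema"]) for r in records]


def load_examples(name: str, records: Iterable[dict]) -> List[Example]:
    """Look up the parser by name and normalize records into triples."""
    if name not in DATASET_PARSERS:
        raise KeyError(f"unknown dataset {name!r}; registered: {sorted(DATASET_PARSERS)}")
    return DATASET_PARSERS[name](records)


examples = load_examples(
    "toy_spider_like",
    [{"question": "How many rows?", "query": "SELECT COUNT(*) FROM t", "schema": "t(id)"}],
)
```

With this pattern, adding a corpus amounts to writing one parser function and decorating it; the training and evaluation entry points only need the dataset name.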

Core Architectural Stack

| Subsystem | Description | Supported Technologies |
|---|---|---|
| Protocol Layer | Multi-agent DSL for workflows | AWEL (DAG-based) |
| Module Layer | Model management, RAG, orchestration | SMMF, vector + inverted KB |
| Server Layer | API/gRPC frontend | HTTP/gRPC, metadata hints |
| Application Layer | Prebuilt analytic/data tasks | Text-to-SQL, chat2DB/Excel |
| Visualization | Output rendering/UI | React, interactive widgets |

(Xue et al., 2024, Xue et al., 2023)
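To make the DAG-based workflow idea in the Protocol Layer concrete, the sketch below runs a handful of toy "agents" in dependency order using Python's standard `graphlib`. It does not use AWEL's actual API; the function `run_dag`, the task names, and their outputs are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_dag(tasks: Dict[str, Callable[[dict], object]],
            deps: Dict[str, List[str]]) -> dict:
    """Run every task after its dependencies; each task sees all earlier results."""
    results: dict = {}
    # `deps` maps each node to the nodes that must finish before it.
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name](results)
    return results


# Toy agents: a planner, a SQL writer, a chart generator, and an aggregator.
tasks = {
    "plan": lambda r: ["fetch sales", "aggregate by region"],
    "sql": lambda r: "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "chart": lambda r: {"type": "bar", "query": r["sql"]},
    "report": lambda r: {"steps": len(r["plan"]), "chart": r["chart"]["type"]},
}
deps = {"plan": [], "sql": ["plan"], "chart": ["sql"], "report": ["sql", "chart"]}

outputs = run_dag(tasks, deps)
```

A cyclic dependency specification would make `TopologicalSorter` raise `graphlib.CycleError`, which is the property that keeps such workflows well defined.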

3. Benchmark Suite, Evaluation Protocols, and Supported Models

Datasets and Complexity Regimes

Spider: 8.6k training examples; multi-table and cross-domain, with SQL difficulty stratified into Easy/Medium/Hard/Extra.
BIRD: 12.7k question-SQL pairs over 95 large, real-world relational databases, spanning a range of complexity.
Additional support: WikiSQL (single-table), CoSQL (conversational), Chase (Chinese), extensible to new corpora (Zhou et al., 2024).

Evaluation Metrics

For $N$ examples with predicted SQL $\hat{y}_i$ and gold SQL $y_i$, where $exec(\cdot)$ runs a query against the database and returns its result set:

$\mathrm{ExactMatch} = \frac{1}{N}\sum_{i=1}^N \mathbb{1}\{\hat y_i = y_i\}, \quad \mathrm{ExecAcc} = \frac{1}{N}\sum_{i=1}^N \mathbb{1}\{exec(\hat y_i)=exec(y_i)\}.$

Some benchmarks (e.g., BIRD) also report the Valid Efficiency Score (VES), which weights each correct SQL prediction by an execution-time ratio relative to the gold query (Ma et al., 6 Mar 2025).
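The two core metrics can be sketched with the standard `sqlite3` module. Two simplifications are assumed here: exact match is approximated by comparing whitespace- and case-normalized strings (official Spider-style evaluation compares parsed SQL components instead), and result sets are compared as unordered multisets. The helper names and the toy table are hypothetical.

```python
import sqlite3
from collections import Counter
from typing import List, Optional, Tuple


def normalize(sql: str) -> str:
    """Crude normalization: lowercase, drop a trailing ';', collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def run(conn: sqlite3.Connection, sql: str) -> Optional[Counter]:
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def evaluate(conn: sqlite3.Connection,
             pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (ExactMatch, ExecAcc) over (predicted, gold) SQL pairs."""
    if not pairs:
        raise ValueError("need at least one example")
    em = sum(normalize(pred) == normalize(gold) for pred, gold in pairs)
    ex = 0
    for pred, gold in pairs:
        pred_rows = run(conn, pred)
        # A prediction that fails to execute never counts as correct.
        if pred_rows is not None and pred_rows == run(conn, gold):
            ex += 1
    return em / len(pairs), ex / len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE singer(id INTEGER, country TEXT);"
    "INSERT INTO singer VALUES (1, 'US'), (2, 'FR'), (3, 'US');"
)
pairs = [
    ("SELECT COUNT(*) FROM singer", "select count(*) from singer;"),  # EM and EX
    ("SELECT COUNT(id) FROM singer", "SELECT COUNT(*) FROM singer"),  # EX only
    ("SELECT nope FROM singer", "SELECT id FROM singer"),             # neither
]
em, ex = evaluate(conn, pairs)
```

The second pair illustrates why EX is usually at least as high as EM: syntactically different queries can still return the same result set.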

LLM Backbone Coverage

ChatGLM3-6B
Qwen (7B/14B/72B-Chat)
Baichuan2 (7B/13B-Chat)
LLaMA2 (7B/13B/70B) and CodeLLaMA
HuggingFace-style adapters enable transparent backbone switching (Zhou et al., 2024).

4. Parameter-Efficient Tuning and Comparative Performance

Tuning Methodology

PEFT: LoRA injects trainable low-rank adaptation matrices into transformer blocks while keeping the base weights frozen, so only a small fraction of the parameters is updated during fine-tuning. QLoRA additionally quantizes the frozen base weights to 4-bit, reducing VRAM at the cost of a moderate increase in training time (Zhou et al., 2024).
Objective: Cross-entropy on tokenized SQL, subject to hyperparameters unified across runs.
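As a rough illustration of both points: in the standard LoRA formulation a frozen weight $W \in \mathbb{R}^{d_{out} \times d_{in}}$ is augmented with a product $BA$ of rank $r$ (up to a scaling factor), so only $r(d_{out} + d_{in})$ parameters are trained per adapted matrix. The sketch below counts those parameters and computes a cross-entropy averaged over SQL tokens only. The layer size, the rank, and the assumption that prompt tokens are masked out of the loss are illustrative, not taken from the source.

```python
import math
from typing import Sequence


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """B has shape (d_out, rank) and A has shape (rank, d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the full d_out x d_in matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


def sql_cross_entropy(token_probs: Sequence[float],
                      is_sql_token: Sequence[bool]) -> float:
    """Mean negative log-likelihood over positions flagged as SQL tokens.

    `token_probs[i]` is the model's probability for the correct token at step i.
    """
    picked = [p for p, keep in zip(token_probs, is_sql_token) if keep]
    if not picked:
        raise ValueError("no SQL tokens to score")
    return -sum(math.log(p) for p in picked) / len(picked)


# Hypothetical 4096 x 4096 projection adapted with rank 64.
n_trainable = lora_trainable_params(4096, 4096, 64)
fraction = trainable_fraction(4096, 4096, 64)

# One prompt token (ignored) followed by two SQL tokens.
loss = sql_cross_entropy([0.9, 0.5, 0.5], [False, True, True])
```

For this hypothetical layer the adapter trains about 3% of the parameters of the full matrix, which is the source of LoRA's memory and speed advantages.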

Empirical Findings

| Model & Scale | Spider EX (Base ➝ LoRA) | BIRD EX (Base ➝ LoRA) |
|---|---|---|
| Llama2-7B | 0.000 ➝ 0.626 | 0.000 ➝ 0.169 |
| CodeLLaMA-7B | 0.149 ➝ 0.702 | 0.085 ➝ 0.237 |
| CodeLLaMA-70B | 0.567 ➝ 0.771 | – |
| Qwen-14B | – | – |

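With the base Qwen-7B, prompting alone yields EX of 22.9% in the 0-shot setting and 33.8% in the 5-shot setting, whereas after LoRA tuning the corresponding figures are 65.3% (0-shot) and 63.5% (5-shot) (Zhou et al., 2024). In-context examples therefore help the untuned model but add little once the model has been fine-tuned.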

Resource and Scalability Analysis

QLoRA reduces memory usage substantially (e.g., Llama2-7B: 23.5 ➝ 16.9 GB, roughly a 28% reduction), with a ~1.5x increase in training time and a negligible drop in EX.
Large models (70B) achieve the highest EX (~0.77 on Spider), but appropriately PEFT-tuned 13B models often outperform much larger prompt-only variants (Zhou et al., 2024).
For practitioners constrained by hardware, 7B/13B QLoRA-tuned models offer strong accuracy with modest requirements.
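The arithmetic behind the memory figures can be checked with a short sketch. The first helper computes the relative saving from the reported Llama2-7B numbers; the second is a back-of-the-envelope estimate of weight storage alone, assuming 7e9 parameters and ignoring activations, adapter weights, optimizer state, and framework overhead, which is why the measured total shrinks by much less than the 4x suggested by the bit width.

```python
def relative_reduction(before_gb: float, after_gb: float) -> float:
    """Fraction of memory saved when going from `before_gb` to `after_gb`."""
    return 1.0 - after_gb / before_gb


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for the weights only, in GB (1 GB taken as 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Reported peak memory for Llama2-7B: LoRA vs. QLoRA.
saving = relative_reduction(23.5, 16.9)

# Rough weight-only estimates for a 7e9-parameter model (an assumption).
fp16_weights = weight_memory_gb(7e9, 16)
int4_weights = weight_memory_gb(7e9, 4)
```

Under these assumptions the weights alone drop from about 14 GB to about 3.5 GB, while the end-to-end saving computed from the reported figures is about 28%.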

5. Integration with Automated Instruction Synthesis and Schema Alignment

DB-GPT-Hub systems can leverage instruction mining frameworks such as DB-Explore, which construct relational database graphs $G = (V, E, w)$ (where edge weights $w(e)$ are proportional to edge co-occurrence frequencies), and sample subgraphs for systematic instruction synthesis (Ma et al., 6 Mar 2025). Three axes of instruction generation are supported:

Semantic Knowledge Extraction: Prompting LLMs with schema subgraphs and curated NL-SQL pairs to generate diverse QA triplets.
Structural Pattern Mining: Targeted sampling of subgraphs containing foreign keys and complex joins to expose the LLM to realistic query structures.
Progressive Instruction Synthesis: Building SQL ASTs by iteratively grafting new conditions/joins, then generating matching NL annotations.

After validation and execution screening, the resulting corpus supports cascaded fine-tuning: first schema linking (predicting the pertinent schema elements from the NL query and the full database schema), then SQL generation (mapping the NL query and linked schema to SQL). The synthesized instructions and this two-stage procedure are directly compatible with DB-GPT-Hub’s fine-tuning modules and enhance schema comprehension at all model sizes (Ma et al., 6 Mar 2025).
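The weighted graph $G = (V, E, w)$ and the subgraph sampling step can be sketched as follows. Several assumptions are made for illustration: nodes are taken to be tables, edge weights are raw co-occurrence counts of table pairs within a query workload, and sampling grows a connected subgraph by weight-proportional choice. None of these details, nor the function names, are confirmed by the source.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]


def build_graph(query_tables: Sequence[Sequence[str]]) -> Counter:
    """Count how often each unordered pair of tables appears in the same query."""
    weights: Counter = Counter()
    for tables in query_tables:
        for a, b in combinations(sorted(set(tables)), 2):
            weights[(a, b)] += 1
    return weights


def neighbors(weights: Dict[Edge, int], node: str) -> Dict[str, int]:
    """Map each neighbor of `node` to the weight of the connecting edge."""
    out: Dict[str, int] = {}
    for (a, b), w in weights.items():
        if a == node:
            out[b] = w
        elif b == node:
            out[a] = w
    return out


def sample_subgraph(weights: Dict[Edge, int], start: str, size: int,
                    rng: random.Random) -> List[str]:
    """Grow a connected node set from `start`, favoring heavily weighted edges."""
    chosen = [start]
    while len(chosen) < size:
        frontier: Counter = Counter()
        for node in chosen:
            for nb, w in neighbors(weights, node).items():
                if nb not in chosen:
                    frontier[nb] += w
        if not frontier:
            break  # no reachable table left
        names = sorted(frontier)
        pick = rng.choices(names, weights=[frontier[n] for n in names], k=1)[0]
        chosen.append(pick)
    return chosen


workload = [
    ["orders", "customers"],
    ["orders", "customers", "products"],
    ["orders", "products"],
    ["reviews"],
]
graph = build_graph(workload)
subgraph = sample_subgraph(graph, "orders", 3, random.Random(0))
```

Each sampled table set would then be handed to an LLM, together with seed NL-SQL pairs, to synthesize new instructions over that portion of the schema.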

6. Multi-Agent, Privacy-Preserving, and Product-Ready Deployment

DB-GPT-Hub draws on architectural features matured in DB-GPT and associated systems:

Privacy and Controllability: All inference, fine-tuning, and RAG operations can execute in local or private cloud environments. The Service-Oriented Multi-Model Framework (SMMF) handles model deployment, access control, and trusted execution, with models served through inference backends such as vLLM, TensorRT, or HuggingFace TGI (Xue et al., 2023, Xue et al., 2024).
Multi-Agent Orchestration: The Agentic Workflow Expression Language (AWEL) provides a DAG-based DSL for specifying agentic analytic workflows; specialized agents (planner, chart generator, aggregator) communicate via a persisted event bus and can be orchestrated with fine control (Xue et al., 2024).
Retrieval-Augmented Generation (RAG): Vector, inverted, and graph indexes are constructed from private documents; in-context learning leverages top-ranked retrieved passages to improve answer recall and SQL grounding (Xue et al., 2023). All knowledge retrieval and augmentation operations can be locally executed and configured for hybrid retrieval strategies.
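One simple way to combine rankings from different indexes is reciprocal rank fusion (RRF). The source does not say which fusion method DB-GPT uses, so RRF is swapped in here purely as an illustration of a hybrid retrieval strategy; the document IDs, the constant `k=60`, and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60, top_n: int = 3) -> List[str]:
    """Score each document by the sum of 1 / (k + rank) over all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical results from a vector index and an inverted (keyword) index.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]

context_ids = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The passages behind the fused IDs would then be placed in the prompt as in-context evidence; documents that rank well in both indexes rise to the top.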

7. Extensibility, Best Practices, and Future Directions

DB-GPT-Hub emphasizes reproducibility and extensibility:

Codebase Practices: Fully containerized pipelines with version-controlled hyperparameters, seed logging, and transparent evaluation scripts (Zhou et al., 2024).
Addition of New Resources: Plug-and-play integration of new LLMs, datasets, and prompt strategies through configuration and registry modules.
Extensible Infrastructure: Support for distributed and cloud-based execution (Ray, Docker/Kubernetes), as well as REST APIs and a React-based front end for integration into product environments.
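The reproducibility practices listed above (version-controlled hyperparameters and seed logging) can be illustrated with a small sketch. The `RunConfig` fields, their values, and the fingerprinting scheme are hypothetical and are not the project's actual configuration format.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # Illustrative fields only; a real configuration would carry more settings.
    model_name: str
    dataset: str
    learning_rate: float
    lora_rank: int
    epochs: int
    seed: int


def config_fingerprint(cfg: RunConfig) -> str:
    """Short, stable hash of the configuration for tagging logs and checkpoints."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def start_run(cfg: RunConfig) -> dict:
    """Seed the RNG and return the record that would be written to the run log."""
    # A real training run would also seed NumPy and PyTorch here.
    random.seed(cfg.seed)
    return {"config": asdict(cfg), "fingerprint": config_fingerprint(cfg)}


cfg = RunConfig("example-7b", "spider", 2e-4, 64, 8, seed=42)  # hypothetical values
record = start_run(cfg)
first_draw = random.random()

# Restarting with the same config reproduces the same random stream.
start_run(cfg)
second_draw = random.random()
```

Logging the fingerprint next to every reported metric makes it possible to tell at a glance whether two results came from identical settings.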

Prospective Advances

Planned directions include benchmarking more advanced SQL (e.g., window functions, recursive CTEs), integrating sequential or online multi-agent decision-making, and exploring continual learning or hypernetwork augmentation to minimize full retraining costs. A plausible implication is that standardized, reproducible analysis of diverse model/benchmark/task settings will enable robust advances in text-to-SQL and data-centric LLM system deployments (Zhou et al., 2024).

References:

(Zhou et al., 2024) DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models
(Ma et al., 6 Mar 2025) DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL
(Xue et al., 2023) DB-GPT: Empowering Database Interactions with Private Large Language Models
(Xue et al., 2024) Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models
