Synthesizing Text-to-SQL Data from Weak and Strong LLMs
The paper "Synthesizing Text-to-SQL Data from Weak and Strong LLMs" by Jiaxi Yang et al. addresses the persistent performance gap between open-source and closed-source LLMs on text-to-SQL tasks by combining synthetic data generated from both strong and weak models. The core contribution is the text-to-SQL model \sense, which shows substantial improvements in domain generalization and robustness.
Summary of Approach and Methods
Background and Motivation
The ability to transform natural language queries into SQL statements facilitates non-expert interaction with databases. Although closed-source LLMs like GPT-4 have shown impressive results in text-to-SQL tasks, privacy concerns, cost barriers, and lack of openness limit their applicability. Conversely, the performance of open-source LLMs lags significantly behind their proprietary counterparts. To bridge this gap, the authors propose a synthetic data strategy combining data from both strong and weak models.
Synthetic Data Generation
The authors distinguish between strong data, synthesized by capable models such as GPT-4, and weak data, generated by smaller, less-aligned models. Strong data is used to improve cross-domain generalization, ensuring the model can handle diverse queries and schemas; its generation is deliberately steered away from overrepresented domains and spans varied difficulty levels.
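The domain- and difficulty-steering described above can be sketched as a simple prompt-construction loop. This is illustrative only: the domain list, difficulty labels, and prompt wording below are assumptions for the sketch, not the paper's actual prompts.

```python
import random

# Hypothetical steering knobs; the paper's real domain/difficulty
# taxonomy is not reproduced here.
DOMAINS = ["healthcare", "logistics", "sports", "finance"]
DIFFICULTIES = ["easy", "medium", "hard", "extra"]

def strong_data_prompt(rng=random):
    """Build one generation prompt for a strong model (e.g. GPT-4),
    varying domain and difficulty to diversify the synthetic corpus."""
    domain = rng.choice(DOMAINS)
    difficulty = rng.choice(DIFFICULTIES)
    return (
        f"Create a new {domain} database schema, then write one {difficulty} "
        "natural-language question about it together with the correct SQL query. "
        "Avoid domains that are already common in Spider-style datasets."
    )
```

Sampling the domain and difficulty per prompt, rather than letting the model default to familiar topics, is what flattens the long tail of the resulting domain distribution.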
In contrast, weak data capitalizes on the erroneous outputs of weaker models. These models naturally produce valuable incorrect SQL samples: an SQL executor validates each candidate query, and the resulting correct and incorrect queries serve as instructional material for learning from mistakes. Preference learning is then applied to teach the model to distinguish correct from incorrect SQL, reinforcing what it learned during supervised training.
Model Training and Evaluation
The authors fine-tune an open-source base model, CodeLLaMA, in two stages. First, supervised fine-tuning (SFT) on the strong data builds core text-to-SQL capability from diverse, high-quality SQL examples. Then, preference learning on the weak data refines the model's predictions using execution feedback, akin to learning from human error correction.
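The second stage can be sketched with a DPO-style pairwise objective on one (correct, incorrect) SQL pair: the policy is pushed toward the executor-validated query and away from the failing one, relative to the SFT reference model. Treating the objective as DPO, and the `beta` value, are assumptions of this sketch rather than confirmed details of the paper.

```python
import math

def preference_loss(logp_w_policy, logp_l_policy,
                    logp_w_ref, logp_l_ref, beta=0.1):
    """DPO-style loss for one preference pair: w = correct SQL,
    l = incorrect SQL; log-probs come from the policy and the
    frozen SFT reference model."""
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    # -log sigmoid(margin): small when the policy already prefers w
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree (zero margin), the loss is log 2; preferring the correct query drives it toward zero, and preferring the incorrect one drives it up.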
The resulting specialized models, \sense{-7B} and \sense{-13B}, are evaluated against prominent text-to-SQL benchmarks such as SPIDER and BIRD, revealing marked improvements. \sense achieves state-of-the-art (SOTA) results, competing robustly even with methods built on closed-source LLMs. The models also perform strongly on the robustness datasets SYN, REALISTIC, and DK, highlighting the efficacy of synthetic data in handling real-world complexity.
Experimental Results
Several key points illustrate the successful outcome of the proposed methodology:
- Performance Metrics:
- On the SPIDER benchmark, \sense{-13B} achieved an execution (EX) accuracy of 84.1% on the development set, closely rivaling closed-source models.
- On the BIRD benchmark, \sense{-13B} outperformed existing models with a test EX accuracy of 63.4%.
- Robustness:
- For SYN, REALISTIC, and DK robustness benchmarks, \sense{-13B} achieved an average score of 77.7%, demonstrating superior handling of diverse and challenging SQL generation scenarios.
- Cross-Domain Generalization:
- The synthetic data's diversity, reflected in a broader, longer-tailed domain distribution, significantly enhances the model's cross-domain generalization capabilities.
Implications and Future Research
The implications of this research are substantial, both practically and theoretically:
Practical Implications
- Democratization of Data Access: By narrowing the performance gap between open and closed-source models, \sense facilitates broader access and applicability of advanced text-to-SQL capabilities without the constraints imposed by proprietary systems.
- Cost and Privacy Considerations: Leveraging open-source models mitigates concerns related to substantial costs and privacy issues associated with closed-source LLMs, promoting wider adoption, especially in sensitive or resource-limited environments.
Theoretical Implications
- Model Training Paradigms: The combination of supervised fine-tuning with preference learning from synthetic data offers a novel paradigm for training models, potentially applicable beyond text-to-SQL tasks.
- Domain Adaptation and Generalization: The paper underlines the efficacy of synthetic, diverse data in enhancing domain adaptation, pointing to potential research areas in improving model performance through synthesized, varied data sources.
Conclusion and Speculations on Future Developments
The research by Jiaxi Yang et al. presents a compelling approach to bridging gaps in text-to-SQL model performance using synthetic data from both strong and weak LLMs. While the models achieve SOTA results in the text-to-SQL domain, future research could explore extending this methodology to other areas of AI, such as code generation, question answering, or even more complex NLP tasks. Given the demonstrated benefits of combining strong and erroneous synthetic data, this dual approach might offer wider potential for refining and advancing LLMs in general.