- The paper introduces a two-stage method that combines schema hallucination with dense retrieval to bridge natural language queries and extensive database schemas.
- The technique improves recall by up to 6% on challenging datasets such as SocialDB while reducing the size of the schema subset that must be processed.
- Empirical evaluations on semi-synthetic and large-scale benchmarks demonstrate improved text-to-SQL translation accuracy and resource efficiency.
Overview of CRUSH4SQL Technique
CRUSH4SQL (Collective Retrieval Using Schema Hallucination) is a text-to-SQL generation method for large databases with very many columns, a setting where existing methods struggle because encoding the entire DB schema is costly or impractical. Its novelty lies in a two-stage process: an LLM first hallucinates a minimal DB schema for the question, and a retrieval step then uses that hallucinated schema to select a high-recall subset of the actual database schema.
LLM Schema Hallucination
Hallucination is typically viewed negatively in LLM outputs. However, CRUSH4SQL leverages hallucination as a substantive bridge between natural language queries and relevant schema elements. An LLM is prompted to generate a hallucinated schema that would plausibly support the query in question. This schema is not intended to be perfect or real but to serve as a proxy to enhance retrieval from the actual database schema.
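The hallucination step can be illustrated with a minimal sketch. The prompt wording, the `table(column, ...)` output format, and the helper names `build_prompt` and `parse_hallucinated_schema` are assumptions made for illustration; they are not taken from the paper, and no actual LLM call is made here.

```python
import re
from typing import List

# Hypothetical prompt; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "Hallucinate a minimal database schema that could answer the question.\n"
    "Write one table per line as: table_name(column_a, column_b)\n"
    "Question: {question}\n"
    "Schema:"
)

def build_prompt(question: str) -> str:
    """Fill the (assumed) prompt template with the user's question."""
    return PROMPT_TEMPLATE.format(question=question.strip())

def parse_hallucinated_schema(llm_output: str) -> List[str]:
    """Turn lines like 'Singer(name, country)' into 'table.column' strings."""
    elements = []
    for match in re.finditer(r"(\w+)\s*\(([^)]*)\)", llm_output):
        table = match.group(1).lower()
        for column in match.group(2).split(","):
            column = column.strip().lower()
            if column:
                elements.append(f"{table}.{column}")
    return elements

# A made-up LLM response, used only to exercise the parser.
example_output = "Singer(name, country)\nConcert(singer_id, year)"
segments = parse_hallucinated_schema(example_output)
```

Each resulting `table.column` string can then serve as a separate retrieval query against the real schema, which is why the hallucinated names need only be plausible rather than correct.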
Collectively Retrieving Schema Subsets
Once a hallucinated schema is obtained, dense retrieval is used to match its elements to the true DB schema and select a subset of schema elements for subsequent processing. A novel contribution of CRUSH4SQL is the formulation of a combinatorial optimization that maximizes coverage of the hallucinated schema and rewards connectivity within the subset according to the schema graph. The goal is to minimize the subset size while maintaining high recall, which is critical both for cost-effectiveness in LLM-as-a-service scenarios and for accuracy in Text-to-SQL conversion.
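The idea of selecting a small subset that covers the hallucinated schema and stays connected can be sketched with a simple greedy procedure. This is a stand-in written for illustration, not the paper's actual optimization objective: the function `select_subset`, the `link_bonus` weight, and the toy three-dimensional embeddings are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_subset(query_vecs, schema_vecs, edges, budget, link_bonus=0.1):
    """Greedily pick up to `budget` schema elements.

    Each step adds the element with the largest marginal gain in coverage of
    the hallucinated segments, plus a small bonus (an assumed weight) if it is
    linked in the schema graph to an element already selected.
    """
    sims = {name: [cosine(q, vec) for q in query_vecs]
            for name, vec in schema_vecs.items()}
    selected = []
    best = [0.0] * len(query_vecs)  # best similarity so far per segment

    def gain(name):
        cover = sum(max(s - b, 0.0) for s, b in zip(sims[name], best))
        linked = any((name, s) in edges or (s, name) in edges for s in selected)
        return cover + (link_bonus if linked else 0.0)

    while len(selected) < min(budget, len(schema_vecs)):
        candidates = [name for name in schema_vecs if name not in selected]
        choice = max(candidates, key=gain)
        selected.append(choice)
        best = [max(b, s) for b, s in zip(best, sims[choice])]
    return selected

# Toy embeddings (made up); a real system would use a dense text encoder.
schema_vecs = {
    "singer.name": [1.0, 0.0, 0.0],
    "singer.country": [0.0, 1.0, 0.0],
    "album.title": [0.9, 0.1, 0.0],
    "concert.year": [0.0, 0.0, 1.0],
}
query_vecs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # two hallucinated segments
edges = {("singer.name", "singer.country")}       # same-table link

top2 = select_subset(query_vecs, schema_vecs, edges, budget=2)
top3 = select_subset(query_vecs, schema_vecs, edges, budget=3)
```

With a budget of two, the sketch picks one element per hallucinated segment and skips `album.title`, which is similar to the first segment but adds no new coverage. Ranking each segment's nearest neighbours independently would tend to spend budget on such near-duplicates, which is the motivation for retrieving collectively.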
Empirical Validation of CRUSH4SQL
CRUSH4SQL is evaluated against existing methods on three newly introduced benchmarks: two semi-synthetic datasets that combine schemas from the SPIDER and BIRD benchmarks, and SocialDB, an actual large-scale data warehouse. Across these benchmarks, CRUSH4SQL consistently achieves higher recall of the gold schema. For instance, on the challenging SocialDB dataset it improves recall by up to 6% over the state of the art at modest retrieval budgets. Moreover, the smaller retrieved schema subset directly improves Text-to-SQL translation accuracy in subsequent stages.
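To make "recall of the gold schema" concrete, the following sketch computes one common version of the metric: the fraction of gold schema elements that appear in the retrieved subset. The exact metric definition used in the paper is not given in this summary, so the function `schema_recall` and the example element names are assumptions.

```python
def schema_recall(retrieved, gold):
    """Fraction of gold schema elements present in the retrieved subset."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0  # nothing to recall, so treat as fully covered
    return len(gold_set & set(retrieved)) / len(gold_set)

# Made-up example: the gold SQL needs three columns, two were retrieved.
gold = ["singer.name", "concert.year", "concert.singer_id"]
retrieved = ["singer.name", "concert.year", "album.title", "singer.country"]
recall = schema_recall(retrieved, gold)
```

Under this definition, a larger retrieval budget can only keep recall the same or raise it, so comparisons between methods are meaningful only at a fixed budget.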
Conclusion
In summary, CRUSH4SQL is a practical approach to text-to-SQL conversion for large databases. By combining schema hallucination with careful subset retrieval, it saves computational resources without sacrificing recall or translation accuracy. Its results on both semi-synthetic and real-world benchmarks suggest potential for wider adoption in Text-to-SQL applications that must handle databases with extensive schemas.