CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL (2311.01173v1)

Published 2 Nov 2023 in cs.CL

Abstract: Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema deemed adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination – generally considered a nuisance – turns out to be actually useful as a bridging mechanism. Since no existing benchmarks exist for schema subsetting on large databases, we introduce three benchmarks. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method leads to significantly higher recall than SOTA retrieval-based augmentation methods.

Summary

The paper introduces a two-stage method that combines schema hallucination with dense retrieval to bridge natural language queries and extensive database schemas.
The technique improves recall by up to 6% over the state of the art on the challenging SocialDB benchmark while keeping the retrieved schema subset small.
Empirical evaluations on semi-synthetic and large-scale benchmarks demonstrate improved text-to-SQL translation accuracy and resource efficiency.

Overview of CRUSH4SQL Technique

CRUSH4SQL (Collective Retrieval Using Schema Hallucination) is a text-to-SQL method aimed at databases with very large numbers of columns, a setting where existing methods struggle because encoding the entire DB schema with the user text is costly or impractical. Its key idea is a two-stage process: an LLM first hallucinates a minimal DB schema for the query, and a retrieval step then uses that schema to select a high-recall subset of the actual database schema.

LLM Schema Hallucination

Hallucination is typically viewed negatively in LLM outputs. However, CRUSH4SQL leverages hallucination as a substantive bridge between natural language queries and relevant schema elements. An LLM is prompted to generate a hallucinated schema that would plausibly support the query in question. This schema is not intended to be perfect or real but to serve as a proxy to enhance retrieval from the actual database schema.
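The hallucination step can be illustrated with a minimal sketch. The prompt wording, the `Table(col1, col2)` output format, and the function names below are assumptions made for illustration; they are not the prompt or parser used by the authors. The sketch only shows how a question might be turned into a prompt and how the LLM's free-text answer might be flattened into `table.column` strings that a retriever can embed.

```python
import re

# Hypothetical instruction text; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "Hallucinate a minimal database schema that would suffice to answer "
    "the question below. Write one table per line as Table(col1, col2, ...).\n"
    "Question: {question}\n"
    "Schema:"
)


def build_prompt(question: str) -> str:
    """Fill the (assumed) template with the user's question."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_hallucinated_schema(text: str) -> list[str]:
    """Turn lines like 'Singer(name, age)' into ['Singer.name', 'Singer.age']."""
    elements = []
    for match in re.finditer(r"(\w+)\s*\(([^)]*)\)", text):
        table = match.group(1)
        for col in match.group(2).split(","):
            col = col.strip()
            if col:  # skip empty entries such as trailing commas
                elements.append(f"{table}.{col}")
    return elements
```

Each returned string would then serve as a separate query against the index of real schema elements, which is why the hallucinated names need not match anything in the actual database.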

Collectively Retrieving Schema Subsets

Once a hallucinated schema is obtained, dense retrieval is used to match each of its elements against the true DB schema, and the results of these retrievals are composed into a single subset of schema elements for subsequent processing. A novel contribution of CRUSH4SQL is the formulation of a combinatorial optimization that maximizes the coverage of the hallucinated schema and favors connectivity within the subset according to the schema graph. The goal is to minimize the subset size while maintaining high recall, critical for both cost-effectiveness in LLM-as-a-service scenarios and accuracy in Text-to-SQL conversions.
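To make the idea of ranking sets rather than individual elements concrete, here is a simplified sketch of coverage-driven selection. It is not the paper's objective: it drops the schema-graph connectivity term and uses a plain greedy heuristic over an assumed matrix of non-negative similarity scores, where `sim[k][d]` is the similarity between hallucinated element `k` and real schema element `d`. The function names are hypothetical.

```python
def coverage(selected, sim):
    """Sum, over hallucinated elements, of the best similarity to any selected element."""
    if not selected:
        return 0.0
    return sum(max(row[d] for d in selected) for row in sim)


def greedy_select(sim, budget):
    """Greedily pick up to `budget` real schema elements that maximize coverage.

    Simplification: the connectivity reward described in the text is omitted.
    """
    n_real = len(sim[0]) if sim else 0
    selected = []
    for _ in range(min(budget, n_real)):
        base = coverage(selected, sim)
        best, best_gain = None, 0.0
        for d in range(n_real):
            if d in selected:
                continue
            gain = coverage(selected + [d], sim) - base
            if gain > best_gain:
                best, best_gain = d, gain
        if best is None:  # no remaining element improves coverage
            break
        selected.append(best)
    return selected


# Rows: three hallucinated elements. Columns: three real schema elements.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.10, 0.70],
]
chosen = greedy_select(sim, budget=2)
```

In this toy matrix, real element 1 scores well against the first hallucinated element but adds nothing once element 0 is selected, so the greedy pass picks element 2 instead, which covers the otherwise unmatched third hallucinated element. Ranking elements by individual score alone would not make that trade-off.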

Empirical Validation of CRUSH4SQL

CRUSH4SQL is evaluated against existing retrieval methods on three newly introduced benchmarks: two semi-synthetic datasets derived from the schemas of the SPIDER and BIRD benchmarks, and SocialDB, which is sourced from an actual large-scale data warehouse. Across these benchmarks, CRUSH4SQL consistently achieves higher recall of the gold schema. For instance, on the challenging SocialDB dataset it improves recall by up to 6% over the state of the art at modest budgets. Moreover, the smaller retrieved schema subset directly improves Text-to-SQL translation accuracy in subsequent stages.
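Recall of the gold schema can be read as the fraction of schema elements used by the reference SQL that also appear in the retrieved subset. The helper below is a sketch under that assumption; the paper's exact metric definition may differ, and the function name is hypothetical.

```python
def schema_recall(retrieved, gold):
    """Fraction of gold schema elements present in the retrieved subset."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0  # convention assumed here: nothing to recall counts as full recall
    return len(gold_set & set(retrieved)) / len(gold_set)


# One of the two gold elements was retrieved.
example = schema_recall(["a.x", "b.y", "c.z"], ["a.x", "d.w"])
```

A missed gold element means the downstream Text-to-SQL model never sees a table or column it needs, which is why recall at a fixed budget is the quantity being compared.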

Conclusion

In summary, CRUSH4SQL represents a practical approach to text-to-SQL conversion for large databases. By pairing schema hallucination with careful subset retrieval, it saves computational resources without sacrificing recall or translation accuracy. Its results on both semi-synthetic and real-world benchmarks suggest potential for wider adoption in Text-to-SQL applications over databases with very large schemas.

GitHub

GitHub - iMayK/CRUSH4SQL (10 stars)