UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification (2505.18122v1)

Published 23 May 2025 in cs.CL

Abstract: Recent advances in LLMs have greatly improved Text-to-SQL performance for single-table queries. But, it remains challenging in multi-table databases due to complex schema and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.

Summary

The paper presents a two-stage UNJOIN framework that improves multi-table text-to-SQL generation by simplifying complex database schemas into a single-table representation.
The methodology first merges all columns into a single table representation, then translates the simplified query back to the original multi-table schema with reconstructed joins.
Evaluation shows the method achieves state-of-the-art performance on datasets like Spider and BIRD without needing data access or fine-tuning, making it scalable and adaptable.

Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

The paper presents a novel approach to improve text-to-SQL conversion in multi-table databases by addressing the challenges posed by complex schemas and relational operations. The proposed framework seeks to enhance the retrieval accuracy of schema elements and improve SQL logic generation through a two-stage process involving schema simplification followed by query translation.

Background

Text-to-SQL systems enable natural language interfaces to query relational databases, making database access more accessible to non-experts. Previous research, leveraging advances in LLMs, has successfully improved the performance of text-to-SQL systems on single-table databases. However, multi-table databases introduce significant challenges, such as the complex identification of relevant tables and columns, generation of accurate JOINs and UNIONs, and generalization across diverse schemas.

Methodology

The paper proposes a two-stage framework:

Schema Simplification: The process begins by merging the column names of all tables within a database into a single-table representation. Each column is prefixed with its respective table name, allowing the model to focus exclusively on accurate column retrieval without being deterred by complex SQL logic requirements. This stage effectively eliminates the need for the model to understand multi-table relationships and reduces potential errors in column identification.
Query Generation and Translation: In the second stage, a SQL query is generated on the simplified schema. This intermediate query is then mapped back to the original multi-table schema by reconstructing JOINs, UNIONs, and relational logic. This two-step process disambiguates decision-making and enables more accurate, scalable, and generalizable SQL query generation.

Evaluation and Results

The evaluation on the Spider and BIRD datasets demonstrates that the proposed method either matches or exceeds state-of-the-art baselines in performance measures. This approach leverages exclusively schema information without requiring data access or fine-tuning, thereby making the solution adaptable and scalable across various database contexts.

Implications and Future Directions

The implications of this work are significant for both practical and theoretical advancements in AI-driven query systems. Practically, the proposed framework simplifies the complex task of translating natural language queries into SQL in multi-table environments, reducing the barrier of entry for users unfamiliar with database languages. Theoretically, it suggests a promising direction for modular frameworks in AI, highlighting the ability to decouple different cognitive processes (i.e., schema retrieval and logic generation) to improve overall performance.

Looking forward, this research opens up several avenues for further exploration, including refining schema simplification techniques and exploring their application to other formats of structured and semi-structured data. Future developments may involve more sophisticated data representations for handling hierarchical and deeply nested schemas typical in comprehensive real-world applications.

In conclusion, the paper's framework provides a notable step forward in improving the interaction between natural language processing and structured data, emphasizing the importance of simplified yet effective schema representations for advancing LLM capabilities in complex SQL generation tasks.