Extractive Schema Linking for Text-to-SQL (2501.17174v1)

Published 23 Jan 2025 in cs.DB, cs.AI, and cs.CL

Abstract: Text-to-SQL is emerging as a practical interface for real world databases. The dominant paradigm for Text-to-SQL is cross-database or schema-independent, supporting application schemas unseen during training. The schema of a database defines the tables, columns, column types and foreign key connections between tables. Real world schemas can be large, containing hundreds of columns, but for any particular query only a small fraction will be relevant. Placing the entire schema in the prompt for an LLM can be impossible for models with smaller token windows and expensive even when the context window is large enough to allow it. Even apart from computational considerations, the accuracy of the model can be improved by focusing the SQL generation on only the relevant portion of the database. Schema linking identifies the portion of the database schema useful for the question. Previous work on schema linking has used graph neural networks, generative LLMs, and cross encoder classifiers. We introduce a new approach to adapt decoder-only LLMs to schema linking that is both computationally more efficient and more accurate than the generative approach. Additionally our extractive approach permits fine-grained control over the precision-recall trade-off for schema linking.

Summary

The paper introduces ExSL, an extractive schema linking method that leverages decoder-only LLMs to probabilistically identify relevant schema elements.
It reports higher schema-linking precision and recall on the Spider dataset than prior approaches, and can additionally assign fine-grained roles to columns in the SQL query.
The approach enhances execution accuracy in Text-to-SQL systems by efficiently filtering out irrelevant schema information.

Extractive Schema Linking for Text-to-SQL

Introduction

The paper "Extractive Schema Linking for Text-to-SQL" (2501.17174) presents a new approach to schema linking in Text-to-SQL systems that builds on decoder-only LLMs. Text-to-SQL interfaces translate users' natural language questions into Structured Query Language (SQL), so that people can query databases without SQL expertise. The research addresses the challenge posed by large database schemas: a schema may contain hundreds of columns, yet only a small fraction is relevant to any particular question. Restricting SQL generation to the relevant portion reduces computational cost and improves accuracy.

Figure 1: Text-to-SQL System Architecture Overview.

Earlier Text-to-SQL systems were schema-dependent, which limited their applicability to databases and schemas unseen during training. The now-dominant schema-independent (cross-database) systems remove that restriction. Previous schema linking methods used graph neural networks, cross-encoder classifiers, or generative LLMs. Systems such as RAT-SQL and RESDSQL used encoder-based methods to classify schema elements, while generative models fine-tuned for schema linking were computationally more expensive and struggled to balance precision and recall. In all cases, large schemas remain a problem: the full schema may not fit in a smaller context window and is costly to process even when it does.

Methodology

The proposed method, Extractive Schema Linking (ExSL), uses a decoder-only LLM to estimate a relevance probability for each schema element without generating new tokens. Schema linking is thus modeled as an extractive task rather than a generative one. The schema and the natural language question are processed together by the LLM, which outputs a probability that each schema element is relevant to the question.
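The idea of scoring schema elements without generating tokens can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the LLM has already produced one hidden-state vector per prompt token, that each schema element is represented by the hidden state of its last token, and that a linear head followed by a sigmoid turns that vector into a probability. The function name, the pooling choice, the head, and the toy numbers are all hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_schema_elements(
    hidden_states: Sequence[Sequence[float]],
    element_spans: Dict[str, Tuple[int, int]],
    head_weights: Sequence[float],
    head_bias: float = 0.0,
) -> Dict[str, float]:
    """Return one relevance probability per schema element.

    hidden_states: one vector per prompt token (stand-in for LLM outputs).
    element_spans: element name -> (start, end) token indices, end exclusive.
    head_weights, head_bias: a hypothetical linear classification head.
    """
    scores = {}
    for name, (start, end) in element_spans.items():
        # Assumption: pool by taking the last token of the element's span.
        vec = hidden_states[end - 1]
        logit = sum(w * h for w, h in zip(head_weights, vec)) + head_bias
        scores[name] = sigmoid(logit)
    return scores


# Toy input: four prompt tokens with 2-dimensional hidden states.
hidden = [[0.0, 0.0], [2.0, 1.0], [-3.0, 0.5], [0.1, 0.1]]
spans = {"singer.name": (0, 2), "singer.age": (2, 3)}
scores = score_schema_elements(hidden, spans, head_weights=[1.0, 0.0])
```

All elements are scored from a single forward pass over the prompt, which is where the efficiency advantage over token-by-token generation would come from under these assumptions.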

Figure 2: Ground truth generation for schema linking.

Further, fine-grained schema linking is achieved by assigning each column a probability for each role it can play in the SQL query: selection, joining, condition, ordering, and grouping. These role predictions aid SQL generation by telling the LLM not only which columns are relevant but also how each one is expected to be used in the query.
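One way to picture fine-grained linking is as a table of per-role probabilities that is turned into hints for the SQL generator. The sketch below is an assumption about how such output could be consumed, not the format used in the paper: the role labels, the `render_role_hints` helper, the 0.5 threshold, and the example probabilities are hypothetical.

```python
from typing import Dict, List

# Role labels follow the five roles named in the text; the spelling is ours.
ROLES = ("select", "join", "condition", "order", "group")


def render_role_hints(
    role_probs: Dict[str, Dict[str, float]],
    threshold: float = 0.5,
) -> List[str]:
    """Turn per-role probabilities into one hint line per retained column."""
    lines = []
    for column, probs in role_probs.items():
        # Keep only the roles whose probability reaches the threshold.
        roles = [r for r in ROLES if probs.get(r, 0.0) >= threshold]
        if roles:  # columns with no confident role are dropped entirely
            lines.append(f"{column}: {', '.join(roles)}")
    return lines


role_probs = {
    "singer.name": {"select": 0.93, "condition": 0.10},
    "singer.country": {"condition": 0.81, "group": 0.55},
    "concert.year": {"order": 0.20},
}
hints = render_role_hints(role_probs)
# hints == ["singer.name: select", "singer.country: condition, group"]
```

The resulting lines could be appended to the pruned schema in the generation prompt, so the generator sees both the surviving columns and their predicted uses.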

Experiments

Empirical evaluations on the Spider dataset and its variants show that ExSL outperforms previous methods in both precision and recall for schema linking. The experiments compared generative, encoder-based, and extractive approaches to schema linking. Among these, the extractive approach yielded the largest gains in execution accuracy across SQL queries of varying complexity and type.

Figure 3: Sensitivity of SQL Generation performance on Spider Dev to schema linking threshold.
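Because the extractive linker outputs a probability per schema element, the precision-recall trade-off reduces to choosing a threshold. The following sketch shows that mechanism on made-up numbers; the helper names, probabilities, and gold set are illustrative assumptions and are not results from the paper.

```python
from typing import Dict, Set, Tuple


def select_columns(probs: Dict[str, float], threshold: float) -> Set[str]:
    """Keep every column whose relevance probability reaches the threshold."""
    return {c for c, p in probs.items() if p >= threshold}


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted column set against the gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical linker output and gold columns for one question.
probs = {
    "singer.name": 0.95,
    "singer.country": 0.60,
    "singer.age": 0.30,
    "concert.year": 0.20,
    "concert.id": 0.05,
}
gold = {"singer.name", "singer.country", "concert.year"}

sweep = {
    t: precision_recall(select_columns(probs, t), gold)
    for t in (0.1, 0.5, 0.9)
}
# sweep[0.1] == (0.75, 1.0): a low threshold keeps every gold column
# sweep[0.5] == (1.0, 2/3): a higher threshold drops noise but loses a column
```

A low threshold favors recall, which matters when a missing column makes the correct SQL impossible to write, while a high threshold favors precision and a shorter prompt.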

Additionally, ExSL's fine-grained linking shows potential in boosting SQL generation accuracy beyond coarse linking methods, indicating its robust applicability in practical scenarios where queries involve complex conditions and multiple schema interactions.

Implications and Future Work

ExSL advances efficient and accurate schema linking, which is essential for scalable Text-to-SQL systems. It reuses decoder-only LLMs that are already familiar with code syntax and structure, and because it scores schema elements without generating tokens, it reduces resource demands while improving the accuracy of database interaction.

Future research could explore optimizations in schema linking thresholds and assess real-world deployment in enterprise-scale databases. As the need for seamless database querying grows in various domains, the principles laid out in this paper offer pivotal enhancements for broader application, including adaptive schema linking across evolving databases.

Conclusion

Extractive Schema Linking (ExSL) addresses inefficiencies in Text-to-SQL systems by performing schema linking with decoder-only LLMs. Its shift from generative to extractive schema linking, coupled with fine-grained control over the precision-recall trade-off, improves precision, recall, and execution accuracy over the approaches it was compared with. By filtering the schema before SQL generation, the approach supports more accurate queries at lower computational cost for real-world databases.