LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Published 24 Mar 2025 in cs.CL | (2503.18596v4)

Abstract: Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling balancing efficiency and performance via modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The codes are available at https://github.com/Satissss/LinkAlign


Summary

The paper introduces LinkAlign, a framework that resolves schema linking challenges in large-scale, multi-database Text-to-SQL tasks.
It employs multi-round semantic retrieval and irrelevant information isolation to accurately select databases and extract precise schema components.
Experimental results on datasets like Spider and AmbiDB demonstrate LinkAlign’s state-of-the-art performance using open-source LLMs.

Introduction

"LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL" (2503.18596) addresses the challenge of schema linking in large-scale, multi-database environments for Text-to-SQL tasks. The paper identifies Database Retrieval and Schema Item Grounding as critical bottlenecks and proposes a framework named LinkAlign to resolve these issues. It offers a novel approach involving multi-round semantic retrieval, isolation of irrelevant schemas, and enhanced schema extraction. The framework aims to streamline the schema linking process, ultimately improving the accuracy of SQL generation in complex database settings.

Schema Linking Challenges

The difficulty in schema linking primarily stems from two challenges. Database Retrieval involves selecting the correct database from a large set, often containing redundant schemas. Effective retrieval must isolate relevant databases while filtering out noise. Schema Item Grounding focuses on accurately identifying the necessary tables and columns within complex schemas. Both tasks are essential for generating accurate SQL queries from natural language inputs.

Figure 1: Overview of the LinkAlign framework including three core components.

LinkAlign Framework

LinkAlign tackles these challenges through a modular approach divided into three key steps. Semantic Enhanced Retrieval focuses on Database Retrieval, employing query rewriting to infer missing schemas and enhance semantic alignment. This step dynamically adjusts retrieval strategies based on feedback, ensuring efficient recall of relevant databases. Irrelevant Information Isolation aims to eliminate schema noise, refining the target database localization process. Schema Extraction Enhancement scales schema linking by precisely identifying tables and columns necessary for SQL generation using advanced reasoning techniques.
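
The three steps above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: token overlap stands in for semantic retrieval, a small hypothetical synonym table (`SYNONYMS`) stands in for LLM-based query rewriting, and every name here (`Database`, `retrieve`, `isolate`, `extract`, `link_schema`) is invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    tables: dict  # table name -> list of column names


def tokens(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, db):
    """Toy relevance: number of query tokens found in the schema's names."""
    schema = set()
    for table, cols in db.tables.items():
        schema |= tokens(table)
        for col in cols:
            schema |= tokens(col)
    return len(tokens(query) & schema)


# Hypothetical stand-in for an LLM rewriter that infers missing schema terms.
SYNONYMS = {"clients": "customer", "purchases": "orders"}


def rewrite(query):
    extra = sorted(SYNONYMS[w] for w in tokens(query) if w in SYNONYMS)
    return query + " " + " ".join(extra)


def retrieve(question, pool, top_k=2, max_rounds=2):
    """Step 1: rank databases; rewrite the query and retry if nothing matches."""
    query = question
    for round_no in range(max_rounds):
        ranked = sorted(pool, key=lambda db: score(query, db), reverse=True)
        if score(query, ranked[0]) > 0 or round_no == max_rounds - 1:
            break
        query = rewrite(query)
    return ranked[:top_k], query


def isolate(query, candidates):
    """Step 2: drop weaker candidates and keep one target database."""
    best = max(score(query, db) for db in candidates)
    return [db for db in candidates if score(query, db) == best][0]


def extract(query, db):
    """Step 3: keep tables and columns whose names overlap with the query."""
    q = tokens(query)
    linked = {}
    for table, cols in db.tables.items():
        hit_cols = [c for c in cols if tokens(c) & q]
        if hit_cols or tokens(table) & q:
            linked[table] = hit_cols
    return linked


def link_schema(question, pool):
    candidates, query = retrieve(question, pool)
    target = isolate(query, candidates)
    return target.name, extract(query, target)


pool = [
    Database("retail", {"customer": ["id", "name"],
                        "orders": ["id", "customer_id", "total"]}),
    Database("hr", {"employee": ["id", "name", "salary"]}),
    Database("library", {"book": ["id", "title"], "loan": ["id", "book_id"]}),
]
result = link_schema("Which clients made the most purchases?", pool)
print(result)
```

In this toy run the original question shares no tokens with any schema, so the first retrieval round finds nothing; only after the rewrite adds the inferred terms does the `retail` database surface, which is the kind of gap the query-rewriting step is described as closing.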

Figure 2: The impact on Error Rates.

Each component offers two execution paradigms—Pipeline and Agent. The Pipeline mode emphasizes efficiency, making it suitable for real-time applications, while the Agent mode prioritizes accuracy through collaborative multi-turn reasoning.

Experimental Evaluation

The framework's effectiveness is validated through error analysis and evaluations on datasets such as Spider, BIRD, and AmbiDB. AmbiDB is a synthetic dataset constructed by the authors to reflect the ambiguity of real-world schema linking in large-scale, multi-database scenarios. Experiments show that LinkAlign consistently outperforms existing baselines on all schema linking metrics and improves the overall Text-to-SQL pipeline. Notably, it achieves a score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission.
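
The summary does not name the schema linking metrics used. As an illustration only, set-based precision, recall, and F1 over predicted versus gold schema items are one common way to score schema linking; the function name `linking_scores` and the example column identifiers below are assumptions made for this sketch and are not taken from the paper.

```python
def linking_scores(predicted, gold):
    """Set-based precision, recall, and F1 over schema items.

    Items are strings such as "table.column". This is a generic scoring
    scheme chosen for illustration, not the paper's exact metric definition.
    """
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical prediction: one correct column, two spurious, one missed.
predicted = ["orders.customer_id", "orders.total", "customer.name"]
gold = ["orders.customer_id", "customer.id"]
precision, recall, f1 = linking_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low precision here corresponds to redundant schema items passed on to SQL generation, while low recall corresponds to required tables or columns being dropped before the SQL is written.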

Figure 3: Error Distribution in Failed Cases.

Implications and Future Directions

LinkAlign sets a new precedent in schema linking for complex Text-to-SQL applications, demonstrating improvements in handling ambiguous and large-scale database schemas. The framework could ease the integration of LLMs in enterprise settings, potentially automating the translation of natural language to SQL with greater precision. Future work may explore further enhancements to the retrieval strategies or the integration of more sophisticated reasoning capabilities to handle increasingly complex queries and schemas.

Conclusion

The LinkAlign framework advances Text-to-SQL by addressing the two main schema linking challenges in large-scale, multi-database environments: Database Retrieval and Schema Item Grounding. Its modular design, with Pipeline and Agent modes that trade off efficiency against accuracy, makes it a practical option for real-world applications. Frameworks like LinkAlign may enable more intuitive and precise data retrieval across diverse database systems.