Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation
The paper "Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation" presents IRNet, a neural approach to synthesizing SQL queries from natural language (NL) questions in complex, cross-domain settings. It targets two primary obstacles: the mismatch between the intents expressed in NL and the syntactic structure SQL demands, and the difficulty of predicting columns when questions contain out-of-domain words unseen during training.
In traditional Text-to-SQL tasks, existing neural architectures achieve impressive accuracy by treating the task as an end-to-end translation problem. Despite their success on datasets like ATIS and WikiSQL, these approaches struggle on the Spider benchmark, which is characterized by its cross-domain scope and by intricate SQL queries involving nested subqueries and clauses such as GROUP BY and HAVING. Here, IRNet demonstrates tangible improvements, surpassing the previous state of the art on Spider by an absolute 19.5% in accuracy.
IRNet decomposes the SQL synthesis process around an intermediate representation, named SemQL, that bridges the semantic gap between NL input and SQL output. The approach involves three core phases:
- Schema Linking: A pre-processing step that identifies the tables and columns mentioned in the NL question, grounding the question in the schema and mitigating lexical mismatches between question words and schema names.
- Intermediate Representation (SemQL) Generation: A grammar-based neural model synthesizes SemQL, a domain-agnostic representation that abstracts over SQL's syntactic details. By deferring implementation-level constructs such as FROM, GROUP BY, and HAVING to a later deterministic step, the neural model only has to predict what the question actually expresses.
- Deterministic Inference: SemQL is translated into SQL via predefined, schema-driven rules, ensuring the syntactic and semantic validity of the final query while shrinking the hypothesis space the neural model must search.
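To make the first phase concrete, the following is a minimal sketch of schema linking via exact n-gram matching: question spans that match a table or column name are tagged with their type. The function name and the greedy longest-match strategy are illustrative assumptions; the paper's full procedure also handles partial matches and cell-value lookups, which are omitted here.

```python
def link_schema(question_tokens, tables, columns, max_n=4):
    """Tag n-grams of the question that exactly match a schema name.

    Returns a sorted list of (start, end, kind, phrase) spans. Simplified
    sketch: IRNet additionally recognizes partial matches and cell values
    via a knowledge base, and normalizes word forms (e.g., lemmatization).
    """
    tables_lc = {t.lower() for t in tables}
    cols_lc = {c.lower() for c in columns}
    spans, covered = [], set()
    n_tok = len(question_tokens)
    for n in range(max_n, 0, -1):  # prefer longer matches first
        for i in range(n_tok - n + 1):
            if any(j in covered for j in range(i, i + n)):
                continue  # token already claimed by a longer match
            phrase = " ".join(question_tokens[i:i + n]).lower()
            if phrase in tables_lc:
                spans.append((i, i + n, "TABLE", phrase))
            elif phrase in cols_lc:
                spans.append((i, i + n, "COLUMN", phrase))
            else:
                continue
            covered.update(range(i, i + n))
    return sorted(spans)
```

For the question "show name of singer" against a schema with table `singer` and columns `name`, `song id`, this tags "name" as a COLUMN span and "singer" as a TABLE span, which the downstream encoder can consume as type annotations.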
On the Spider benchmark, IRNet achieves a notable 46.7% accuracy, placing first on the leaderboard at the time. Augmenting the encoder with BERT further improves semantic comprehension, pushing performance to 54.7%.
Experiments highlight IRNet's ability to handle linguistic diversity in column prediction through schema linking, which proves beneficial across different models when questions contain out-of-domain vocabulary. A memory-augmented pointer network further mitigates the tendency to select the same column repeatedly, improving output reliability. Notably, SemQL demonstrates utility beyond IRNet, yielding significant accuracy gains when other neural models are adapted to produce it.
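The repeated-selection problem can be illustrated with a deliberately simplified stand-in for the paper's memory-augmented pointer network: rather than learning, via a gate, when to read a column from memory, this sketch simply hard-masks the scores of previously selected columns so the decoder cannot pick them twice. The function below is a hypothetical illustration, not the paper's mechanism.

```python
def select_column(scores, selected, penalty=-1e9):
    """Greedily pick the highest-scoring column, masking prior picks.

    `scores` are decoder scores over schema columns; `selected` is a
    mutable set recording columns chosen at earlier decoding steps.
    A large negative penalty removes already-selected columns from
    contention, preventing duplicate selections.
    """
    masked = [s if i not in selected else penalty
              for i, s in enumerate(scores)]
    best = max(range(len(masked)), key=lambda i: masked[i])
    selected.add(best)
    return best
```

With scores `[0.9, 0.5, 0.1]`, two successive calls return column 0 and then column 1, whereas an unmasked argmax would return column 0 both times.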
These findings suggest a shift in Text-to-SQL architectures, with intermediate languages serving as a robust interface layer for complex, cross-domain applications. Future work could refine the intermediate representation, incorporate more sophisticated semantic parsers, or integrate richer schema knowledge. More effective strategies for linking NL to schema components could also address the residual failure cases tied to unrecognized schema elements.
Overall, the paper presents a compelling advance in Text-to-SQL methodology, underscoring the value of intermediate representation languages for improving performance on complex queries.