Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training (2012.10309v1)
Abstract: Recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.
- Peng Shi
- Patrick Ng
- Zhiguo Wang
- Henghui Zhu
- Alexander Hanbo Li
- Jun Wang
- Cicero Nogueira dos Santos
- Bing Xiang