SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models (2311.09818v2)
Abstract: While most conversational agents are grounded on either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with free-text primitives (summary and answer), so information retrieval can be composed with structured data accesses arbitrarily in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser, an LLM with in-context learning, that can handle hybrid data sources. Our in-context learning-based approach, when applied to the HybridQA dataset, comes within 8.9% exact match and 7.1% F1 of the SOTA, which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization.
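To make the idea of composing free-text primitives with structured access concrete, here is a minimal sketch, not the authors' implementation. It registers two user-defined SQL functions named after the primitives in the abstract, `summary` and `answer`, in an in-memory SQLite database. Everything else is an assumption made for illustration: the `restaurants` table and its columns, the argument order of the two functions, and above all the keyword heuristics that stand in for the LLM-backed behavior described in the paper.

```python
import sqlite3

# Hypothetical stand-ins for the free-text primitives. The paper backs these
# with an LLM; naive string heuristics are used here so the sketch runs offline.
STOPWORDS = {"is", "this", "restaurant", "the", "a", "does", "it", "have"}


def answer(text: str, question: str) -> str:
    """Toy 'answer': 'Yes' if every content word of the question occurs in the text."""
    words = [w.strip("?.,!") for w in question.lower().split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return "Yes" if content and all(w in text.lower() for w in content) else "No"


def summary(text: str) -> str:
    """Toy 'summary': return only the first sentence of the text."""
    return text.split(".")[0].strip() + "."


conn = sqlite3.connect(":memory:")
# Expose the Python functions to SQL under the primitive names.
conn.create_function("answer", 2, answer)
conn.create_function("summary", 1, summary)

# Illustrative schema: structured columns plus one free-text column.
conn.execute(
    "CREATE TABLE restaurants (name TEXT, cuisine TEXT, rating REAL, reviews TEXT)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Luigi's", "italian", 4.5,
         "Great pasta and very family-friendly. Kids menu available."),
        ("Trattoria Notte", "italian", 4.7,
         "Romantic and quiet. Not ideal for children."),
        ("Taco Casa", "mexican", 4.2,
         "Loud, fun, and family-friendly. Big portions."),
    ],
)

# A structured filter (cuisine) composed with a free-text filter (answer)
# and a free-text projection (summary) in a single query.
query = """
    SELECT name, summary(reviews)
    FROM restaurants
    WHERE cuisine = 'italian'
      AND answer(reviews, 'is this restaurant family-friendly?') = 'Yes'
    ORDER BY rating DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In this toy setup the query returns only the Italian restaurant whose review text mentions being family-friendly, together with the first sentence of its review. The point of the sketch is the composition: the free-text predicate sits in the same `WHERE` clause as an ordinary column filter, so a single formal query expresses both kinds of constraint.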