
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models (2311.09818v2)

Published 16 Nov 2023 in cs.CL and cs.PL

Abstract: While most conversational agents are grounded on either free-text or structured knowledge, many knowledge corpora consist of hybrid sources. This paper presents the first conversational agent that supports the full generality of hybrid data access for large knowledge corpora, through a language we developed called SUQL (Structured and Unstructured Query Language). Specifically, SUQL extends SQL with free-text primitives (summary and answer), so information retrieval can be composed with structured data accesses arbitrarily in a formal, succinct, precise, and interpretable notation. With SUQL, we propose the first semantic parser, an LLM with in-context learning, that can handle hybrid data sources. Our in-context learning-based approach, when applied to the HybridQA dataset, comes within 8.9% exact match and 7.1% F1 of the SOTA, which was trained on 62K data samples. More significantly, unlike previous approaches, our technique is applicable to large databases and free-text corpora. We introduce a dataset consisting of crowdsourced questions and conversations on Yelp, a large, real restaurant knowledge base with structured and unstructured data. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a baseline based on linearization.


Summary

  • The paper presents SUQL as an innovative query language that augments SQL with summary and answer primitives to query hybrid data effectively.
  • It demonstrates an LLM-based semantic parser using few-shot learning to convert conversational queries into precise SUQL commands.
  • Experimental results on the HybridQA and Yelp datasets show competitive performance, including a 90.3% success rate at finding entities that satisfy all user requirements in real-world restaurant queries.

SUQL: Conversational Search over Hybrid Data with LLMs

The paper introduces SUQL, a novel query language bridging conversational agents and hybrid data sources. SUQL extends traditional SQL by integrating free-text primitives, summary and answer, enabling seamless access to both structured and unstructured data. This approach improves large language models' (LLMs) ability to interpret and execute complex queries that combine structured and unstructured data, such as those found in extensive knowledge corpora.
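As an illustration, a user request like "find a quiet Italian restaurant and tell me what reviewers say about it" might compile to a SUQL query of the following shape. The schema and column names here are hypothetical; answer and summary are the free-text primitives the paper adds to SQL:

```sql
-- Hypothetical schema; answer() and summary() are SUQL's free-text primitives.
SELECT name, summary(reviews)
FROM restaurants
WHERE cuisine = 'italian'
  AND answer(reviews, 'is this restaurant quiet?') = 'Yes'
LIMIT 1;
```

The structured predicate (cuisine = 'italian') is evaluated by the database as ordinary SQL, while the answer predicate invokes free-text retrieval over the reviews column, which is what lets a single query compose both kinds of access.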

Key Contributions

  1. SUQL Design: SUQL is a syntactically precise and semantically expressive language that augments SQL with the ability to operate on free text. The new primitives, summary and answer, allow users to pose queries that naturally span structured and unstructured data attributes. This offers a clear advantage over traditional methods like linearization, where data complexity and performance issues may arise.
  2. Semantic Parsing with LLMs: The paper demonstrates the capability of an in-context learning-based semantic parser to effectively translate conversational queries into SUQL, capitalizing on LLMs' familiarity with SQL syntax. The parser utilizes few-shot learning, reducing the need for extensive training datasets.
  3. Performance Validation: Experiments conducted on the HybridQA dataset show that the proposed approach achieves competitive results, with a performance margin close to the state-of-the-art without extensive training data. Specifically, SUQL-based systems achieve 59.0% exact match and 68.4% F1 on the test set.
  4. Real-world Applicability: The introduction of a dataset of crowdsourced conversations about real restaurants (from Yelp) further supports SUQL's practicality and robustness. The SUQL-based conversational agent finds an entity satisfying all user requirements 90.3% of the time, compared to 63.4% for a linearization-based baseline.
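The few-shot semantic parsing setup in point 2 above can be sketched as a simple prompt-construction routine. This is a minimal sketch under stated assumptions: the example utterance/SUQL pairs, the prompt format, and the function name build_prompt are all illustrative, not the paper's actual prompt or code.

```python
# Sketch of few-shot (in-context learning) semantic parsing into SUQL.
# The example pairs and prompt layout are illustrative assumptions,
# not the paper's actual prompt.

FEW_SHOT_EXAMPLES = [
    (
        "Find me an Italian restaurant with a romantic vibe.",
        "SELECT name FROM restaurants WHERE cuisine = 'italian' "
        "AND answer(reviews, 'is this restaurant romantic?') = 'Yes' LIMIT 1;",
    ),
    (
        "What do people say about the service at Joe's Diner?",
        "SELECT summary(reviews) FROM restaurants WHERE name = 'Joe''s Diner';",
    ),
]

def build_prompt(user_utterance: str) -> str:
    """Assemble a few-shot prompt mapping user utterances to SUQL queries."""
    parts = ["Translate each user request into a SUQL query.\n"]
    for utterance, suql in FEW_SHOT_EXAMPLES:
        parts.append(f"User: {utterance}\nSUQL: {suql}\n")
    # End with the new utterance and an open cue for the LLM to complete.
    parts.append(f"User: {user_utterance}\nSUQL:")
    return "\n".join(parts)

prompt = build_prompt("Show me a sushi place that people say is kid-friendly.")
print(prompt)
```

The completed SUQL string returned by the LLM would then be executed against the database, with the free-text primitives dispatched to retrieval; because the target language stays close to SQL, the parser can lean on the LLM's existing familiarity with SQL syntax.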

Numerical and Experimental Insights

  • On HybridQA, the in-context learning-based parser comes within 8.9% exact match and 7.1% F1 of the state of the art, which was trained on 62K samples, while requiring only a handful of prompt examples.
  • On the Yelp dataset, the SUQL-based agent satisfies all user requirements 90.3% of the time versus 63.4% for the linearization-based baseline, with the gap widest on complex queries that mix structured filters with free-text constraints.

Implications and Future Directions

The introduction of SUQL offers several practical and theoretical implications. By enabling precise, interpretable access to hybrid data, SUQL could transform how conversational agents interact with diverse databases, with applications in industries like healthcare, finance, and customer service. The framework supports queries over hybrid data that prior methods, restricted to either structured or free-text sources, could not express.

Future research directions could explore the refinement of SUQL's semantic parsing capabilities and its adaptation to more domain-specific tasks (e.g., legal or biomedical databases). Additionally, the potential integration of SUQL into industry-standard databases could foster more intuitive user-database interaction paradigms, facilitating seamless, natural language-based data management.

In conclusion, SUQL positions itself as a crucial development in conversational agent technologies, promising enhanced expressiveness and accuracy in querying hybrid data environments. Its design and implementation underscore a significant step forward in harnessing the full potential of LLMs in real-world applications.
