Enhancing Factual Accuracy in LLMs with Data Commons Integration
Generative LLMs have shown substantial promise across natural language processing applications. However, their tendency to produce factually incorrect outputs, often termed "hallucinations," remains a critical challenge, especially for numerical and statistical questions. The paper "Knowing When to Ask - Bridging LLMs and Data" addresses this issue by proposing methodologies that enhance the factual reliability of LLMs by grounding them in Data Commons, a comprehensive open-source repository of public statistics from trusted organizations such as the United Nations and the CDC.
The proposed approach explores two primary methods: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG). In RIG, the LLM is fine-tuned to interleave natural language queries to Data Commons alongside the statistics it generates; a multi-model pipeline then converts these natural language queries into structured data requests and retrieves accurate statistics. In RAG, by contrast, relevant data tables are fetched from Data Commons first and used to augment the prompt given to the LLM, so that the generated response is grounded in verifiable data.
Key Methodologies
- Retrieval Interleaved Generation (RIG): RIG applies a tool-use-inspired approach in which the LLM is trained to recognize when and how to query an external database for accurate information. The model is fine-tuned on an instruction-response dataset so that, alongside each numerical value it generates, it also emits a natural language query for Data Commons; the retrieved statistic then supplements the model-generated number, yielding more accurate responses (see the RIG sketch after this list).
- Retrieval Augmented Generation (RAG): RAG enhances the LLM's context by supplying auxiliary data retrieved from Data Commons before generation. The first step generates queries that retrieve relevant statistical tables via the Data Commons natural language (NL) interface; these tables are then used to augment the LLM's input, ensuring an informed, data-driven response. A long-context LLM, such as Gemini 1.5 Pro, is employed so that the extensive retrieved tables, crucial for broad comparison tasks spanning many datasets, fit within the prompt (see the RAG sketch below).
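To make the RIG flow concrete, here is a minimal sketch in Python. It assumes the fine-tuned model emits inline markers of the form `[DC("query") || model_value]`; the marker syntax and the `resolve_and_fetch` helper (standing in for the paper's multi-model pipeline that translates a natural language query into a structured Data Commons request) are illustrative assumptions, not the paper's exact interface.

```python
import re

# Hypothetical marker the fine-tuned model emits next to each statistic, e.g.:
#   [DC("what was the unemployment rate in California in 2020?") || 8.5%]
MARKER = re.compile(r'\[DC\("(?P<query>[^"]+)"\)\s*\|\|\s*(?P<llm_value>[^\]]+)\]')

def resolve_and_fetch(query: str) -> str | None:
    """Stand-in for the multi-model pipeline that maps a natural language
    query to a structured Data Commons request (variable, place, date)
    and returns the retrieved statistic, or None if nothing matches."""
    raise NotImplementedError  # hypothetical; not part of the paper's code

def rig_postprocess(model_output: str) -> str:
    """Replace each model-generated value with the Data Commons statistic
    when retrieval succeeds; otherwise keep the model's own value."""
    def substitute(match: re.Match) -> str:
        fetched = resolve_and_fetch(match.group("query"))
        return fetched if fetched is not None else match.group("llm_value")
    return MARKER.sub(substitute, model_output)
```

Falling back to the model's own value when retrieval fails reflects the coverage limits discussed in the paper: Data Commons can only ground statistics it actually contains.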
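A corresponding RAG sketch, again in Python. The `nl_query_model`, `fetch_tables`, and `long_context_llm` callables are assumed placeholders for, respectively, the model that drafts Data Commons queries, retrieval via the NL interface, and a long-context model such as Gemini 1.5 Pro; none of these names come from the paper.

```python
def rag_answer(user_question: str,
               nl_query_model,     # drafts Data Commons queries from the question
               fetch_tables,       # calls the Data Commons NL interface, returns a table as text
               long_context_llm):  # long-context model (e.g. Gemini 1.5 Pro)
    """Augment the prompt with retrieved Data Commons tables, then generate."""
    # Step 1: have a model propose the statistical queries the question needs.
    dc_queries = nl_query_model(user_question)

    # Step 2: retrieve the matching tables via the Data Commons NL interface.
    tables = [fetch_tables(q) for q in dc_queries]

    # Step 3: prepend the serialized tables to the question and generate.
    context = "\n\n".join(t for t in tables if t)
    prompt = (
        "Answer using only the statistics in the tables below, citing them.\n\n"
        f"{context}\n\nQuestion: {user_question}"
    )
    return long_context_llm(prompt)
```

The long context window matters here because broad comparison questions can pull in many large tables, all of which must fit in a single prompt.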
Experimental Results
The researchers evaluate both methods on metrics such as factual accuracy, Data Commons data coverage, and statistical claim validation. RIG yielded an overall improvement in factual accuracy: Data Commons returned the correct statistic for a significantly higher percentage of queries than the LLM-generated values alone, with accuracy rising from a baseline of 5-17% to about 58%.
The RAG method, on the other hand, demonstrated high accuracy in citing statistical claims, with 24-29% of queries yielding statistically grounded responses when data tables from Data Commons were used. Importantly, incorporating verifiable data from trusted sources led evaluators to markedly prefer the data-enhanced responses over baseline model generations.
Implications and Future Directions
The integration of LLMs with structured databases represents a vital step toward increasing the trustworthiness and reliability of AI outputs, especially for tasks requiring precise numerical or statistical accuracy. Such advancements have practical implications in areas like policy analysis, economic modeling, and public health reporting, where data integrity is paramount.
Future work will focus on expanding the dataset coverage of Data Commons and improving its natural language interface to further refine these bridging methodologies. In addition, enlarging the fine-tuning dataset in both size and scope will likely lead to more generalized and capable systems. Further analysis will also explore user interfaces and experiences that maximize the utility and interpretability of data-backed LLM outputs.
This research contributes to the broader goal of developing AI systems that can effectively utilize external knowledge repositories to mitigate inaccuracies, thus paving the way for more responsible and informed applications of LLMs.