"According to ...": Prompting Language Models Improves Quoting from Pre-Training Data (2305.13252v2)

Published 22 May 2023 in cs.CL and cs.AI

Abstract: LLMs may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of "according to sources", we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-produced answers are directly found in underlying text corpora. We illustrate with experiments on three corpora (Wikipedia, PubMed, and the U.S. legal tax code) that these prompts improve grounding under our metrics, with the additional benefit of often improving end-task performance. Furthermore, prompts that ask the model to decrease grounding (or to ground to other corpora) indeed decrease QUIP-Score, indicating the ability of LLMs to increase or decrease grounded generations on request.

Overview of "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data

This paper explores methods for steering LLMs toward generating more factually grounded content by leveraging their pre-training data. Using a technique referred to as "according-to prompting," the researchers aim to mitigate the common problem of LLMs fabricating information, known as hallucination. The proposed method prompts LLMs with phrases such as "According to Wikipedia," encouraging the models to ground their responses in specific sources within their pre-training data.
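As a concrete illustration, a grounding prompt can be built by appending a short grounding (or anti-grounding) instruction to the question. The sketch below is a minimal example under assumptions: the exact prompt wording used in the paper may differ, and `query_model` is a hypothetical placeholder for whichever LLM API is used.

```python
# Minimal sketch of according-to prompting.
# The instruction wording here is illustrative, not the paper's exact phrasing.

GROUNDING_SUFFIX = " Respond using only information that can be attributed to Wikipedia."
ANTI_GROUNDING_SUFFIX = " Respond without directly quoting text from Wikipedia."

def build_prompt(question: str, grounding: bool = True) -> str:
    """Append a grounding (or anti-grounding) instruction to a question."""
    suffix = GROUNDING_SUFFIX if grounding else ANTI_GROUNDING_SUFFIX
    return question.strip() + suffix

# Hypothetical usage; query_model() stands in for an actual LLM API call.
prompt = build_prompt("In what year did the US pass the Civil Rights Act?")
# answer = query_model(prompt)
```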

To quantitatively evaluate this approach, the researchers introduce a new metric dubbed the QUIP-Score. This metric measures the extent of n-gram overlap between model-generated outputs and text found in previously observed corpora, serving as a proxy for how well LLM responses are grounded in their training data. The paper's experiments show that, under according-to prompts, LLM outputs overlap more with the target corpora (Wikipedia, PubMed articles, and the U.S. legal tax code), with QUIP-Scores reflecting these improvements.
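Conceptually, a QUIP-Score-style measurement is the fraction of n-grams in a generation that also appear in the reference corpus. The sketch below illustrates that calculation with an in-memory set of corpus n-grams; the n-gram granularity and the membership backend are simplifying assumptions, since the paper queries a Bloom-filter-based Data Portrait over character n-grams.

```python
# Illustrative QUIP-Score-style computation: the fraction of a generation's
# n-grams that are also found in the reference corpus. The word-level 5-grams
# and the in-memory set are assumptions made for this sketch.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def quip_like_score(generation: str, corpus_ngrams: set, n: int = 5) -> float:
    grams = ngrams(generation.split(), n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in corpus_ngrams)
    return hits / len(grams)

# Toy example: build the corpus n-gram set from a small text sample.
corpus_text = "the civil rights act of 1964 was signed into law in july 1964"
corpus_ngrams = set(ngrams(corpus_text.split(), 5))
print(quip_like_score("the civil rights act of 1964 was passed", corpus_ngrams))  # 0.75
```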

Numerical Results

Empirical evaluations demonstrate that grounding prompts increase QUIP-Score across a variety of datasets, with improvements ranging from 5% to 105% depending on the model and dataset. A standout finding is the versatility of the grounding prompts, which not only increase QUIP-Score but occasionally enhance downstream task performance as well. The experiments cover a diverse set of corpora and demonstrate the method's applicability across domains. Notably, variations of the grounding prompts consistently yield higher QUIP-Scores than anti-grounding prompts, affirming the models' ability to tune their responses according to specific instructions.

Methodological Insights

The methodology of the paper is rooted in leveraging n-gram overlaps to assess content groundedness, employing a Bloom filter-based Data Portrait for efficient membership querying over large corpora. This choice highlights the importance of computational efficiency when handling vast datasets with minimal memory usage. The grounding method, applied to instruction-tuned LLMs, is theorized to exploit the models' memorization capabilities to generate more factual outputs.
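To make the efficiency argument concrete, the snippet below sketches how a Bloom filter supports constant-time, memory-efficient n-gram membership queries. The hash scheme and parameters are illustrative assumptions and do not reproduce the actual Data Portraits implementation.

```python
import hashlib

# Minimal Bloom-filter sketch for n-gram membership queries.
# Parameters and hashing scheme are illustrative only.
class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Index corpus n-grams once, then query generated n-grams in constant time.
bf = BloomFilter()
bf.add("the civil rights act of")
print("the civil rights act of" in bf)    # True
print("a completely novel phrase" in bf)  # almost certainly False
```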

The paper's findings underscore that larger models, which tend to memorize pre-training data more effectively, exhibit further gains when subjected to according-to prompts. Moreover, the consistency of improvements across different model scales and architectures points to the robustness of the proposed approach. The analysis also shows a correlation between entity popularity within pre-training data and QUIP-Scores, suggesting that the frequency of entity occurrences may enhance the recall of memorized information.

Implications and Future Developments

The implications of this research are substantial for improving factual accuracy in LLM-generated content. This paper complements existing works that focus on retrieval-augmented generation by providing a framework that enhances the factual nature of pre-training data recall without necessitating external retrieval systems. The ability to systematically steer LLM generations towards grounded content via prompting offers potential in various applications, such as enhancing the reliability of autonomous systems, improving user trust in AI outputs, and fostering fact-checking endeavors.

Looking forward, integrating similar grounding strategies into future LLM designs could further enhance their reliability, especially as existing gaps in semantic understanding are addressed. This paper lays a foundation for more precise grounding metrics and techniques that could improve LLM performance across diverse scenarios. Further work could also combine according-to prompts with dynamic adaptation based on user feedback or with context-aware generation methods. As such, this research contributes meaningfully to the growing discourse on responsible AI development.

Authors (6)
  1. Orion Weller (30 papers)
  2. Marc Marone (11 papers)
  3. Nathaniel Weir (17 papers)
  4. Dawn Lawrie (30 papers)
  5. Daniel Khashabi (83 papers)
  6. Benjamin Van Durme (173 papers)
Citations (36)