
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data (2404.03862v3)

Published 5 Apr 2024 in cs.CL

Abstract: To trust the fluent generations of LLMs, humans must be able to verify their correctness against trusted, external sources. Recent efforts, such as providing citations via retrieved documents or post-hoc provenance, enhance verifiability but provide no guarantees on their correctness. To address these limitations, we tackle the verifiability goal with a different philosophy: trivializing the verification process by developing models that quote verbatim statements from trusted sources in their pre-training data. We propose Quote-Tuning, which demonstrates the feasibility of aligning models to quote. The core of Quote-Tuning is a fast membership inference function that efficiently verifies text against trusted corpora. We leverage this tool to design a reward function to quantify quotes in model responses, and curate datasets for preference learning. Experiments show that Quote-Tuning significantly increases verbatim quotes from high-quality documents by up to 130% relative to base models while maintaining response quality. Quote-Tuning is applicable in different tasks, generalizes to out-of-domain data and diverse model families, and provides additional benefits to truthfulness. Our method not only serves as a hassle-free method to increase quoting but also opens up avenues for improving LLM trustworthiness through better verifiability.

Enhancing Verifiability in LLMs through Quote-Tuning

Introduction

LLMs have evolved remarkably in recent years, generating human-like text across a vast array of domains. However, this proficiency comes with a fundamental caveat: these models often generate plausible yet factually incorrect content, a phenomenon commonly referred to as hallucination. Addressing this issue requires mechanisms that can verify generated content against trusted sources. Traditional approaches augment LLMs with external citation mechanisms or enrich post-generation content with provenance data. These strategies, while useful, often fall short because they can introduce errors or irrelevant citations that further complicate verification.

Methodology

In contrast to prior methods, this research proposes Quote-Tuning, an approach that simplifies verification by aligning models to quote verbatim statements from trusted sources in their pre-training data. Quote-Tuning leverages the inherent capability of LLMs to memorize and recall information from their voluminous training data, including trustworthy sources such as Wikipedia. Using an efficient membership inference tool that quantifies how much of a generation is quoted verbatim from a large corpus, the method treats the quotability of an LLM output as an implicit reward signal. It constructs a synthetic preference dataset favoring quoting without requiring human annotation, and the model is then fine-tuned with preference optimization techniques to align its generations toward quoting.
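To make the reward signal concrete, the sketch below scores a response by the fraction of its word positions covered by n-grams that appear verbatim in a trusted corpus. This is a simplified stand-in for the paper's fast membership inference function; the plain-set index and the `build_ngram_index` / `quote_reward` helper names are hypothetical, and the real system uses a much more memory-efficient membership structure.

```python
from typing import Iterable, Set

def build_ngram_index(corpus_docs: Iterable[str], n: int = 8) -> Set[str]:
    """Hypothetical index: the set of all word n-grams in the trusted corpus.
    A plain Python set is enough to illustrate the reward computation."""
    index = set()
    for doc in corpus_docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            index.add(" ".join(words[i:i + n]))
    return index

def quote_reward(response: str, index: Set[str], n: int = 8) -> float:
    """Fraction of word positions in the response covered by some n-gram
    found verbatim in the corpus (0.0 = no quoting, 1.0 = fully quoted)."""
    words = response.split()
    if len(words) < n:
        return 0.0
    covered = [False] * len(words)
    for i in range(len(words) - n + 1):
        if " ".join(words[i:i + n]) in index:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(words)
```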

The procedure involves generating multiple responses to a given prompt, constructing synthetic preference data by ranking these responses according to their quoting percentage, and iteratively tuning the model to favor quoting more from the designated corpus, as sketched below.
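A minimal sketch of that data-construction step, assuming the `quote_reward` index above and a generic `generate` sampling function (both hypothetical names); the paper's actual pipeline and preference-optimization objective may differ in detail.

```python
def build_preference_pairs(prompts, generate, index, num_samples: int = 4):
    """For each prompt, sample several responses, score them by quoting
    percentage, and keep the highest- and lowest-scoring ones as a
    (chosen, rejected) pair for preference optimization (e.g., DPO)."""
    pairs = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(num_samples)]
        scored = sorted(responses, key=lambda r: quote_reward(r, index))
        chosen, rejected = scored[-1], scored[0]
        if quote_reward(chosen, index) > quote_reward(rejected, index):
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs
```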

Experimental Findings

Evaluations across diverse domains, including long-form question answering and open-ended text completion, show a significant improvement in the model's ability to quote directly from high-quality pre-training documents, with a 55% to 130% relative increase in quoting over baseline models. The improvement holds not only within domains seen during training but also in out-of-domain evaluations, indicating robust generalization of the quoting preference. Beyond increased quoting, Quote-Tuning also yields improvements in the truthfulness of the generated content, a valuable side benefit of the approach.

Implications and Future Directions

The introduction of Quote-Tuning as a method for improving verifiability in LLMs opens up new avenues for developing more trustworthy and reliable AI systems. By utilizing the rich knowledge embedded in the pre-training data, this approach provides a straightforward yet effective means to verify model-generated content. Looking ahead, there is potential to explore further refinements to the quoting mechanism, such as distinguishing between quoting a few long segments versus many shorter ones, and to study how Quote-Tuning interacts with instruction-tuned models.

Additionally, while this paper focused solely on exploiting parametric knowledge for quoting, future work could investigate how non-parametric knowledge bases can be combined with Quote-Tuning to enrich the model's ability to access and reference factual information. Another compelling direction is examining the intersection of Quote-Tuning with mechanisms for generating citations, potentially providing a more integrated solution for model verifiability and attribution.

In summary, Quote-Tuning represents a significant step forward in addressing the challenge of verifiability in LLMs. It sets a foundation for further research into leveraging the intrinsic knowledge of these models to produce not only fluent but also verifiable and trustworthy content.

Authors (5)
  1. Jingyu Zhang (40 papers)
  2. Marc Marone (11 papers)
  3. Tianjian Li (7 papers)
  4. Benjamin Van Durme (173 papers)
  5. Daniel Khashabi (83 papers)
Citations (6)