GRATH: Gradual Self-Truthifying for Large Language Models (2401.12292v2)
Abstract: Truthfulness is paramount for LLMs as they are increasingly deployed in real-world applications. However, existing LLMs still struggle to generate truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method for enhancing the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, each pair containing a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between the answer pairs. GRATH iteratively refines the truthfulness data and updates the model, gradually improving model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH on different 7B LLMs and compare against models of similar or larger size on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, surpassing even 70B LLMs.
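Since the pairwise DPO update is the abstract's key ingredient, a minimal PyTorch sketch of that objective may help make it concrete. This is an illustrative rendering of the standard DPO loss (Rafailov et al., 2023), not the authors' implementation; the function name, the sequence-level log-probability inputs, and the default beta=0.1 are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F


def dpo_loss(pi_chosen_logps: torch.Tensor,
             pi_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (correct, incorrect) answer pairs.

    Each tensor holds the summed token log-probabilities of the correct
    ("chosen") or incorrect ("rejected") answer under the trainable
    policy (pi_*) or the frozen reference model (ref_*).
    """
    # Implicit rewards: how far the policy has moved away from the
    # reference model on each answer.
    chosen_logratio = pi_chosen_logps - ref_chosen_logps
    rejected_logratio = pi_rejected_logps - ref_rejected_logps
    # Maximize the margin between the correct and incorrect answers.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


if __name__ == "__main__":
    # Dummy sequence log-probabilities for a batch of two answer pairs.
    pi_chosen = torch.tensor([-12.3, -9.8])
    pi_rejected = torch.tensor([-15.9, -11.2])
    ref_chosen = torch.tensor([-13.0, -10.1])
    ref_rejected = torch.tensor([-15.0, -11.0])
    print(dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected))
```

What makes the procedure "gradual" and self-supervised, per the abstract, is the surrounding loop: the model itself generates the correct/incorrect answer pairs for out-of-domain questions, a DPO update is applied against those pairs, and the cycle of regenerating data and updating the model repeats.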
Authors: Weixin Chen, Bo Li, Dawn Song