Enhancing Factuality in Open-Ended Text Generation: Insights from Factuality Enhanced LLMs
The advent of large-scale pre-trained language models (LMs) has revolutionized natural language generation, but it has also highlighted a critical challenge: factual accuracy in generated content. The paper "Factuality Enhanced Language Models for Open-Ended Text Generation" by Lee et al. addresses the susceptibility of these models to generating nonfactual information. The authors propose a comprehensive framework for measuring and improving the factuality of LMs, targeting the complex task of open-ended text generation.
Key Contributions and Methodologies
The research makes several key contributions to the field:
- Benchmarking Factuality: The authors introduce FactualityPrompts, a benchmark comprising both factual and nonfactual prompts, and use it to systematically assess the factual accuracy of LMs across scales. Their analysis spans LMs from 126M to 530B parameters and reveals that larger models tend to generate more factual content, despite prior observations suggesting that larger models harbor more misconceptions. A central metric is the rate of hallucinated named entities in generated continuations (see the evaluation sketch after this list).
- Decoding Algorithms: The paper scrutinizes popular decoding algorithms such as top-p (nucleus) sampling, which are commonly used in open-ended text generation. It finds that the uniform randomness these algorithms inject at every decoding step harms factual accuracy. To counteract this, the authors propose the factual-nucleus sampling algorithm, which dynamically decays the amount of randomness within each sentence to enhance factuality without sacrificing generation quality (see the decoding sketch after this list).
- Factuality-Enhanced Training: The paper analyzes why standard training is inefficient at learning factual associations from corpora such as Wikipedia. The researchers introduce a factuality-enhanced training method that combines a TopicPrefix preprocessing step, which prepends the document topic to each sentence, with a sentence-completion training objective. This approach significantly reduces factual errors in model outputs (see the training sketch after this list).
- Empirical Evaluation: The empirical evaluation shows significant improvements in factual accuracy. Notably, the factuality-enhanced 530B LM reduces named-entity factual errors from 33.3% to 14.5%, a noteworthy advance in the reliability of generated content.
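
For concreteness, here is a rough sketch of how a named-entity hallucination check against a ground-truth reference document could look. The spaCy model choice, the simple string-matching heuristic, and the `ne_error_rate` helper are illustrative assumptions, not the authors' evaluation code.

```python
# Hypothetical sketch of a named-entity (NE) hallucination check: score a
# generated continuation against its ground-truth reference document.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English NER model (assumed choice)

def ne_error_rate(generation: str, reference_doc: str) -> float:
    """Fraction of named entities in `generation` that never appear in the
    reference document (a simple proxy for hallucinated entities)."""
    gen_ents = {ent.text.lower() for ent in nlp(generation).ents}
    if not gen_ents:
        return 0.0
    ref_text = reference_doc.lower()
    hallucinated = [e for e in gen_ents if e not in ref_text]
    return len(hallucinated) / len(gen_ents)

# Example: "Kenya" and "1965" are counted as hallucinated entities.
print(ne_error_rate("Barack Obama was born in Kenya in 1965.",
                    "Barack Obama was born in Hawaii in 1961."))
```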
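The decoding sketch below illustrates the core idea of factual-nucleus sampling: the nucleus mass decays within a sentence, roughly p_t = max(ω, p·λ^t), and resets at each sentence boundary. The tensor handling, default hyperparameters, and reset bookkeeping here are simplifying assumptions, not the authors' implementation.

```python
# A minimal sketch of dynamic nucleus decay for one decoding step, assuming
# `logits` is a 1-D tensor over the vocabulary.
import torch

def factual_nucleus_filter(logits, t_in_sentence, p=0.9, lam=0.9, omega=0.3):
    """Return a renormalized distribution restricted to the decayed nucleus."""
    p_t = max(omega, p * (lam ** t_in_sentence))      # decayed nucleus mass
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep the smallest prefix of tokens whose cumulative mass reaches p_t
    keep = cumulative - sorted_probs < p_t
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[sorted_idx[keep]] = True
    filtered = torch.where(mask, probs, torch.zeros_like(probs))
    return filtered / filtered.sum()

# Usage: increment t_in_sentence each step, and reset it to 0 after a token
# that ends a sentence (e.g., "."), so every new sentence starts with full p.
```

The design intuition is that the first tokens of a sentence mostly determine its topic and syntax, where diversity is desirable, while later tokens fill in factual details, where randomness is more likely to introduce errors.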
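Finally, the training sketch below shows one plausible way to realize the TopicPrefix preprocessing and the loss masking behind a sentence-completion objective. The prefix format, the pivot choice, and the helper names are assumptions made for illustration.

```python
# Hypothetical helpers: prepend the document topic to each sentence, and build
# a loss mask so the LM loss is applied only to the latter part of each sentence.
def add_topic_prefix(title: str, sentences: list[str]) -> list[str]:
    """Prepend the article title so each sentence is self-contained."""
    return [f"{title}: {s}" for s in sentences]

def sentence_completion_loss_mask(token_ids: list[int],
                                  sentence_starts: list[int],
                                  pivot_ratio: float = 0.5) -> list[int]:
    """Return 1 where the LM loss is applied, 0 where it is ignored.
    Within each sentence, only tokens after the pivot contribute to the loss."""
    mask = [0] * len(token_ids)
    bounds = sentence_starts + [len(token_ids)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        pivot = start + int((end - start) * pivot_ratio)
        for i in range(pivot, end):
            mask[i] = 1
    return mask

# Example: the loss is computed only on the second half of each sentence.
print(add_topic_prefix("Marie Curie", ["She won two Nobel Prizes."]))
print(sentence_completion_loss_mask(list(range(10)), sentence_starts=[0, 6]))
```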
Implications and Future Directions
The research has significant implications for both theory and practice. Theoretically, it refines our understanding of the relationship between model size and factuality, showing that scale can improve factual accuracy and that targeted decoding and training enhancements improve it further. Practically, the proposed strategies could be adopted to improve the deployment safety of generative models in real-world applications such as content creation and dialogue systems.
Looking forward, this work opens avenues for future exploration, particularly in improving the factual reasoning capabilities of LMs. Further research could investigate combining external knowledge sources with these parametric improvements, or develop more sophisticated training and sampling methods to mitigate factual errors.
Conclusion
This paper contributes to narrowing the gap between the human-like generation capabilities of LMs and their factual reliability. By addressing both intrinsic model improvements and decoding strategies, it offers a robust framework for enhancing the factual accuracy of LMs. As generative models continue to be integrated into a wide range of applications, such advances are pivotal in ensuring that they produce outputs that are not only coherent and contextually appropriate but also factually grounded.