FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2305.14251v2)

Published 23 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluating the factuality of long-form text generated by LLMs (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong LLM, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via pip install factscore.

An Overview of FActScore: Evaluating Factual Precision in Text Generation

The paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation" addresses a significant challenge in the field of NLP: the evaluation of the factual correctness of texts generated by LLMs. It proposes a novel metric, FActScore, to address the inadequacies found in traditional binary evaluation methods and the constraints associated with human evaluations.

Problem Statement and Motivation

In text generation tasks, especially with long-form content, LLMs often produce a mix of correct and incorrect information. Binary measures, which judge the entire output as either factually correct or incorrect, fail to capture this nuance. Moreover, relying on human evaluators can be both time-consuming and financially prohibitive. The paper presents FActScore as a metric designed to quantify factual precision with finer granularity, thus providing more informative feedback about LLM outputs.

Methodology: FActScore Framework

FActScore decomposes a generated text into atomic facts, short statements that each convey a single piece of information, and checks each one against a reliable knowledge source, in this paper the English Wikipedia. The score is the percentage of atomic facts supported by that knowledge source, which gives a much clearer picture of the factual reliability of a generation than an all-or-nothing judgment. The authors conducted extensive human evaluations using this metric on people biographies generated by InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI, and found substantial differences in factual precision; ChatGPT, for instance, achieved a FActScore of only 58%.
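In notation close to the paper's (paraphrased here), with M a language model, x a prompt, A_y the set of atomic facts extracted from a response y, and C the knowledge source, the metric is

f(y) = \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{1}\left[a \text{ is supported by } \mathcal{C}\right]

\mathrm{FActScore}(M) = \mathbb{E}_{x \sim X}\left[\, f(M_x) \mid M_x \text{ responds} \,\right]

where M_x is the model's response to x and the expectation is taken over prompts for which the model does not abstain.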

Recognizing the need for scalable evaluation, the authors also developed an automated estimator of FActScore that combines passage retrieval with a strong LLM verifier, achieving an error rate of less than 2% relative to human judgments. They applied it to 6,500 generations from 13 recent LMs, an evaluation that would have cost roughly $26K with human annotators, and found that GPT-4 and ChatGPT are more factual than public models, with Vicuna and Alpaca among the strongest of the public models.
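The following Python sketch illustrates the general shape of such an estimator; it is not the released factscore package's API. The retrieve_passages and is_supported arguments are hypothetical stand-ins for the paper's Wikipedia retriever and LLM verifier, and the toy callables in the usage example exist only to make the sketch runnable.

# Illustrative sketch of a FActScore-style estimator (not the official factscore API).
from typing import Callable, List

def estimate_factscore(
    atomic_facts: List[str],                          # atomic facts split from one generation
    retrieve_passages: Callable[[str], List[str]],    # returns top-k knowledge-source passages for a fact
    is_supported: Callable[[str, List[str]], bool],   # judge (e.g. an LLM) deciding if the fact is supported
) -> float:
    """Return the fraction of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        return 0.0
    supported = 0
    for fact in atomic_facts:
        passages = retrieve_passages(fact)
        if is_supported(fact, passages):
            supported += 1
    return supported / len(atomic_facts)

# Toy usage with trivial stand-ins; a real system would retrieve Wikipedia passages
# and prompt a strong LLM to answer True/False given the fact and the passages.
facts = ["Sewon Min is a researcher.", "FActScore was introduced in 2023."]
score = estimate_factscore(
    facts,
    retrieve_passages=lambda fact: ["(retrieved passage placeholder)"],
    is_supported=lambda fact, passages: True,
)
print(f"FActScore estimate: {score:.2f}")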

Experimental Results and Implications

A key finding of this research is the wide variability in factual precision across models. Proprietary models such as GPT-4 and ChatGPT exceed the factual precision of the publicly available models evaluated, providing a reference point for future work in text generation. These insights matter both to developers of LLMs and to users who rely on these models for accurate information synthesis.

Implications and Future Directions

The introduction of FActScore has significant implications for the field of AI and NLP. Practically, it provides a tool that can be instrumental in assessing and improving the reliability of text generated by LLMs. Theoretically, it opens avenues for further research into fine-grained evaluation metrics and the development of models that can autonomously understand and enhance their factual accuracy.

The paper also highlights the potential for similar metrics in assessing other qualitative attributes of text generation, such as coherence and relevance. Future research could explore the adaptation of FActScore to other domains beyond biographies or experiment with different knowledge bases to accommodate various languages and cultural contexts.

In conclusion, by addressing the critical challenge of factual evaluation in text generation, the paper provides a framework that both practitioners and researchers can use to measure and improve the factual accuracy of AI-generated content. The proposed methodology and findings are immediately useful and also lay a foundation for future work on factuality in LLMs.

Authors (9)
  1. Sewon Min (45 papers)
  2. Kalpesh Krishna (30 papers)
  3. Xinxi Lyu (5 papers)
  4. Mike Lewis (78 papers)
  5. Wen-tau Yih (84 papers)
  6. Pang Wei Koh (64 papers)
  7. Mohit Iyyer (87 papers)
  8. Luke Zettlemoyer (225 papers)
  9. Hannaneh Hajishirzi (176 papers)
Citations (447)