UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models (2402.14690v1)
Abstract: LLMs may generate text that is inconsistent with human knowledge, leading to factual inaccuracies or *hallucination*. Existing research on evaluating the factuality of LLMs extracts fact claims with an LLM and verifies them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources across tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks covering six representative datasets. We then propose `UFO`, an LLM-based unified and flexible evaluation framework that verifies facts against plug-and-play fact sources, and we implement five evaluation scenarios on top of it. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and that they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.
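The abstract describes a two-stage pipeline: extract fact claims from generated text, then verify each claim against any of several interchangeable fact sources. The sketch below illustrates that plug-and-play structure in minimal Python. All names here (`FactSource`, `evaluate_factuality`, the toy extractor and verifier) are illustrative assumptions, not the interfaces in the UFO repository; in the actual framework, extraction and verification are LLM calls and the sources are backed by evidence corpora, retrievers, or search APIs.

```python
# Minimal sketch of a claim-extraction + plug-and-play verification loop,
# assuming the pipeline shape described in the abstract. Names are
# hypothetical, not taken from the UFO codebase.
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Claim:
    text: str


class FactSource(Protocol):
    """Any of the four source types (human-written evidence, reference
    documents, search engine results, LLM knowledge) can implement this."""

    def retrieve(self, claim: Claim) -> str: ...


def evaluate_factuality(
    generated_text: str,
    extract_claims: Callable[[str], List[Claim]],
    verify: Callable[[Claim, str], bool],
    sources: List[FactSource],
) -> float:
    """Fraction of extracted claims supported by at least one source
    (a simple precision-style factuality score)."""
    claims = extract_claims(generated_text)
    if not claims:
        return 1.0  # nothing to verify
    supported = 0
    for claim in claims:
        # Plug-and-play: try each configured fact source in turn.
        if any(verify(claim, src.retrieve(claim)) for src in sources):
            supported += 1
    return supported / len(claims)


if __name__ == "__main__":
    # Toy stand-ins; in practice these would be LLM-backed components.
    class StaticSource:
        def __init__(self, evidence: str):
            self.evidence = evidence

        def retrieve(self, claim: Claim) -> str:
            return self.evidence

    extract = lambda t: [Claim(s.strip()) for s in t.split(".") if s.strip()]
    verify = lambda c, ev: c.text.lower() in ev.lower()

    score = evaluate_factuality(
        "Paris is the capital of France",
        extract, verify, [StaticSource("paris is the capital of france.")],
    )
    print(f"factual precision: {score:.2f}")
```

Because sources share one interface, swapping human-written evidence for search engine results (the substitutability question the paper studies) only changes which `FactSource` objects are passed in, not the evaluation loop itself.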
Authors: Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen