
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation (2410.22257v1)

Published 29 Oct 2024 in cs.CL

Abstract: Language models (LMs) are widely used by an increasing number of users, underscoring the challenge of maintaining factuality across a broad range of topics. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved evidence from the Web. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect and inconclusive LM responses. These prompts form FactBench, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from the GPT, Gemini, and Llama3.1 families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance declining from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases. Our code and data are publicly available at https://huggingface.co/spaces/launch/factbench.

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

This paper presents a novel benchmark, FactBench, which addresses the challenge of factuality in language models (LMs). Given the increasing use of LMs and their tendency to generate false or irrelevant content, the authors propose an evaluation framework named VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation). This framework evaluates the factuality of LM responses in real-world interactions, focusing on their verifiability against evidence retrieved from the web.

Key Contributions

  1. Dynamic Factuality Benchmark: FactBench is designed to be a dynamic and diverse factuality evaluation benchmark grounded in real-world requirements. Notably, this benchmark incorporates hallucination prompts—queries that result in the highest rates of false or inconclusive responses from LMs. This dynamic nature allows the benchmark to stay relevant as new factuality challenges emerge.
  2. Factuality Evaluation Pipeline (VERIFY): VERIFY offers a systematic pipeline for factuality evaluation. The framework categorizes LM-generated content units as supported, unsupported, or undecidable based on evidence retrieved from the web, and its factuality judgments correlate more strongly with human evaluations than those of existing methods (a minimal sketch of this labeling scheme appears after this list).
  3. Empirical Evaluation and Findings: The authors benchmark popular LMs from families such as GPT, Gemini, and Llama3.1 against FactBench. The results reveal that proprietary models deliver better factuality, with factual accuracy decreasing from easy to hard hallucination prompts. For instance, Llama3.1-405B-Instruct performs comparably to or worse than Llama3.1-70B-Instruct, primarily due to the former's higher degree of subjectivity that leads to more undecidable content.
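To make the three-way labeling concrete, below is a minimal Python sketch of how a VERIFY-style scorer might work once content units and their retrieved web evidence are available. The `Unit`, `Judge`, and scoring functions are illustrative assumptions rather than the paper's released implementation, and the aggregations shown (supported fraction among decidable units, plus an unsupported-or-undecidable rate for ranking hallucination prompts) are only one plausible reading of the pipeline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"


@dataclass
class Unit:
    text: str            # one verifiable claim extracted from an LM response
    evidence: List[str]  # web snippets retrieved for this claim


# `Judge` stands in for the LLM-based verdict step; it is an assumed
# interface, not the released VERIFY code.
Judge = Callable[[str, List[str]], str]


def label_unit(unit: Unit, judge: Judge) -> Label:
    """Map one content unit to supported / unsupported / undecidable."""
    return Label(judge(unit.text, unit.evidence))


def factuality_score(units: List[Unit], judge: Judge) -> float:
    """Fraction of decidable units that are supported (one plausible aggregation)."""
    labels = [label_unit(u, judge) for u in units]
    decidable = [l for l in labels if l is not Label.UNDECIDABLE]
    return sum(l is Label.SUPPORTED for l in decidable) / len(decidable) if decidable else 0.0


def hallucination_rate(units: List[Unit], judge: Judge) -> float:
    """Fraction of units that are unsupported or undecidable; prompts with the
    highest rates would be candidates for FactBench's hallucination prompts."""
    labels = [label_unit(u, judge) for u in units]
    return sum(l is not Label.SUPPORTED for l in labels) / len(labels) if labels else 0.0
```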

Implications and Future Directions

The introduction of FactBench and \system has significant implications for LM development and evaluation. The ability of FactBench to adaptively incorporate new hallucination prompts indicates a shift towards more responsive evaluation methods, capable of tracking the evolving capabilities and shortcomings of LMs. This dynamic benchmarking is particularly crucial as LMs are increasingly utilized in diverse and complex real-world applications.

Moreover, the insights gained from the empirical evaluations highlight the need for LMs to balance factual accuracy with refusal strategies: Gemini1.5-Pro, for instance, showed a markedly higher refusal rate, with over-refusal in roughly 25% of cases. This positions refusal behavior as a critical area for improving factual accuracy without compromising response quality; a small sketch of how such rates could be computed follows.
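As a rough illustration of the refusal statistics mentioned above, the sketch below computes a refusal rate and an over-refusal rate from per-response annotations. The field names and the notion of an "answerable" prompt are assumptions made here for illustration, not the paper's exact protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotated:
    refused: bool      # the model declined to answer
    answerable: bool   # annotators judged the prompt safely answerable


def refusal_rate(responses: List[Annotated]) -> float:
    """Share of all prompts the model refused."""
    return sum(r.refused for r in responses) / len(responses)


def over_refusal_rate(responses: List[Annotated]) -> float:
    """Share of refusals that occurred on prompts judged answerable."""
    refusals = [r for r in responses if r.refused]
    if not refusals:
        return 0.0
    return sum(r.answerable for r in refusals) / len(refusals)
```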

In terms of future work, the authors suggest that more sophisticated factuality evaluation approaches could incorporate logical coherence across content units in LM responses. This would further strengthen the evaluation framework and ensure that LMs are assessed for both individual factual correctness and overall narrative consistency.

In conclusion, the paper presents a structured and adaptable approach to evaluating factuality in LMs, marking a step towards more reliable and context-aware language modeling in real-world scenarios. FactBench, with its dynamic nature and robust evaluation measures, is poised to become a pivotal tool in the ongoing development of factuality evaluation methods in artificial intelligence.

Authors (4)
  1. Farima Fatahi Bayat
  2. Lechen Zhang
  3. Sheza Munir
  4. Lu Wang