FactBench: A Dynamic Benchmark for In-the-Wild LLM Factuality Evaluation
This paper presents FactBench, a benchmark that addresses the challenge of factuality in large language models (LMs). Given the growing use of LMs and their tendency to generate false or irrelevant content, the authors propose a comprehensive evaluation framework named VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), which assesses the factuality of LM responses in real-world interactions by judging their verifiability against web evidence.
Key Contributions
- Dynamic Factuality Benchmark: FactBench is a dynamic, diverse factuality benchmark grounded in real-world usage. Notably, it includes a tier of hallucination prompts: queries that elicit the highest rates of false or inconclusive responses from LMs. This dynamic design allows the benchmark to stay relevant as new factuality challenges emerge.
- Factuality Evaluation Pipeline (VERIFY): VERIFY offers a systematic pipeline for factuality evaluation. It labels units of LM-generated content as supported, unsupported, or undecidable based on evidence retrieved from the web, and its factuality judgments correlate more strongly with human evaluations than those of existing methods (a minimal sketch of such a pipeline follows this list).
- Empirical Evaluation and Findings: The authors benchmark popular LMs from the GPT, Gemini, and Llama3.1 families on FactBench. The results show that proprietary models deliver better factuality, with factual accuracy decreasing from easy to hard hallucination prompts. For instance, Llama3.1-405B-Instruct performs comparably to or worse than Llama3.1-70B-Instruct, primarily because the larger model produces more subjective and therefore undecidable content.
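To make the pipeline concrete, here is a minimal sketch, not the authors' implementation: it assumes placeholder callables for the three stages described above (content-unit extraction, web evidence retrieval, and evidence-based judging), and shows how a per-prompt rate of unsupported or undecidable units could be used to rank candidate hallucination prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Literal

Label = Literal["supported", "unsupported", "undecidable"]

@dataclass
class UnitVerdict:
    unit: str            # a single verifiable content unit extracted from the response
    label: Label         # judgment of the unit against the retrieved web evidence
    evidence: List[str]  # evidence snippets used to reach that judgment

def verify_response(
    response: str,
    extract_units: Callable[[str], List[str]],      # e.g. an LLM that splits text into atomic claims
    retrieve_evidence: Callable[[str], List[str]],  # e.g. a web-search wrapper returning snippets
    judge: Callable[[str, List[str]], Label],       # e.g. an LLM judge comparing claim vs. evidence
) -> List[UnitVerdict]:
    """Label each verifiable unit of a model response as
    supported / unsupported / undecidable given web evidence."""
    verdicts = []
    for unit in extract_units(response):
        evidence = retrieve_evidence(unit)
        verdicts.append(UnitVerdict(unit, judge(unit, evidence), evidence))
    return verdicts

def hallucination_score(verdicts: List[UnitVerdict]) -> float:
    """Fraction of units that are unsupported or undecidable; prompts with the
    highest scores across models are natural candidates for a hallucination tier."""
    if not verdicts:
        return 0.0
    bad = sum(v.label != "supported" for v in verdicts)
    return bad / len(verdicts)
```

The stage functions are deliberately left abstract: the paper specifies its own extraction, retrieval, and judging components, and any of them could be swapped in behind these placeholder callables.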
Implications and Future Directions
The introduction of FactBench and VERIFY has significant implications for LM development and evaluation. FactBench's ability to adaptively incorporate new hallucination prompts marks a shift towards more responsive evaluation methods, capable of tracking the evolving capabilities and shortcomings of LMs. Such dynamic benchmarking is particularly important as LMs are increasingly deployed in diverse and complex real-world applications.
Moreover, the empirical findings highlight the need for LMs to balance factual accuracy with refusal strategies, as illustrated by Gemini1.5-Pro, which showed a high refusal rate on uncertain prompts. This positions refusal behavior as a critical lever for improving factual accuracy without compromising response quality.
As future work, the authors suggest that more sophisticated factuality evaluation could also account for logical coherence across the content units of an LM response. This would strengthen the framework by assessing responses for both unit-level factual correctness and overall narrative consistency.
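The paper leaves this direction to future work. Purely as an illustration of what such a check might look like, the sketch below scans pairs of content units for mutual contradictions using an arbitrary NLI-style scorer; the `nli` callable is a placeholder assumption, not part of VERIFY.

```python
from itertools import combinations
from typing import Callable, List, Tuple

def coherence_conflicts(
    units: List[str],
    nli: Callable[[str, str], str],  # placeholder: returns "entailment" / "neutral" / "contradiction"
) -> List[Tuple[str, str]]:
    """Flag pairs of content units from one response that an NLI-style model
    judges to be mutually contradictory, as a rough proxy for narrative consistency."""
    return [
        (a, b)
        for a, b in combinations(units, 2)
        if nli(a, b) == "contradiction"
    ]
```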
In conclusion, the paper presents a structured and adaptable approach to evaluating factuality in LMs, marking a step towards more reliable and context-aware language modeling in real-world scenarios. FactBench, with its dynamic design and robust evaluation measures, is poised to become a pivotal tool in the ongoing development of factuality evaluation methods in artificial intelligence.