
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation (2410.22257v1)

Published 29 Oct 2024 in cs.CL

Abstract: Language models (LMs) are widely used by an increasing number of users, underscoring the challenge of maintaining factuality across a broad range of topics. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on the retrieved evidence from the Web. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect and inconclusive LM responses. These prompts form FactBench, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from the GPT, Gemini, and Llama3.1 families on FactBench, yielding the following key findings: (i) Proprietary models exhibit better factuality, with performance declining from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual accuracy than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases. Our code and data are publicly available at https://huggingface.co/spaces/launch/factbench.

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

This paper presents a novel benchmark, FactBench, which addresses the challenge of factuality in language models (LMs). Given the increasing use of LMs and their tendency to generate false or irrelevant content, the authors propose an evaluation framework named VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation). This framework evaluates the factuality of LM responses in real-world interactions, focusing on their verifiability against evidence retrieved from the web.

Key Contributions

  1. Dynamic Factuality Benchmark: FactBench is designed to be a dynamic and diverse factuality evaluation benchmark grounded in real-world requirements. Notably, this benchmark incorporates hallucination prompts—queries that result in the highest rates of false or inconclusive responses from LMs. This dynamic nature allows the benchmark to stay relevant as new factuality challenges emerge.
  2. Factuality Evaluation Pipeline (VERIFY): VERIFY offers a systematic pipeline for factuality evaluation. The framework categorizes LM-generated content units as supported, unsupported, or undecidable based on evidence retrieved from the web, and its factuality judgments correlate more strongly with human evaluations than those of existing methods (a minimal sketch of this labeling scheme appears after this list).
  3. Empirical Evaluation and Findings: The authors benchmark popular LMs from families such as GPT, Gemini, and Llama3.1 against FactBench. The results reveal that proprietary models deliver better factuality, with factual accuracy decreasing from easy to hard hallucination prompts. For instance, Llama3.1-405B-Instruct performs comparably to or worse than Llama3.1-70B-Instruct, primarily due to the former's higher degree of subjectivity that leads to more undecidable content.
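To make the three-way labeling concrete, below is a minimal Python sketch of how a VERIFY-style scorer might work once content units and their retrieved web evidence are available. The `Unit`, `Judge`, and scoring functions are illustrative assumptions rather than the paper's released implementation, and the aggregations shown (supported fraction among decidable units, plus an unsupported-or-undecidable rate for ranking hallucination prompts) are only one plausible reading of the pipeline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"


@dataclass
class Unit:
    text: str            # one verifiable claim extracted from an LM response
    evidence: List[str]  # web snippets retrieved for this claim


# `Judge` stands in for the LLM-based verdict step; it is an assumed
# interface, not the released VERIFY code.
Judge = Callable[[str, List[str]], str]


def label_unit(unit: Unit, judge: Judge) -> Label:
    """Map one content unit to supported / unsupported / undecidable."""
    return Label(judge(unit.text, unit.evidence))


def factuality_score(units: List[Unit], judge: Judge) -> float:
    """Fraction of decidable units that are supported (one plausible aggregation)."""
    labels = [label_unit(u, judge) for u in units]
    decidable = [l for l in labels if l is not Label.UNDECIDABLE]
    return sum(l is Label.SUPPORTED for l in decidable) / len(decidable) if decidable else 0.0


def hallucination_rate(units: List[Unit], judge: Judge) -> float:
    """Fraction of units that are unsupported or undecidable; prompts with the
    highest rates would be candidates for FactBench's hallucination prompts."""
    labels = [label_unit(u, judge) for u in units]
    return sum(l is not Label.SUPPORTED for l in labels) / len(labels) if labels else 0.0
```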

Implications and Future Directions

The introduction of FactBench and \system has significant implications for LM development and evaluation. The ability of FactBench to adaptively incorporate new hallucination prompts indicates a shift towards more responsive evaluation methods, capable of tracking the evolving capabilities and shortcomings of LMs. This dynamic benchmarking is particularly crucial as LMs are increasingly utilized in diverse and complex real-world applications.

Moreover, the insights gained from the empirical evaluations highlight the need for LMs to balance factual accuracy with refusal strategies: Gemini1.5-Pro, for instance, showed a markedly higher refusal rate, with over-refusal in roughly 25% of cases. This positions refusal behavior as a critical area for improving factual accuracy without compromising response quality; a small sketch of how such rates could be computed follows.
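As a rough illustration of the refusal statistics mentioned above, the sketch below computes a refusal rate and an over-refusal rate from per-response annotations. The field names and the notion of an "answerable" prompt are assumptions made here for illustration, not the paper's exact protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotated:
    refused: bool      # the model declined to answer
    answerable: bool   # annotators judged the prompt safely answerable


def refusal_rate(responses: List[Annotated]) -> float:
    """Share of all prompts the model refused."""
    return sum(r.refused for r in responses) / len(responses)


def over_refusal_rate(responses: List[Annotated]) -> float:
    """Share of refusals that occurred on prompts judged answerable."""
    refusals = [r for r in responses if r.refused]
    if not refusals:
        return 0.0
    return sum(r.answerable for r in refusals) / len(refusals)
```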

In terms of future work, the authors suggest that more sophisticated factuality evaluation approaches could incorporate logical coherence across content units in LM responses. This would further strengthen the evaluation framework and ensure that LMs are assessed for both individual factual correctness and overall narrative consistency.

In conclusion, the paper presents a structured and adaptable approach to evaluating factuality in LMs, marking a step towards more reliable and context-aware language modeling in real-world scenarios. FactBench, with its dynamic nature and robust evaluation measures, is poised to become a pivotal tool in the ongoing development of factuality evaluation methods in artificial intelligence.

Authors (4)
  1. Farima Fatahi Bayat
  2. Lechen Zhang
  3. Sheza Munir
  4. Lu Wang