An Analysis of "Fantastic LLM Hallucinations and Where to Find Them"
The paper "Fantastic LLM Hallucinations and Where to Find Them" addresses a central challenge in deploying generative LLMs: hallucination, i.e., model outputs that are inconsistent with established world knowledge or with the input context. The paper's main contribution is a comprehensive benchmark designed to systematically study hallucination behavior across diverse domains and contexts.
Benchmark Overview and Methodology
The researchers developed HALOGEN, a hallucination benchmark consisting of 10,923 prompts spanning nine distinct domains, including programming, scientific attribution, and summarization. The benchmark uses automatic, high-precision verifiers that decompose LLM-generated content into atomic units and check each unit for factual accuracy against a high-quality knowledge source. The evaluation framework covers both response-based tasks, where a model is expected to generate content, and refusal-based tasks, where it should abstain.
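To make the decompose-and-verify pattern concrete, the following is a minimal sketch in Python. The helpers here (extract_atomic_units, KnowledgeSource, VerifiedUnit) are hypothetical stand-ins for illustration, not HALOGEN's actual verifier code, and the sentence-level decomposition is a placeholder for the paper's domain-specific verifiers.

```python
# Minimal sketch of a decompose-and-verify loop in the spirit of the benchmark.
# All names below are hypothetical stand-ins, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class VerifiedUnit:
    claim: str
    supported: bool


class KnowledgeSource:
    """Hypothetical wrapper around a high-quality reference, e.g. a package
    index for code prompts or a citation database for attribution prompts."""

    def __init__(self, known_facts: set[str]):
        self.known_facts = known_facts

    def supports(self, claim: str) -> bool:
        # Placeholder exact-match lookup; real verifiers are domain-specific.
        return claim in self.known_facts


def extract_atomic_units(response: str) -> list[str]:
    # Placeholder decomposition: one claim per sentence. A real verifier would
    # parse domain structure (e.g., import statements in generated code).
    return [s.strip() for s in response.split(".") if s.strip()]


def verify_response(response: str, source: KnowledgeSource) -> list[VerifiedUnit]:
    """Decompose a model response and check each atomic unit against the source."""
    return [VerifiedUnit(u, source.supports(u)) for u in extract_atomic_units(response)]
```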
The paper describes three key metrics for evaluating LLMs: Hallucination Score, Response Ratio, and Utility Score. These metrics are applied to roughly 150,000 generations from 14 LLMs drawn from leading model families, including GPT, Llama, and Mistral. The findings indicate that even the highest-performing models, such as GPT-4, hallucinate substantially, with hallucination scores ranging from 4% to 86% depending on the domain. This shows that hallucination is a pervasive issue in current models and underscores the need for diverse, multi-domain benchmarks.
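The sketch below illustrates how such metrics could be computed under simple assumed definitions: the hallucination score as the fraction of verified atomic units that are unsupported, the response ratio as the fraction of prompts the model answers rather than refuses, and a utility score that averages the share of supported units over answered prompts. These definitions are assumptions for illustration; the paper's exact formulas may differ.

```python
# Sketch of the three metrics under assumed definitions (not the paper's exact
# formulas). Each prompt maps to a list of booleans, one per atomic unit
# (True = verified as supported), or to None if the model refused to answer.
from __future__ import annotations


def hallucination_score(unit_supported: list[bool]) -> float:
    """Fraction of verified atomic units that are unsupported."""
    if not unit_supported:
        return 0.0
    return sum(not s for s in unit_supported) / len(unit_supported)


def response_ratio(results: dict[str, list[bool] | None]) -> float:
    """Fraction of prompts the model answered rather than refused."""
    if not results:
        return 0.0
    answered = [r for r in results.values() if r is not None]
    return len(answered) / len(results)


def utility_score(results: dict[str, list[bool] | None]) -> float:
    """Assumed definition: average share of supported units over answered prompts."""
    answered = [r for r in results.values() if r is not None]
    if not answered:
        return 0.0
    return sum(1.0 - hallucination_score(r) for r in answered) / len(answered)
```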
Error Classification and Source Analysis
LLM hallucinations were classified into three types based on their relation to training data: Type A errors arise from incorrect recollection of correct training data, Type B errors stem from incorrect data present in the training set, and Type C errors are fabrications with no apparent basis in the training data. The analysis showed that hallucinations have multiple origins and that their distribution varies significantly across domains. For instance, hallucinations in code-generation tasks often result from incorrect data in training corpora (Type B errors), while erroneous educational affiliations for US senators generally reflect incorrect recollection of correct information (Type A errors).
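This taxonomy can be read as a simple decision rule over whether the correct fact, the erroneous fact, or neither appears in the training corpus. The sketch below expresses that rule; the corpus_contains helper is a hypothetical stand-in for searching a pretraining corpus, not the tooling the authors actually use.

```python
# Hedged sketch of the Type A / B / C decision rule described above.
# `corpus_contains` is a hypothetical placeholder for pretraining-corpus search.


def corpus_contains(corpus: set[str], fact: str) -> bool:
    # Placeholder exact-match lookup; a real corpus search would be far fuzzier.
    return fact in corpus


def classify_error(corpus: set[str], hallucinated_fact: str, correct_fact: str) -> str:
    if corpus_contains(corpus, correct_fact):
        return "Type A"  # correct knowledge was in training data but misrecalled
    if corpus_contains(corpus, hallucinated_fact):
        return "Type B"  # the erroneous claim itself appears in the training data
    return "Type C"      # neither appears: fabrication
```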
This classification elucidates the nuanced nature of hallucinations and suggests that mitigation will require combining content understanding with factual verification methods. The inclusion of diverse use cases such as scientific attribution is crucial because errors in these domains, although not common, can significantly damage the credibility of models in professional contexts.
Implications and Future Directions
The benchmark and the accompanying analysis present substantial implications for both theoretical understanding and practical deployment of LLMs. By highlighting the multifaceted nature of hallucinations, the research underscores the need for targeted strategies in model development, incorporating both content understanding and external verification mechanisms. Future development in AI could benefit from improvements in data quality, refined pretraining processes, and enhanced evaluation frameworks that can dynamically adapt to the evolving landscape of LLM applications.
In conclusion, this paper provides foundational insights into hallucination behavior in LLMs, presenting a methodically constructed benchmark and rigorous analytical framework. These contributions are vital for advancing trustworthy AI systems and facilitating further research aimed at addressing the limitations of current generative models. The research presented in this paper lays the groundwork for developing more accurate and reliable LLMs, which will be critical as AI continues to integrate into complex real-world applications.