
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models (2305.11747v3)

Published 19 May 2023 in cs.CL

Abstract: LLMs, such as ChatGPT, are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for LLMs (HaluEval), a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. Besides, we also hire some human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content in specific topics by fabricating unverifiable information (i.e., about $19.5\%$ responses). Moreover, existing LLMs face great challenges in recognizing the hallucinations in texts. However, our experiments also prove that providing external knowledge or adding reasoning steps can help LLMs recognize hallucinations. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.

HaluEval: A Hallucination Evaluation Benchmark for LLMs

The paper introduces HaluEval, a large-scale benchmark designed to evaluate hallucination tendencies in LLMs such as ChatGPT. LLMs, while proficient in various NLP applications, are known to generate "hallucinations"—content that conflicts with source material or cannot be verified. This evaluation aims to explore the types and extent of hallucinations LLMs produce.

Methodology

HaluEval comprises 35,000 samples spanning question answering, knowledge-grounded dialogue, and text summarization: 5,000 general user queries paired with ChatGPT responses and 30,000 task-specific examples. The task-specific examples are produced automatically with a ChatGPT-based, two-step sampling-then-filtering framework: ChatGPT first samples candidate hallucinated answers, and the candidates are then filtered to keep the most plausible and difficult one.
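The paper does not ship a reference implementation in this summary, but the sampling-then-filtering idea can be sketched as follows. This is a minimal sketch: `chat` is a placeholder for any ChatGPT-style completion call, and the prompt wording is illustrative rather than the authors' exact instructions.

```python
# Minimal sketch of the sampling-then-filtering generation framework.
# `chat` stands in for a ChatGPT-style API call; the prompts are
# illustrative placeholders, not the paper's exact instructions.
from typing import Callable, List

def generate_hallucinated_sample(
    chat: Callable[[str], str],
    question: str,
    right_answer: str,
    n_candidates: int = 2,
) -> str:
    # Step 1: sampling -- ask the model for answers that sound plausible
    # but conflict with the correct answer.
    candidates: List[str] = [
        chat(
            "Write a plausible but hallucinated answer to the question "
            "below that contradicts the correct answer.\n"
            f"Question: {question}\nCorrect answer: {right_answer}\nAnswer:"
        )
        for _ in range(n_candidates)
    ]

    # Step 2: filtering -- have the model pick the most plausible and
    # hardest-to-detect candidate as the final hallucinated sample.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    choice = chat(
        "Select the number of the candidate answer that is most plausible "
        f"and closest to the correct answer.\nQuestion: {question}\n"
        f"Correct answer: {right_answer}\nCandidates:\n{numbered}\nNumber:"
    )
    try:
        return candidates[int(choice.strip()) - 1]
    except (ValueError, IndexError):
        return candidates[0]  # fall back if the reply is not a valid number
```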

In addition, human labelers annotated the ChatGPT responses to the 5,000 general user queries, marking whether each response contains hallucinated information. These annotations ground the assessment of LLMs' ability to recognize hallucinations.
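As a rough sketch of how the released data might be consumed: the filenames and per-file counts below are assumptions about the repository layout and should be checked against https://github.com/RUCAIBox/HaluEval before use.

```python
# Sketch of loading HaluEval samples from the released JSON files.
# Filenames and line-per-record format are assumed; verify against the repo.
import json
from pathlib import Path

DATA_DIR = Path("HaluEval/data")  # assumed location of a local clone

def load_samples(filename: str) -> list[dict]:
    """Assumes each non-empty line of a released file is one JSON record."""
    path = DATA_DIR / filename
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

qa = load_samples("qa_data.json")                         # task-specific QA samples
dialogue = load_samples("dialogue_data.json")             # knowledge-grounded dialogue
summarization = load_samples("summarization_data.json")   # text summarization
general = load_samples("general_data.json")               # 5,000 annotated user queries
print(len(qa), len(dialogue), len(summarization), len(general))
```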

Empirical Results

The human annotation shows that roughly 19.5% of ChatGPT's responses contain fabricated, unverifiable information. Existing LLMs also struggle to recognize hallucinations: ChatGPT achieves only 62.59% accuracy on the question-answering portion of the benchmark. Incorporating external knowledge and intermediate reasoning steps improves recognition, suggesting practical pathways to mitigate hallucinations.
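The recognition accuracy figures correspond to a binary classification setup: the evaluated model is asked whether a given answer contains hallucinated content, and its verdict is compared with the ground-truth label. The sketch below assumes a `classify` callable and hypothetical record keys; it is not the authors' evaluation script.

```python
# Sketch of the hallucination-recognition evaluation: accuracy is the
# fraction of samples where the model's Yes/No verdict matches the label.
from typing import Callable, Iterable

def recognition_accuracy(
    classify: Callable[[str, str], str],  # (question, answer) -> "Yes"/"No"
    samples: Iterable[dict],              # hypothetical keys: question, answer, is_hallucinated
) -> float:
    correct = total = 0
    for s in samples:
        verdict = classify(s["question"], s["answer"]).strip().lower()
        predicted_hallucinated = verdict.startswith("yes")
        correct += int(predicted_hallucinated == s["is_hallucinated"])
        total += 1
    return correct / max(total, 1)
```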

Insights and Implications

The benchmark offers a comprehensive evaluation framework that enhances understanding of hallucination patterns in LLMs. The results underscore the importance of providing LLMs with auxiliary information to refine their outputs and minimize factual errors. This research has crucial implications for deploying LLMs in sensitive applications where accuracy is paramount.
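One way to operationalize the external-knowledge finding is to prepend retrieved evidence and an explicit reasoning instruction to the recognition prompt. The snippet below is a hedged illustration under that assumption; the wording is not the paper's prompt, and the evidence string is assumed to come from some upstream retrieval step.

```python
# Illustration of knowledge-augmented hallucination recognition:
# retrieved evidence plus a step-by-step reasoning instruction are added
# to the prompt before asking for a Yes/No verdict.
def build_recognition_prompt(question: str, answer: str, evidence: str) -> str:
    return (
        "You are given a question, a candidate answer, and retrieved evidence.\n"
        "First reason step by step about whether the answer is supported by "
        "the evidence, then reply with 'Yes' if the answer contains "
        "hallucinated content and 'No' otherwise.\n\n"
        f"Evidence: {evidence}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Judgement:"
    )
```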

Future Directions

Further research could integrate dynamic knowledge retrieval with LLMs to address hallucinations more robustly. Expanding the benchmark to cover more varied datasets and hallucination types would also deepen these insights and inform LLM design.

HaluEval represents a critical step toward measuring and improving the reliability of LLMs, paving the way for future advances in AI technology.

Authors (5)
  1. Junyi Li (92 papers)
  2. Xiaoxue Cheng (12 papers)
  3. Wayne Xin Zhao (196 papers)
  4. Jian-Yun Nie (70 papers)
  5. Ji-Rong Wen (299 papers)
Citations (178)