
HaluEval: Benchmark for LLM Hallucinations

Updated 20 February 2026
  • HaluEval is a large-scale benchmark for assessing LLM hallucination tendencies by quantifying frequency, type, and triggering factors.
  • It combines automatically generated and human-annotated examples across QA, dialogue, and summarization to ensure rigorous evaluation with defined metrics.
  • The benchmark has catalyzed advances in detection and mitigation strategies, fostering improvements with techniques like AMC decoding and iterative self-training.

HaluEval is a large-scale, multi-faceted benchmark specifically devised to evaluate the propensity of LLMs to produce hallucinations: outputs that contradict source material or contain unverifiable, fabricated information. Developed to address both the detection of and resistance to hallucination across diverse language modeling tasks, HaluEval provides a rigorous, human-annotated testbed that reveals not just the overall frequency of hallucinations but also their typology, triggers, and persistence across model architectures, domains, and prompting paradigms (Li et al., 2023). The benchmark has achieved canonical status in hallucination research, underpinning both seminal detection studies and subsequent methodological advances (Chekalina et al., 2024, George et al., 2023, Chandler et al., 2024, Lamba et al., 9 Sep 2025, Marín, 15 Dec 2025, Lamba et al., 18 Nov 2025, Gu et al., 2024, Li et al., 17 Aug 2025, Li et al., 2024, Jia et al., 14 Jun 2025, Quevedo et al., 2024, Sinha, 17 Dec 2025, Zhu et al., 2024, Li et al., 3 Sep 2025).

1. Dataset Structure and Annotation Protocol

HaluEval consists of both automatically generated and human-annotated examples targeting three core tasks—question answering (QA), knowledge-grounded dialogue, and summarization (Li et al., 2023, Li et al., 2024). The full benchmark encompasses approximately 35,000 samples, partitioned as follows:

  • General User Queries: 5,000 real-world chat prompts, paired with ChatGPT-generated responses, each marked by three annotators for the presence and textual span of hallucinations.
  • Task-Specific Subsets: 10,000 examples each in QA, dialogue, and summarization, each containing both a gold-standard output and a carefully filtered hallucinated alternative. Patterns are controlled per task (e.g., in QA: comprehension, factual, specificity, inference errors).
  • Annotation for Task-Specific Data: Involves a structured sampling–then–filtering pipeline, where ChatGPT generates multiple candidate hallucinations, and another ChatGPT prompt (with access to gold answers) selects the most plausible-yet-wrong version.
  • Metric Definitions: Hallucination rate is formalized as r_H = |H| / N, where N is the number of examples and |H| the number of hallucinated responses (Li et al., 2023). Accuracy, precision, recall, and F1 are reported for hallucination detection by external models (Jia et al., 14 Jun 2025).
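These metric definitions translate directly into code. A minimal sketch (the labels and predictions below are illustrative, not drawn from the benchmark release):

```python
def hallucination_rate(labels):
    """r_H = |H| / N: fraction of responses flagged as hallucinated (1 = hallucinated)."""
    return sum(labels) / len(labels)

def detection_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary hallucination detector."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```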

HaluEval is complemented by HaluEval 2.0, which broadens domain coverage (biomedicine, finance, science, education, open-domain) and uses LLM-assisted fact extraction plus human validation to categorize hallucination types (entity-errors, relation-errors, overclaim, outdatedness, unverifiability, etc.) (Li et al., 2024).

2. Hallucination Typology, Symbolic Triggers, and Input Transformations

A defining feature of HaluEval is its attention to fine-grained triggers and taxonomy of hallucinations. Symbolic properties—modifiers, named entities, numbers, negation, and exceptions—are systematically annotated in each prompt using POS and NER tagging (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025). These are then studied both in standard QA and in format variants such as multiple-choice and odd-one-out, enabling the isolation of symbolic effects from purely generative or format-based confounds (Lamba et al., 9 Sep 2025).
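The annotation idea can be illustrated with a lightweight heuristic tagger. The cited papers use full POS and NER pipelines; the keyword and regex rules below are a simplified stand-in, not the authors' implementation:

```python
import re

# Simplified stand-ins for POS/NER-based annotation of symbolic properties.
NEGATION_WORDS = {"not", "never", "no", "none", "neither", "nor"}
EXCEPTION_WORDS = {"except", "unless", "besides", "excluding"}

def tag_symbolic_properties(prompt: str) -> set:
    """Return the symbolic trigger categories detected in a prompt."""
    tokens = re.findall(r"[A-Za-z']+|\d+(?:\.\d+)?", prompt)
    tags = set()
    if any(t.lower() in NEGATION_WORDS for t in tokens):
        tags.add("negation")
    if any(t.lower() in EXCEPTION_WORDS for t in tokens):
        tags.add("exception")
    if any(re.fullmatch(r"\d+(?:\.\d+)?", t) for t in tokens):
        tags.add("number")
    # Crude named-entity cue: capitalized token not at sentence start.
    if any(t[0].isupper() for t in tokens[1:]):
        tags.add("named_entity")
    return tags
```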

Persistent findings include:

  • Modifiers and Named Entities: Induce the highest hallucination rates (84–95% in Gemma-2-2B), slightly decreasing with model scale but remaining >77% in 27B-parameter variants (Lamba et al., 9 Sep 2025).
  • Negation and Exceptions: Consistently trigger critical instability in model attention, with early-layer variance spikes localized to layers 2–4, confirmed across both HaluEval and TruthfulQA (Lamba et al., 18 Nov 2025).
  • Task Format Robustness: High hallucination rates persist across open-ended QA, MCQ, and OOO formats, indicating that symbolic vulnerability is architectural, not format-specific (Lamba et al., 9 Sep 2025, Lamba et al., 18 Nov 2025).
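Findings like these come from grouping examples by annotated trigger and computing per-trigger hallucination rates. A minimal sketch, using hypothetical field names rather than the released schema:

```python
from collections import defaultdict

def per_trigger_rates(examples):
    """Hallucination rate broken down by symbolic trigger category.

    Each example is a dict with a 'triggers' set and a boolean
    'hallucinated' flag (illustrative schema, not the release format).
    """
    counts = defaultdict(lambda: [0, 0])  # trigger -> [hallucinated, total]
    for ex in examples:
        for trig in ex["triggers"]:
            counts[trig][1] += 1
            counts[trig][0] += int(ex["hallucinated"])
    return {trig: h / n for trig, (h, n) in counts.items()}
```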

3. Detection, Scoring, and Modeling Approaches

A wide range of detection methods have been validated on HaluEval, from statistical classifiers to state-of-the-art LLM ensembles and internal neuron tracing:

  • Simple Statistical Models: Token probability-based classifiers using as few as four features (minimum/average token probability, max probability deviation/spread) reach up to 98% accuracy in summarization and 95% in QA when paired with appropriate evaluator LLMs (Quevedo et al., 2024).
  • Factored and Modular Verification: Sentence-level claim decomposition followed by per-claim verification achieves 76.2% accuracy in summarization (George et al., 2023).
  • LLM Prompt Ensembles (DEEP): Diverse, chain-of-thought-driven prompts ensembled via classifiers or label models, achieving SOTA balanced accuracy (74.9%) in summarization, with robust calibration and no test-set tuning (Chandler et al., 2024).
  • Internal State Analysis: Methods aggregating attention, activation, and feed-forward states across all transformer layers, trained with a contrastive loss, achieve up to 69.1% binary classification accuracy in QA and 67.1% in summarization, outperforming external and final-layer-only approaches (Beigi et al., 2024).
  • Semantic and Embedding-Based Metrics: Conventional retrieval-augmented and embedding-cosine methods fail on HaluEval, with false-positive rates reaching 100% due to "semantic illusions"—the inability to distinguish factually incorrect but topically plausible errors from ground-truth responses (Sinha, 17 Dec 2025).
  • N-Gram Subspace Modeling: Singular-value decomposition of n-gram frequency tensors achieves up to 99.4% accuracy (summary task, G=40), strongly outperforming ROUGE, BERTScore, and even some LLM judges (Li et al., 3 Sep 2025).
  • Causal Reasoning and Graph-Based Decoding: Explicit training to construct causal DAGs and generate variable-level reasoning traces yields absolute HaluEval gains of 3–4 points (6% relative) over standard chain-of-thought, demonstrating the mitigation of logical hallucination through structure-aware supervision (Li et al., 17 Aug 2025).
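The four-feature token-probability detector is the simplest of these methods to sketch. The feature set below is one plausible reading of "minimum/average token probability, max probability deviation/spread"; the downstream classifier (e.g., logistic regression over these features) is omitted:

```python
def token_probability_features(token_probs):
    """Four summary features over the per-token probabilities of a response:
    minimum, average, maximum absolute deviation from the mean, and spread."""
    n = len(token_probs)
    mean = sum(token_probs) / n
    return {
        "min_prob": min(token_probs),
        "avg_prob": mean,
        "max_deviation": max(abs(p - mean) for p in token_probs),
        "spread": max(token_probs) - min(token_probs),
    }
```

A sharp dip in `min_prob` or a large `spread` is the kind of signal such a classifier exploits: hallucinated spans tend to contain locally low-confidence tokens.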

4. Model Vulnerability, Symbolic Instability, and Localization

Systematic evaluations reveal that:

  • Symbolic Triggers Expose Architectural Weaknesses: Modifiers, negation, and exceptions rapidly destabilize attention and variance in early transformer layers, independent of model scale (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).
  • Attention Instability Localizes to Early Layers: Catastrophic attention variance peaks at layers 2–4 for all studied architectures, with later-stage corrections unable to compensate for early loss of symbolic grounding (Lamba et al., 18 Nov 2025).
  • Task and Input Length Effects: High hallucination rates remain robust against question or input length and are only modestly reduced by more structured input formats (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).
  • Persistence Across Scaling and Model Variants: Even at 27B parameters, symbolic hallucination rates remain well above 60%, with larger models only slightly tempering (but not eliminating) these errors (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).
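The early-layer localization result corresponds to computing attention-weight variance per layer and finding where it peaks. A toy sketch over a nested attention array (layer × head × query × key); a real analysis would operate on tensors extracted from the model:

```python
from statistics import pvariance

def attention_variance_by_layer(attn):
    """Population variance of attention weights within each layer.

    `attn` is a nested list indexed [layer][head][query][key].
    """
    variances = []
    for layer in attn:
        flat = [w for head in layer for row in head for w in row]
        variances.append(pvariance(flat))
    return variances

def peak_variance_layer(attn):
    """Index of the layer whose attention weights vary the most."""
    variances = attention_variance_by_layer(attn)
    return max(range(len(variances)), key=variances.__getitem__)
```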

5. Mitigation Strategies and Benchmarking Advances

HaluEval has served as the central testbed for multiple hallucination mitigation frameworks:

  • Multi-Information Adapters (MALM): Integrate input, context, and external knowledge into a multi-graph attention network, achieving significant boosts over LLaMA-2 on all standard metrics, and strong preference by GPT-4 and human raters (Jia et al., 14 Jun 2025).
  • Curriculum DPO Alignment: Progressive training on synthetic, high-difficulty hallucinations yields substantial HaluEval improvements (e.g., HaluCheck 3B model, F1=0.753, narrowing the gap to GPT-4o), demonstrating the effectiveness of staged, hard-negative curricula (Pandit et al., 23 May 2025).
  • Absorbing Markov Chain (AMC) Decoding: By quantifying the information contribution of each prefix token and penalizing information loss, AMC decoding reduces hallucination on HaluEval in a model-agnostic, inference-only fashion (Wu et al., 2024).
  • Iterative Self-Training (ANAH-v2): Expectation Maximization (EM)-based scaling of hallucination annotators leads to a 7B-parameter model surpassing GPT-4 in zero-shot classification on HaluEval (accuracy: 81.54%), highlighting the value of specialized annotators for both evaluation and mitigation (Gu et al., 2024).

HaluEval has inspired a proliferation of derivative and complementary datasets:

  • HaluEval-Wild: Focuses on real-world, adversarial user queries from ShareGPT, categorized by five hallucination-inducing types (e.g., out-of-scope, complex reasoning), and demonstrates that strong conversational performance (e.g., on MT-Bench) does not guarantee factual reliability on wild queries (Zhu et al., 2024).
  • HaluEval 2.0: Extends domain granularity and annotation rigor, introducing micro/macroscopic hallucination rates and causal analyses of sources, mitigation, and prompt design (Li et al., 2024).
  • Symbolic Localization: Recent work reveals that hallucination is not a late-stage generative artifact but a fundamental symbolic processing failure, suggesting architectural attention to symbolic triggers, as opposed to brute-force scaling or post-hoc correction (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).

Key open challenges include:

  • Designing Deeper Symbolic and Causal Biases: To address persistent symbolic failures, architectural or training-time interventions targeting symbolic processing are necessary (Lamba et al., 18 Nov 2025, Li et al., 17 Aug 2025).
  • Evaluating Real-World and “Event” Hallucinations: Datasets like HaluEval-Wild and Hal-Eval (for LVLMs) extend the paradigm to more open-ended, multi-modal, and event-centric errors (Zhu et al., 2024, Jiang et al., 2024).
  • Black-Box and Few-Shot Settings: Methods such as DEEP (ensemble prompt) and token-probability detectors balance accuracy and computational cost, but robust, domain-adaptable, and explanation-rich solutions remain an active research frontier (Chandler et al., 2024, Quevedo et al., 2024).

HaluEval has thus established itself as the cornerstone of empirical research on LLM hallucination, providing the definitive standard for both the diagnosis and defense against factual inaccuracies in generative models. Its meticulous construction, diverse annotation schema, and public availability have catalyzed wide-ranging advances in hallucination detection, localization, and mitigation (Li et al., 2023).
