
Human Evaluation of Hallucinations

Updated 2 February 2026
  • The topic explores human evaluation methods for LLM-generated hallucinations by defining intrinsic and extrinsic error types and establishing robust evaluative frameworks.
  • It reviews annotation protocols that combine calibrated human judgment and LLM-based screening to quantify hallucination rates using metrics like precision, recall, and Likert scales.
  • Future directions address challenges such as metric misalignment, taxonomy inconsistencies, and expert scalability, offering actionable insights for advancing research.

Hallucination in the context of LLMs refers to the generation of content that is plausible yet nonfactual, unsupported, or internally inconsistent with either explicit prompts or known ground truth. Human evaluation of hallucinations remains a critical challenge due to the ill-posed nature of “factuality,” the inherent fluency of generated text, and the diversity of hallucination types across domains and tasks. This entry surveys methodological foundations, benchmark protocols, annotation schemes, evaluative metrics, best practices, and current limitations in human-centric assessment of hallucinations, with emphasis on recent research in general, domain-specific, and creative/epistemic contexts.

1. Conceptualization and Formal Frameworks

Hallucination is rigorously defined as any output diverging from the ground-truth response function f(s) for an input string s, i.e., when an LLM at training step i produces h_i(s) ≠ f(s). Several distinctions follow from this definition (Cossio, 3 Aug 2025):

  • Intrinsic hallucinations: Contradict the explicit input or context (prompt-inconsistent).
  • Extrinsic hallucinations: Introduce content not grounded or verifiable in the prompt, possibly fabricated but not contradicting the source.
  • Factuality versus faithfulness: Factual hallucination violates world truth, while faithfulness is concerned with strict adherence to provided context.

The theoretical inevitability of hallucinations in computable LLMs is established via diagonalization arguments, implying no model can self-eliminate all hallucinations (Cossio, 3 Aug 2025).

2. Human Annotation Protocols and Guidelines

Human annotation is predicated on explicit definitions and calibrated instruction sets. Protocols generally involve:

  • Type and severity labeling: Binary labels (Hallucinated vs. Faithful), sometimes supplemented by severity scales (e.g., 1–3 or 1–5), are standard (Islam et al., 2023, Cossio, 3 Aug 2025, Zuo et al., 2024).
  • Attribute tagging and free-form errors: Where expressivity is crucial, hallucinations are pinpointed using natural-language error descriptions (“The answer claims X, but source says Y”); this free-form schema is adopted for fine-grained benchmarks (Pesiakhovsky et al., 26 Sep 2025).
  • Pilot calibration and agreement: Annotators undergo calibration sessions using gold-standard examples, followed by inter-annotator agreement measurement (Cohen’s κ or Krippendorff’s α) to ensure annotation reliability, with κ ≥ 0.7 typically targeted (Zuo et al., 2024, Cossio, 3 Aug 2025).
  • Consensus and adjudication: Double annotation of a subset, followed by consensus sessions for disagreement, is standard in both scientific proposal scoring and medical error identification (Yang et al., 25 Dec 2025, Zuo et al., 2024).

The annotation process usually distinguishes between errors rooted in context contradiction (intrinsic) and knowledge fabrication (extrinsic), and can be further disaggregated into finer taxonomies such as factual, logical, temporal, and harmful hallucinations (e.g., HalluLens, MedHallBench) (Cossio, 3 Aug 2025, Zuo et al., 2024).
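The agreement targets above (κ ≥ 0.7) can be verified with a direct Cohen's κ computation over two annotators' labels. The following is a minimal stdlib sketch, not any specific benchmark's tooling; the example labels are hypothetical:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent (chance) labeling.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling 8 outputs as Hallucinated (H) or Faithful (F).
a = ["H", "H", "F", "F", "H", "F", "F", "H"]
b = ["H", "H", "F", "F", "F", "F", "F", "H"]
print(cohens_kappa(a, b))  # → 0.75, just above the typical 0.7 target
```

The same routine extends to severity scales by treating each scale point as a label; for ordinal data or more than two annotators, Krippendorff's α is the usual substitute.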

3. Quantitative Metrics and Evaluation Schemes

Evaluation protocols incorporate both direct human annotation and automatic metrics, which attempt to correlate with or predict human-judged hallucinations.

Human-centered metrics:

  • Hallucination rate: HallRate = #hallucinations / #total outputs (Cossio, 3 Aug 2025).
  • Precision, recall, F1: Applied to automatic detector predictions using human judgments as ground truth.
  • Likert scales: Used for multidimensional assessment (originality, feasibility, value, clinical accuracy, harm severity, etc.) (Yang et al., 25 Dec 2025, Zuo et al., 2024).
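The rate and detector metrics above reduce to simple counting once human labels are fixed. A minimal sketch with hypothetical labels (True = hallucinated), not tied to any particular benchmark's scoring code:

```python
def detector_scores(human, detector):
    """Precision/recall/F1 of a detector's hallucination flags,
    treating human labels (True = hallucinated) as ground truth."""
    tp = sum(h and d for h, d in zip(human, detector))
    fp = sum((not h) and d for h, d in zip(human, detector))
    fn = sum(h and (not d) for h, d in zip(human, detector))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

human    = [True, True, False, False, True, False]
detector = [True, False, False, True, True, False]
hall_rate = sum(human) / len(human)   # HallRate from human labels
p, r, f1 = detector_scores(human, detector)
print(hall_rate, round(p, 2), round(r, 2), round(f1, 2))
# → 0.5 0.67 0.67 0.67
```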

Automatic metrics (compared to human judgment):

  • n-gram overlap (ROUGE-L, SacreBLEU, Knowledge-F1): LCS-based and n-gram precision/recall overlaps (Kulkarni et al., 25 Apr 2025).
  • Semantic similarity (BERTScore, Knowledge-BERTScore): contextual embedding similarity, BERTScore = (1/|G|) Σ_{g∈G} max_{s∈S} cos(v(g), v(s)).
  • QA/entailment (Q², FactCC, SummaC): NLI-based span/classification/QA-pair consistency (Kulkarni et al., 25 Apr 2025, Cossio, 3 Aug 2025).
  • LLM judges (GPT-4, DeepSeek-v3 scoring): zero/few-shot prompted classification; ensembling via factor analysis (Kulkarni et al., 25 Apr 2025, Yang et al., 25 Dec 2025).
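The BERTScore-style similarity relies on greedy matching: each generated-token embedding is matched against its best cosine match among the source-token embeddings, and the maxima are averaged. A toy sketch with hand-made 2-d vectors; real BERTScore uses contextual transformer embeddings and optional IDF weighting, which this deliberately omits:

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_match_score(gen_vecs, src_vecs):
    """(1/|G|) * sum over generated tokens g of max over source
    tokens s of cos(v(g), v(s)) -- the averaging/max step of the
    BERTScore formula above."""
    return sum(max(cos(g, s) for s in src_vecs)
               for g in gen_vecs) / len(gen_vecs)

# Toy "embeddings": two generated tokens vs. two source tokens.
gen = [(1.0, 0.0), (0.6, 0.8)]
src = [(1.0, 0.0), (0.0, 1.0)]
print(greedy_match_score(gen, src))  # → 0.9
```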

Metric reliability: Standard metrics (BLEU, ROUGE, even BERTScore) display extremely low alignment with human hallucination labels in multi-dataset experiments (PRAUC ≈ 0.5, i.e., the random baseline); only GPT-4 as an LLM judge and carefully constructed ensemble metrics consistently achieve PRAUC > 0.7 (Kulkarni et al., 25 Apr 2025). Low correlation coefficients across metrics (Spearman ρ < 0.3) indicate divergent conceptual coverage of hallucinations.
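Inter-metric Spearman ρ can be checked directly from two metrics' scores over the same outputs (ranks are correlated rather than raw values, so monotone rescalings do not matter). A stdlib sketch with hypothetical scores illustrating a low correlation of the kind reported:

```python
def rank(xs):
    """1-based ranks, with average ranks assigned to ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for a tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores from two automatic metrics on five outputs.
rouge = [0.10, 0.40, 0.35, 0.80, 0.55]
judge = [0.20, 0.10, 0.90, 0.60, 0.70]
print(spearman_rho(rouge, judge))  # → 0.1 (well below 0.3)
```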

4. Task-Specific Protocols and Domain Extensions

Summarization and chart grounding: Strict definitions of value correctness (VC) and outside information presence (OIP) are employed: VC checks for chart-grounded factual accuracy; OIP identifies extrinsic hallucinations. Protocols involve ≥5 annotators per instance, recruited via Prolific, with χ² tests to establish the statistical significance of improvements (Islam et al., 2023).

Medical applications: MedHallBench uses board-certified clinical experts in multi-day calibration, applies structured rubrics (clinical accuracy, harm severity, hallucination confidence), and includes free-text justification fields. Automated metrics (e.g., ACHMI) are correlated against expert ratings (Pearson r ≈ 0.7), and reliability is reinforced via RLHF fine-tuning (Zuo et al., 2024).

Scientific and creative domains: HIC-Bench introduces a five-dimensional evaluation (Originality, Feasibility, Value, Scientific Plausibility, Factual Deviation), with clear heuristics for Intelligent Hallucinations (IH) (high on creativity, low on factual error) vs. Defective Hallucinations (DH) (factually damaged). Double annotation with subject experts, permutation testing, and LLM-blended judging are central (Yang et al., 25 Dec 2025).
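The permutation testing used to compare rating groups makes no distributional assumptions: the observed gap in mean ratings is compared against gaps arising from random relabelings. A minimal stdlib sketch with hypothetical 1–5 ratings, not the benchmark's actual analysis code:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means,
    e.g. between two groups' human Originality ratings."""
    rng = random.Random(seed)
    n_a = len(scores_a)
    observed = abs(sum(scores_a) / n_a - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled ratings
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm  # estimated p-value

# Hypothetical ratings for two groups of generated ideas.
group_ih = [5, 4, 5, 4, 5, 4]
group_dh = [2, 3, 2, 3, 2, 3]
print(permutation_test(group_ih, group_dh) < 0.05)  # prints True
```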

5. LLM-Judge Protocols and Human-AI Hybrid Evaluation

Recent studies converge on the use of LLMs—especially GPT-4 or similarly scaled models—as primary judges for hallucination detection, efficiently scaling label production and offering best-in-class empirical alignment with human judgments (PRAUC ≈ 0.74, F1 ≈ 0.9 in specific benchmarks) (Kulkarni et al., 25 Apr 2025, Pesiakhovsky et al., 26 Sep 2025).

Protocols for hybrid human-LLM evaluation in hallucination meta-benchmarks typically include:

  • LLMs generating candidate hallucination descriptions, followed by expert annotator filtering to improve recall and de-duplicate errors (≥30% increase in error coverage observed).
  • LLM-generated scores/labels are periodically adjudicated by domain experts (10–20% sampling), with all ambiguous or low-confidence data resolved via consensus.
  • Statistical agreement between LLM-as-judge and human annotation reaches ~0.9 precision/recall after calibration (Pesiakhovsky et al., 26 Sep 2025, Yang et al., 25 Dec 2025).
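The expert-sampling step above amounts to a routing policy: all low-confidence LLM labels go to experts, plus a random audit of the remainder. A hypothetical sketch of such a policy (the field names, threshold, and sampling rate are illustrative assumptions, not any published pipeline's interface):

```python
import random

def select_for_expert_review(items, sample_rate=0.15,
                             conf_threshold=0.6, seed=0):
    """Route LLM-judge outputs to expert adjudication: every
    low-confidence label, plus a random audit sample of the rest
    (mirroring 10-20% sampling of confident labels)."""
    rng = random.Random(seed)
    low_conf = [it for it in items if it["confidence"] < conf_threshold]
    rest = [it for it in items if it["confidence"] >= conf_threshold]
    k = max(1, round(sample_rate * len(rest)))
    audit = rng.sample(rest, k)
    return low_conf + audit

# 20 hypothetical LLM-judge labels with confidence scores.
items = [{"id": i,
          "label": "H" if i % 3 == 0 else "F",
          "confidence": 0.4 if i % 5 == 0 else 0.9}
         for i in range(20)]
reviewed = select_for_expert_review(items)
print(len(reviewed))  # → 6 (4 low-confidence + 2 audited)
```

Disagreements between experts and the LLM judge on the audited sample can then feed back into recalibration, per the consensus step described above.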

Best practices:

  • Double annotation and adjudication of ambiguous cases.
  • Free-form, natural-language error extraction for maximum representational coverage.
  • Explicit measurement and reporting of Cohen’s κ or equivalent IAA statistics.
  • Release of annotation manuals and raw labels for reproducibility.

6. Limitations, Open Problems, and Future Directions

Current methodologies reveal several critical limitations:

  • Metric misalignment: Most automatic metrics do not align with human judgment, and inter-metric correlations are low, reflecting the lack of standardization (Kulkarni et al., 25 Apr 2025).
  • Lack of taxonomy: Absence of a unified hallucination taxonomy introduces cross-study inconsistencies and limits the comparability of annotated benchmarks (Cossio, 3 Aug 2025).
  • Annotation and protocol drift: Without recurring calibration, inter-annotator agreement degrades, especially for high-fluency or subtle intrinsic hallucinations.
  • Challenge of extrinsic-correct errors: Both humans and LLM evaluators under-flag true but ungrounded “parametric knowledge” hallucinations, especially in information-rich or world-knowledge-dense outputs (Pesiakhovsky et al., 26 Sep 2025).
  • Scalability of domain expertise: In medical, legal, or other high-stakes domains, access to sufficient expert annotators is a persistent bottleneck despite validated calibration and scoring strategies (target κ ≥ 0.75) (Zuo et al., 2024).

Open research questions include:

  • Foundations for a cross-domain, taxonomy-aware annotation manual with high-agreement definitions.
  • Minimum protocol requirements (number of annotators, consensus steps, calibration frequency) for reproducible and robust hallucination labels.
  • Intrinsic evaluation of open-source LLM judges as calibration anchors compared to proprietary GPT-4 baselines.
  • Nonlinear dynamics in creativity-factuality tradeoffs, such as the documented “sweet spot” where both intelligent and factual outputs can be optimized simultaneously (Yang et al., 25 Dec 2025).

7. Recommendations and Synthesis

Recent empirical studies converge on several recommendations for rigorous human evaluation of hallucinations:

  • Deploy LLM-based judges as high-throughput, cost-effective labelers, but always anchor automated annotation pipelines with periodic expert review.
  • Prefer free-form, expressive error schemas for localization and granularity over entity-based or binary forms, especially in research and clinical settings.
  • Integrate quantitative (Likert, F1/precision/recall, rate) and qualitative (free-text, adjudication notes) metrics.
  • Systematically publish annotation guides, calibration protocols, raw labels, and IAA statistics to promote replicability and standardization.
  • In creative/scientific evaluation, jointly score for epistemic value and fidelity, as hallucinations may encode both “defective” and “intelligent” properties contingent on domain goals (Yang et al., 25 Dec 2025).

The field continues to progress toward truly reliable, scalable, and context-aware human evaluation protocols, predicated on rigorous definitions, calibrated annotation, multi-dimensional rating rubrics, and principled human-LLM hybrid pipelines (Kulkarni et al., 25 Apr 2025, Cossio, 3 Aug 2025, Yang et al., 25 Dec 2025, Islam et al., 2023, Pesiakhovsky et al., 26 Sep 2025, Zuo et al., 2024).
