
HDMBench: Fine-Grained Hallucination Dataset

Updated 19 March 2026
  • HDMBench is a large-scale dataset with 50,000 context–question–response triples, offering fine-grained sentence and phrase-level annotations for precise LLM hallucination detection.
  • It implements a rigorous three-class annotation taxonomy with high inter-annotator agreement, ensuring reliable labeling of context-based and common-knowledge hallucinations.
  • Baseline evaluations show the HDM-2 model outperforming GPT-4o and GPT-4o-mini in precision and F1, and the phrase-level labels support adaptive threshold tuning for enterprise and open-domain applications.

HDMBench is a large-scale, fine-grained evaluation dataset designed for benchmarking hallucination detection in LLM outputs, with a specific focus on distinguishing context-based and common-knowledge hallucinations, especially in enterprise and open-domain deployments (Paudel et al., 9 Apr 2025). It provides sentence- and phrase-level span annotations, enabling evaluation and training of models that require precise, token-level hallucination identification.

1. Dataset Composition and Source Coverage

HDMBench comprises approximately 50,000 context documents. Each document is paired with one question—generated using multiple prompt templates—and one LLM-generated response, yielding roughly 50,000 unique context–question–response triples. Each instance is annotated at both the sentence and phrase levels, leading to about 50,000 sentence-level and 200,000 phrase-level labeled spans.

Context sources are deliberately diverse:

  • RAGTruth contexts: Drawn from MS MARCO passages and CNN/Daily Mail summaries.
  • Enterprise support tickets: Sourced from structured Jira ticket dumps.
  • Standard QA corpora: Repurposed SQuAD passages.
  • Synthetic web text: Random samples from Red Pajama v2.

LLM responses span a wide model spectrum, including Mistral-7B, Qwen-2.5B, Mixtral-8×7B, Nous-Hermes, and others. Prompting styles interleave strict factuality with targeted “controlled hallucination injection” to generate a realistic spectrum of possible error modes. All content is in English, covering open-domain, news, enterprise, and synthetic language settings.

The dataset is randomly partitioned into 60% training ($\approx$30,000), 15% validation ($\approx$7,500), and 25% test ($\approx$12,500) splits. No public split existed previously; the random partition is intended to support unbiased evaluation of generalization (Paudel et al., 9 Apr 2025).
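A random 60/15/25 partition of this kind can be sketched as follows. This is a minimal illustration, not the authors' released split: the function name `split_indices`, the seed, and the use of `round` for the split sizes are all assumptions.

```python
import random


def split_indices(n, fractions=(0.60, 0.15, 0.25), seed=0):
    """Randomly partition range(n) into train/validation/test index lists.

    The seed and rounding rule are illustrative assumptions; the source
    only states that the split is random with 60/15/25 proportions.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for a fixed seed
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test


train, val, test = split_indices(50_000)
print(len(train), len(val), len(test))
```

Assigning the remainder to the test split guarantees that every instance lands in exactly one partition even when the fractions do not divide the dataset size evenly.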

2. Annotation Protocol and Taxonomy

Annotations are assigned per maximal phrase span within each sentence of the response, using a three-class schema:

  1. supported_by_context
  2. supported_by_general_knowledge
  3. hallucination

The annotation guidelines require annotators to consult both the provided context and their general world knowledge, and to justify label assignments with a brief, sentence-level rationale (e.g., “this fact does not appear in context nor is it widely known”). Trivially true statements (“innocuous”) are subsumed under supported_by_general_knowledge.
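One way to represent this schema in code is sketched below. The three label strings come from the taxonomy above; the class names, the character-offset span representation, and the rule that a sentence counts as hallucinated when any of its phrase spans is labeled `hallucination` are assumptions made for illustration, not the released data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SpanLabel(Enum):
    SUPPORTED_BY_CONTEXT = "supported_by_context"
    SUPPORTED_BY_GENERAL_KNOWLEDGE = "supported_by_general_knowledge"
    HALLUCINATION = "hallucination"


@dataclass
class PhraseSpan:
    start: int  # character offset into the sentence text, inclusive (assumed)
    end: int    # character offset, exclusive (assumed)
    label: SpanLabel


@dataclass
class AnnotatedSentence:
    text: str
    spans: List[PhraseSpan]
    rationale: str  # brief sentence-level justification from the annotator

    def is_hallucinated(self) -> bool:
        # Assumed aggregation rule: any hallucinated phrase taints the sentence.
        return any(s.label is SpanLabel.HALLUCINATION for s in self.spans)


# Hypothetical example instance, not taken from the dataset.
text = "The ticket was closed on Monday by the Mars office."
claim = "by the Mars office"
start = text.find(claim)
sentence = AnnotatedSentence(
    text=text,
    spans=[
        PhraseSpan(0, start - 1, SpanLabel.SUPPORTED_BY_CONTEXT),
        PhraseSpan(start, start + len(claim), SpanLabel.HALLUCINATION),
    ],
    rationale="The office location does not appear in context nor is it widely known.",
)
print(sentence.is_hallucinated())
```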

Quality assurance includes dual annotation of 10% of the dataset, with disagreements adjudicated by an expert reviewer. Inter-annotator agreement is high: Cohen’s $\kappa$ of 0.85 at the sentence level and 0.78 at the phrase level, indicating strong label reliability for downstream benchmarking (Paudel et al., 9 Apr 2025).
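For reference, Cohen's $\kappa$ compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences over the dually annotated items; the function name and the toy labels are illustrative assumptions, and the source does not describe the authors' own tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example with hypothetical labels.
a = ["supported_by_context", "supported_by_context", "hallucination", "hallucination"]
b = ["supported_by_context", "supported_by_context", "hallucination", "supported_by_context"]
print(cohens_kappa(a, b))
```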

3. Evaluation Metrics and Protocols

HDMBench supports both coarse (response-level) and fine-grained (span-level) hallucination detection benchmarking.

  • Response-level metrics: Precision ($P$), recall ($R$), and F1-score ($F_1$) are computed over hallucination classification.

$$P = \frac{|TP|}{|TP| + |FP|}, \quad R = \frac{|TP|}{|TP| + |FN|}, \quad F_1 = 2 \frac{P \cdot R}{P + R}$$

  • Token-level annotation accuracy: For ground-truth label $y_i \in \{0,1\}$ (hallucinated/not hallucinated) and prediction $\hat{y}_i$,

$$\text{Acc}_\text{token} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(\hat{y}_i = y_i)$$

  • Hallucination scoring: Both a context-based response-level score $h_s(c,r) \in [0,1]$ and token-level scores $\mathbf{h}_w(c,r) = [h_w^1, \dots, h_w^n] \in [0,1]^n$ are supported. Sentence-level aggregation functions ($f$) include maximum, average, and proportion above a threshold $\gamma$, with a sentence flagged by default if $f(\mathbf{h}_w^S) > t$ ($t=0.5$, $\gamma=0.2$).
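The sentence-level aggregation and flagging rule described in the last item can be sketched as below. The defaults for the flagging threshold and the proportion threshold follow the values stated above; the function names, the `method` strings, and the toy scores are assumptions for illustration.

```python
def aggregate(token_scores, method="max", gamma=0.2):
    """Aggregate the token-level hallucination scores of one sentence.

    method: "max", "mean", or "proportion" (share of tokens scoring above gamma).
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    if method == "max":
        return max(token_scores)
    if method == "mean":
        return sum(token_scores) / len(token_scores)
    if method == "proportion":
        return sum(s > gamma for s in token_scores) / len(token_scores)
    raise ValueError(f"unknown aggregation method: {method}")


def flag_sentence(token_scores, method="max", t=0.5, gamma=0.2):
    """Flag the sentence as hallucinated if the aggregated score exceeds t."""
    return aggregate(token_scores, method, gamma) > t


scores = [0.1, 0.3, 0.9]  # hypothetical token scores for one sentence
for method in ("max", "mean", "proportion"):
    print(method, flag_sentence(scores, method))
```

On these toy scores the maximum (0.9) and the proportion above $\gamma$ (2 of 3 tokens) both exceed 0.5, while the mean (about 0.43) does not, which shows how the choice of aggregation changes the strictness of flagging.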

This protocol facilitates both strict detection and nuanced threshold calibration at the token or phrase level.
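The response-level and token-level metrics of this section can be computed with a few lines of code. The sketch below assumes binary labels where 1 marks the hallucinated class (the positive class) and returns 0.0 when a denominator is zero; both conventions are assumptions rather than details taken from the released evaluation code.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with label 1 (hallucinated) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the ground-truth label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, not drawn from the dataset.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(precision_recall_f1(y_true, y_pred))
print(token_accuracy(y_true, y_pred))
```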

4. Baseline Results and Comparative Model Performance

HDMBench provides strong empirical baselines for both zero-shot and trained hallucination detectors, with comparative results reported for prominent models:

| Model | Precision (%) | Recall (%) | Balanced Acc. (%) | F1 (%) |
|---|---|---|---|---|
| Qwen-2.5 | 49.1 | 78.6 | 59.7 | 60.4 |
| GPT-4o | 68.4 | 51.4 | 67.1 | 58.7 |
| GPT-4o-mini | 68.4 | 49.9 | 66.6 | 57.7 |
| HDM-2 (ours) | 74.8 | 74.4 | 71.7 | 73.6 |

For common-knowledge hallucination detection, HDM-2 improves F1 and precision by 6–14 percentage points over GPT-4o, which highlights the value of multi-granular annotation for catching subtle errors that do not depend on the context. Phrase-level labeling supports token-level calibration, with up to a 5 percentage point F1 benefit over sentence-only models (Paudel et al., 9 Apr 2025).

HDMBench delivers balanced and fine-grained representation of context-based and common-knowledge hallucinations, enabling robust multi-task hallucination detector training. Error types frequently missed by zero-shot LLM-as-judge methods are systematically captured. The dataset is optimized for enterprise settings where hallucination error has operational impact, but also generalizes to open-domain evaluation.

Recommended applications include:

  • Benchmarking novel hallucination-detection systems across a wide range of question answering, enterprise, and synthetic domains.
  • Adaptive threshold tuning for token-level or span-level hallucination flagging.
  • Cross-lingual extension for multilingual enterprise corpora via the documented annotation protocol.
  • Training explainable detectors or enhancing LLM self-explanation modules using provided justifications.
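As an illustration of the adaptive threshold tuning mentioned in the list above, the sketch below sweeps candidate flagging thresholds over validation scores and keeps the one with the highest F1. The candidate grid, the function names, and the choice of F1 as the selection criterion are assumptions; a deployment might instead optimize precision at a fixed recall.

```python
def f1_at_threshold(scores, labels, t):
    """F1 of the rule 'flag if score > t', with label 1 meaning hallucinated."""
    preds = [1 if s > t else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the best validation F1.

    Ties are resolved in favor of the smallest candidate, because max()
    returns the first maximal element of the ascending grid.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical validation scores and labels.
val_scores = [0.1, 0.4, 0.6, 0.9]
val_labels = [0, 1, 1, 1]
best_t = tune_threshold(val_scores, val_labels)
print(best_t, f1_at_threshold(val_scores, val_labels, best_t))
```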

High inter-annotator agreement (Cohen’s $\kappa \geq 0.78$) demonstrates the reliability of the labeling framework and supports the extension or finetuning of hallucination-aware systems (Paudel et al., 9 Apr 2025).

6. Significance and Open Resource Availability

HDMBench addresses a critical evaluation gap for hallucination detection at both the context and general-knowledge levels, with structure and annotation standards enabling granular and explainable detection research. The dataset, together with model weights and evaluation code, is publicly released and directly supports research in computational efficiency, domain specialization, and fine-grained error identification for LLM output quality assurance in enterprise settings (Paudel et al., 9 Apr 2025).

References (1)
