HDMBench Dataset Overview
- HDMBench is a comprehensive, annotated benchmark dataset designed to evaluate hallucination detection in LLM outputs across diverse enterprise scenarios.
- It comprises 50,000 contextual documents sourced from varied datasets with detailed token-level and sentence-level annotations for nuanced error analysis.
- The evaluation framework employs metrics like precision, recall, and F1 scores to rigorously benchmark LLM performance in practical, context-rich deployments.
HDMBench is a large-scale, meticulously annotated benchmark dataset developed for evaluating hallucination detection in LLM outputs, particularly within enterprise contexts. It is engineered to address the deficiencies of prior benchmarks by encompassing the full breadth of hallucination phenomena encountered in practical deployments—specifically, context-based, common knowledge, enterprise-specific, and innocuous statements. HDMBench’s design enables the development and rigorous assessment of systems that require fine-grained error localization and validation of generated content with respect to both source context and world knowledge.
1. Dataset Construction and Structure
HDMBench comprises approximately 50,000 contextual documents, curated from heterogeneous sources to ensure realistic and domain-relevant scenarios:
- Context Sources
- Extracts from RAGTruth (itself built from MS MARCO, CNN/Daily Mail, etc.)
- Enterprise support tickets from platforms such as Jira
- Curated MS MARCO passages with variable context lengths
- SQuAD-derived questions used as pseudo-contexts
- Samples from RedPajama v2, a large-scale English corpus
For each document, multiple prompts (detail-oriented, summarization, and information-seeking) were generated to simulate diverse query styles. Responses were constructed via a pipeline that drew on eight model variants (ranging from 2B to over 7B parameters), intentionally introducing varied response styles and error types.
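To make the prompt-variation step concrete, the snippet below sketches how a single contextual document could be expanded into the three query styles just mentioned. The template strings, style names, and the `build_prompts` helper are assumptions made for this overview, not the exact prompts or code used to construct HDMBench.

```python
# Illustrative sketch only: the templates, style names, and build_prompts helper
# are assumptions for this overview, not HDMBench's actual generation pipeline.
PROMPT_TEMPLATES = {
    "detail_oriented": "Using only the document below, answer in detail: {question}\n\n{document}",
    "summarization": "Summarize the key points of the following document:\n\n{document}",
    "information_seeking": "Based on the document, briefly answer: {question}\n\n{document}",
}

def build_prompts(document: str, question: str) -> dict[str, str]:
    """Return one prompt per query style for a single contextual document."""
    return {
        style: template.format(document=document, question=question)
        for style, template in PROMPT_TEMPLATES.items()
    }

prompts = build_prompts("Acme Corp's Q3 revenue was $12M...", "What was Q3 revenue?")
print(list(prompts))  # ['detail_oriented', 'summarization', 'information_seeking']
```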
The annotation protocol is highly granular:
- Label Taxonomy
- Supported by context: Information verifiable against input documents
- Supported by general knowledge: Widely recognized facts not contained in the context
- Hallucinated: False, unsupported, or fabricated content
Annotations are provided at the sentence and phrase level, including explicit span identification and accompanying reasoning justifications. Each response therefore carries both categorical labels and precise token-level (span) assignments, enabling holistic as well as atomistic evaluation of hallucination phenomena.
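The record below is a minimal sketch of what one annotated response might look like under this protocol. The field names, character offsets, and example content are assumptions for illustration, not HDMBench's published schema.

```python
# Hypothetical HDMBench-style annotation record; field names and values are
# illustrative assumptions, not the dataset's official schema.
example_record = {
    "context": "Acme Corp's Q3 revenue was $12M, up 8% year over year.",
    "prompt": "Summarize the company's Q3 performance.",
    "response": "Acme Corp reported $12M in Q3 revenue. Revenue grew 20% year over year.",
    "annotations": [
        {
            "span": [0, 38],                  # end-exclusive character offsets into the response
            "label": "supported_by_context",  # verifiable against the input document
            "reasoning": "The $12M Q3 revenue figure appears in the context.",
        },
        {
            "span": [39, 71],
            "label": "hallucinated",          # contradicts the context's stated 8% growth
            "reasoning": "The context states 8% growth, not 20%.",
        },
    ],
}
```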
2. Evaluation Metrics and Aggregation Methodology
Assessment on HDMBench uses established and rigorous metrics, tailored for both overall and fine-grained analysis:
- Precision, Recall, F1 Score
- Precision: Fraction of flagged hallucinated content that is truly hallucinated
- Recall: Fraction of actual hallucinations correctly identified
- F1: Harmonic mean of precision and recall
Metrics are reported individually for key tasks: Question Answering (QA), Data-to-Text, and Summarization.
- Balanced Accuracy
- Applied particularly to common-knowledge hallucination detection, this metric mitigates the impact of class imbalance by equally weighting the true positive and true negative rates, i.e., balanced accuracy $= (\mathrm{TPR} + \mathrm{TNR})/2$.
- Token-level and Sentence-level Aggregation
- The HDM-2 framework produces a token-level hallucination score $s_i$ for each token.
- A configurable aggregation function $g(\cdot)$ (for example, max, average, or the proportion of scores above a cutoff) combines the token-level scores within a sentence.
- A candidate sentence is flagged as hallucinated if its aggregated score meets a predefined threshold $\tau$; formally, $\hat{y} = \mathbb{1}\left[g(s_1, \dots, s_n) \geq \tau\right]$, as illustrated in the code sketch below.
This hierarchical measurement approach improves interpretability and supports localized error attribution within generated text.
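As a concrete illustration of this hierarchy, the sketch below aggregates per-token scores into sentence-level flags and then scores those flags with precision, recall, and F1. The function names (`aggregate`, `flag_sentence`, `precision_recall_f1`), the default threshold of 0.5, and the toy scores are assumptions for exposition, not HDM-2's actual interface or configuration.

```python
from statistics import mean

# Minimal sketch of the Section 2 pipeline: aggregate per-token hallucination
# scores into sentence-level flags, then evaluate the flags with precision,
# recall, and F1. Names and thresholds are illustrative assumptions.

def aggregate(token_scores: list[float], method: str = "mean") -> float:
    """Combine token-level scores s_i into a single sentence-level score g(s_1, ..., s_n)."""
    if method == "max":
        return max(token_scores)
    if method == "mean":
        return mean(token_scores)
    if method == "proportion":
        # Fraction of tokens whose score exceeds an assumed inner cutoff of 0.5.
        return sum(s > 0.5 for s in token_scores) / len(token_scores)
    raise ValueError(f"unknown aggregation method: {method}")

def flag_sentence(token_scores: list[float], tau: float = 0.5, method: str = "mean") -> bool:
    """Flag a sentence as hallucinated when g(s_1, ..., s_n) >= tau."""
    return aggregate(token_scores, method) >= tau

def precision_recall_f1(pred: list[bool], gold: list[bool]) -> tuple[float, float, float]:
    """Standard precision/recall/F1 over sentence-level hallucination flags."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: token scores for three candidate sentences and their gold labels.
token_scores = [[0.1, 0.2, 0.1], [0.7, 0.9, 0.8], [0.4, 0.6, 0.5]]
gold = [False, True, True]
pred = [flag_sentence(s, tau=0.5) for s in token_scores]
print(pred, precision_recall_f1(pred, gold))  # [False, True, True] -> (1.0, 1.0, 1.0)
```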
3. Comparative Positioning Among Existing Datasets
HDMBench is distinguished from benchmarks such as RAGTruth and TruthfulQA in both scope and functional design:
| Dataset | Main Focus | Annotation Granularity |
|---|---|---|
| HDMBench | Enterprise-oriented; context + knowledge; multi-type hallucination taxonomy | Token-level and sentence-level; multi-class labels with reasoning |
| RAGTruth | Context adherence in retrieval-augmented generation | Coarse sentence- or passage-level |
| TruthfulQA | Common misconceptions; high-level static errors | Coarse question-level |
TruthfulQA focuses primarily on model misconceptions and isolated, high-level errors, while RAGTruth evaluates adherence to context in retrieval-augmented systems. HDMBench, in contrast, offers a taxonomy spanning context-based, common-knowledge, enterprise-specific, and innocuous categories with balanced representation and token-level annotation, corresponding more directly to enterprise requirements and supporting nuanced system evaluation.
4. Reported Model Performance on HDMBench
The HDM-2 model—a multi-task architecture integrating context verification and knowledge validation—serves as a primary baseline and demonstrates robust performance on HDMBench:
- Common-Knowledge Hallucination Detection
- Precision: ~74.8%
- Recall: 74.4%
- F1 Score: 73.6%
- Token and Sentence-Level Analysis
- HDM-2 outputs an overall hallucination score per response together with token-level scores $s_i$ for individual tokens.
- Aggregation functions and thresholding enable comparative, quantitative performance measurement.
- Parameter Efficiency
- On context-based hallucination detection (e.g., RAGTruth), HDM-2 (3B parameters) simultaneously achieves high precision and F1 scores, outperforming larger models such as Qwen, GPT-4o, and GPT-4o-mini on specific metrics.
Reported figures and benchmarking indicate that HDMBench supports the development and differentiation of hallucination detection approaches at both aggregate and fine-grained levels.
5. Accessibility and Reproducibility
HDMBench is publicly available for research and benchmarking purposes. Both the dataset and the associated HDM-2 model weights, as well as inference code, are released under open terms, facilitating transparent comparison and further research. Resources can be accessed at https://github.com/aimonlabs/hallucination-detection-model.
Direct access to all components provides an extensible experimental substrate for the enterprise LLM reliability community, advancing the state of hallucination detection through reproducibility and shared standards.
6. Significance for Enterprise and Research Applications
HDMBench represents a methodological advance in the evaluation of LLM outputs by aligning its taxonomy and task design with actual enterprise scenarios, including both explicit context verification and implicit knowledge validation. Its comprehensive coverage of diverse hallucination types, rigorous annotation regime, and support for detailed metric analysis address both academic benchmarking and practical system deployment requirements.
A plausible implication is that the dataset will facilitate the development of LLM-based systems with higher degrees of factual reliability, contextual adherence, and explainability, particularly in industry domains where risk tolerance for inaccurate content is low. Its structure also enables future extensions, such as additional enterprise domains or evolving error typologies, without sacrificing comparability or granularity.