HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Published 4 May 2026 in cs.CL | (2605.02443v1)

Abstract: LLMs have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.


Authors (1)

Ahmed Cherif

Summary

The paper introduces HalluScore and Adaptive Detection Routing (ADR) to benchmark hallucination detection in instruction-following LLMs.
It systematically compares six detection strategies over 72 configurations, evaluating cost-quality trade-offs and domain-specific performance.
Findings highlight entailment-based (NLI) methods as most robust, paving the way for efficient, post-deployment quality monitoring.

Authoritative Summary of "HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs" (2605.02443)

Introduction and Motivation

LLMs demonstrate strong performance across a wide range of language tasks but are fundamentally susceptible to various forms of "hallucination," where outputs are factually incorrect, unfaithful to the context, or fail to adhere to instructions. This problem has substantial ramifications in high-stakes domains such as scientific research, open-domain QA, and commonsense reasoning. While ongoing work has tackled hallucination detection and mitigation, existing benchmarks largely suffer from narrow scope, lacking systematic, principled comparison across detection methods, model architectures, and domains.

HalluScan Framework Design

HalluScan is introduced as a robust benchmark to systematically evaluate hallucination detection and mitigation in instruction-following LLMs. The framework's coverage is comprehensive, spanning 72 unique configurations (6 detection methods × 4 instruction-tuned model families × 3 domains). Key technical contributions include:

HalluScore: A composite metric, computed as a geometric mean of factual error rate, semantic coherence, and fabrication rate, that reaches a moderate Pearson correlation with expert human judgments ($r = 0.41$).
Adaptive Detection Routing (ADR): A cost-aware selection algorithm for hallucination detection methods, leveraging input characteristics to achieve $2.0\times$ cost savings with negligible AUROC loss (0.1% degradation).
Error Cascade Decomposition: Fine-grained analysis distinguishing error sources (knowledge gaps, reasoning failures, instruction misalignment) and their prevalence across domains.
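
The summary does not give the exact HalluScore formula. As a minimal sketch, assuming each component is first mapped to a "higher is better" value in [0, 1] (so the error and fabrication rates are inverted) and that the three components are weighted equally, a geometric-mean composite could look like this. The function name `hallu_score` and the inversion step are illustrative assumptions, not the paper's definition.

```python
import math


def hallu_score(factual_error_rate: float,
                semantic_coherence: float,
                fabrication_rate: float) -> float:
    """Hypothetical composite: geometric mean of three quality components in [0, 1]."""
    components = [
        1.0 - factual_error_rate,  # assumption: invert so that higher is better
        semantic_coherence,
        1.0 - fabrication_rate,    # assumption: invert so that higher is better
    ]
    for c in components:
        if not 0.0 <= c <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    # Equal-weight geometric mean; it is zero whenever any component is zero.
    return math.prod(components) ** (1.0 / len(components))
```

One property of this choice is that a single very poor component pulls the score down more sharply than it would under an arithmetic mean, which is a common reason to prefer a geometric mean for composite quality metrics.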

Hallucination Taxonomy

The paper operationalizes hallucinations along three axes:

Factual Hallucination: Contradiction of established world knowledge.
Faithfulness Hallucination: Divergence from, or contradiction of, the source context.
Instruction Hallucination: Violation of explicit prompt constraints or task requirements.

Each of these error types poses unique detection challenges, motivating the multi-method benchmarking approach.

Detection Methods Evaluated

HalluScan implements and systematically compares six major hallucination detection strategies:

Self-Consistency (SC): Assesses inter-response agreement among multiple generations.
Self-Evaluation (SE): Models introspectively score the factual accuracy of their own outputs.
Semantic Entropy (SemE): Computes uncertainty over semantic clusters of generated responses.
LLM-as-Judge (Judge): Secondary LLMs rate the factuality and faithfulness of outputs.
Natural Language Inference (NLI): Pre-trained entailment models verify that generated claims are supported by input/evidence.
Retrieval-Augmented Verification (RAV): Decomposes output into claims, retrieves external evidence, and applies NLI for fact-checking.
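
The two sampling-based scores in the list above (SC and SemE) can be sketched as follows. This assumes the sampled responses have already been grouped into meaning clusters, with each response represented by a cluster label. How the clustering is done is outside this sketch, and the function names are illustrative rather than taken from the paper's code.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def self_consistency(cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of sampled responses that fall in the most common cluster."""
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_ids)
    return max(counts.values()) / len(cluster_ids)


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Low agreement or high entropy suggests the model is uncertain about the answer, which these methods treat as a signal of possible hallucination.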

Experimental Setup

The benchmark evaluates four open-weight, instruction-tuned LLM families (Llama-3.1-8B, Llama-4-Scout-17B, Qwen3-32B, GPT-OSS-20B) across three representative domains: Scientific (TruthfulQA), Open-Domain QA (Natural Questions), and Commonsense (ARC-Challenge), each sampled with 8 representative queries.
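
For orientation, the configuration grid can be enumerated directly from the names given in this summary. The total number of scored items is derived here under the assumption that every configuration scores all 8 queries of its domain; the summary does not state that total.

```python
from itertools import product

METHODS = ["SC", "SE", "SemE", "Judge", "NLI", "RAV"]
MODELS = ["Llama-3.1-8B", "Llama-4-Scout-17B", "Qwen3-32B", "GPT-OSS-20B"]
DOMAINS = ["Scientific", "Open-Domain QA", "Commonsense"]
QUERIES_PER_DOMAIN = 8

# One configuration per (detection method, model family, domain) triple.
configs = list(product(METHODS, MODELS, DOMAINS))
n_configs = len(configs)

# Assumption: each configuration scores every query in its domain.
n_scored_items = n_configs * QUERIES_PER_DOMAIN
```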

Metrics include AUROC (primary), F1, precision, recall, Expected Calibration Error (ECE), and detection latency. All experiments utilize both fast local and API-based inference for realistic cost measurement.
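
As a reference for the two headline metrics, here is a dependency-free sketch of AUROC (via pairwise comparison of hallucinated and non-hallucinated items) and a binned calibration error. The ECE variant shown compares the predicted hallucination probability with the observed hallucination rate in each bin; the paper's exact binning scheme is not given in this summary, so `n_bins=10` and equal-width bins are assumptions.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive item scores above a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one item of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels: Sequence[int],
                               probs: Sequence[float],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between predicted probability and observed positive rate."""
    if not labels:
        raise ValueError("need at least one item")
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    ece = 0.0
    for b in bins:
        if b:
            observed = sum(y for y, _ in b) / len(b)
            predicted = sum(p for _, p in b) / len(b)
            ece += len(b) / len(labels) * abs(observed - predicted)
    return ece
```

With only 8 queries per domain, AUROC is coarse: for example, with 4 hallucinated and 4 correct items there are 16 positive-negative pairs, so the value can only move in steps of 1/32 (half a pair when scores tie).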

Key Findings and Quantitative Results

NLI Verification demonstrates the strongest, most robust detection (mean AUROC 0.88; perfect AUROC 1.00 in the scientific domain), outperforming all other methods with effect sizes of up to Cohen's $d = 1.82$.
RAV is the second most effective (mean AUROC 0.66), particularly in open-domain and scientific tasks, reflecting the importance of evidence retrieval.
Local-only methods (SC, SemE) are computationally efficient but limited (mean AUROCs 0.56 and 0.45, respectively).
Domain Effects: Scientific queries are most amenable to detection (mean AUROC 0.67), while commonsense tasks are most challenging (mean AUROC 0.51, often near chance).
ADR achieves a $2.0\times$ reduction in average computational cost (from 18.5s to 9.4s/query) while preserving high detection quality (AUROC 0.85 vs. 0.88 for NLI), by routing low-risk queries to fast local methods.
Calibration: NLI-based methods are best calibrated (ECE 0.185), whereas LLM-as-Judge can be severely miscalibrated (ECE 0.466).
Cross-Domain Transfer: NLI-based detectors show the best generalization, especially when thresholds are trained on open-domain data.
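
The routing idea behind ADR can be sketched as a two-stage cascade. The summary says only that low-risk queries are routed to fast local methods; the risk signal, the threshold value, and all names below are hypothetical, and the paper's actual routing rule may differ. Only the 18.5s and 9.4s averages come from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Routed:
    score: float   # hallucination score returned to the caller
    method: str    # which detector produced the final score
    cost_s: float  # seconds spent on this query


def route(query: str,
          cheap: Callable[[str], float],
          expensive: Callable[[str], float],
          cheap_cost_s: float,
          expensive_cost_s: float,
          risk_threshold: float = 0.3) -> Routed:
    """Run the cheap detector first; escalate only when its score looks risky."""
    risk = cheap(query)
    if risk < risk_threshold:
        # Low risk: accept the cheap detector's verdict.
        return Routed(risk, "cheap", cheap_cost_s)
    # Higher risk: pay for the expensive detector on top of the cheap pass.
    return Routed(expensive(query), "expensive", cheap_cost_s + expensive_cost_s)


# Average per-query latencies reported in the summary.
speedup = 18.5 / 9.4
```

The ratio 18.5 / 9.4 is about 1.97, which matches the reported $2.0\times$ after rounding. In a cascade like this, the achievable saving depends on how many queries fall below the threshold and on how cheap the first stage is relative to the second.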

Implications and Theoretical/Practical Value

The study's comprehensive configuration sweep enables nuanced recommendations for developers and researchers:

Entailment-based detection (NLI) generalizes best across tasks and domains; it should be the default unless computational constraints dominate.
Retrieval-based verification (RAV) is preferable when evidence is easily accessible and high precision is necessary.
Cost-aware cascades (ADR) can scale hallucination monitoring to real-world LLM deployments, where latency and API cost are bottlenecks.
Domain specificity critically affects detectability; commonsense reasoning remains unsolved, with even state-of-the-art detectors only marginally outperforming random baselines.
Practically, HalluScan and HalluScore can function as continuous quality monitoring tools, supporting post-deployment QA, model validation, and compliance auditing.

Limitations

Small per-domain sample size (N=8) introduces uncertainty in fine-grained rankings.
All detection evaluation is English-centric; results may not generalize to multilingual or low-resource settings.
The largest model evaluated has 32B parameters; scaling trends for very large LLMs (70B+) are not addressed.

Speculation on Future Developments

Further research may extend HalluScan to:

Visual and multimodal LLMs, incorporating vision-language hallucination detection.
Multilingual setups, with emphasis on adaptation of NLI verification and evidence retrieval to low-resource languages.
Agentic hallucination, as LLMs are embedded in tool-using agents, necessitating detection of off-policy or environment-level errors.
Continual, online detection adaptation to evolving model outputs and system requirements.
Theoretical analysis of the inherent limitations of current detection techniques relative to evolving LLM architectures.

Conclusion

HalluScan establishes a principled and exhaustive benchmark for hallucination detection and mitigation in instruction-following LLMs, offering a robust comparative evaluation across models, methods, and domains. The strong, consistent performance of entailment-based approaches (NLI), the effective cost-quality trade-offs enabled by adaptive routing, and the moderate alignment of the composite metric (HalluScore) with human judgments collectively advance practical and theoretical understanding of LLM hallucination management. The released codebase (2605.02443) provides a reproducibility baseline and a testbed for method development and evaluation in future research.
