Halu-J: Advanced Hallucination Detection

Updated 28 April 2026

Halu-J is a critique-based hallucination judge that integrates evidence selection and structured critique generation for fine-grained detection and explanation of LLM hallucinations.
It employs a multi-stage pipeline—including evidence categorization, reordering, and aggregation—to enhance fact-checking and improve label accuracy.
The model leverages combined supervised fine-tuning and direct preference optimization to achieve state-of-the-art performance on benchmarks like ME-FEVER.

Halu-J is a critique-based hallucination judge designed for fine-grained detection and explanation of hallucinations generated by LLMs. It combines a structured evidence selection mechanism with detailed critique generation, which supports both robust label prediction and interpretability in multi-evidence fact-checking. HALU-J has 7 billion parameters and is trained to explicitly analyze, select, and critique supporting or contradicting evidence. It outperforms comparable LLM-based systems, including GPT-4o, on the multi-evidence ME-FEVER benchmark and on FEVER, although GPT-4o remains stronger on ANLI, WANLI, HaluEval, and KBQA (Wang et al., 2024).

1. Architectural Foundations and Training

HALU-J is built upon the Mistral-7B-Instruct-v0.2 decoder-only transformer architecture, with no modifications to transformer blocks. Its improvements derive from domain-specific fine-tuning routines and a preference-based learning stage. The training pipeline comprises:

Supervised Fine-Tuning (SFT): Training on 1,952 synthetic "golden" critiques from the ME-FEVER dataset, using DeepSpeed ZeRO Stage 3, gradient checkpointing, FlashAttention, mixed bfloat16/TF32 precision, AdamW optimizer (β₁ = 0.9, β₂ = 0.95, weight_decay = 0.1), peak learning rate $1 \times 10^{-5}$ (cosine decay, 10 warmup steps), batch size 16, max sequence length 8192, over 20 epochs.
Direct Preference Optimization (DPO): Post-SFT optimization leverages DPO as formulated by Rafailov et al. (2023). For each $(x, y^+, y^-)$ triple—where $x$ is (claim + evidence), $y^+$ and $y^-$ are chosen and rejected critiques—DPO maximizes the log-probability gap:

$\mathcal{L}_{\rm DPO} = -\mathbb{E}_{(x,y^+,y^-)}\big[\log \sigma\big(\pi_\theta(y^+\mid x)-\pi_\theta(y^-\mid x)\big)\big]$

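where $\pi_\theta(y \mid x)$ denotes the model's log-probability of critique $y$ given input $x$, and $\sigma$ is the logistic sigmoid. This is a simplified form: the full objective of Rafailov et al. (2023) measures each log-probability relative to a frozen reference model and scales the gap by a temperature $\beta$. The DPO stage, run for 3 epochs at a learning rate of $1 \times 10^{-7}$, increases both label accuracy and critique quality in the ablation reported in Section 5.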
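
As a minimal sketch, the per-example loss can be evaluated directly from sequence log-probabilities. The function names and the default `beta=0.1` are illustrative assumptions and do not come from the Halu-J code. The first function follows the formula as written here; the second shows the reference-model form from Rafailov et al. (2023) for comparison.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simplified_dpo_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Loss as written in this section: -log sigma(log-prob gap)."""
    return -log_sigmoid(logp_chosen - logp_rejected)


def reference_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not reported in the text
) -> float:
    """DPO loss with a frozen reference model and temperature beta."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)
```

When the chosen and rejected critiques are equally likely, both functions return log 2; the loss falls as the model assigns relatively more probability to the chosen critique.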

2. Evidence Selection and Reasoning Pipeline

Rather than concatenating all retrieved documents, HALU-J employs a four-step pipeline for evidence selection and utilization:

Evidence Categorization: Each retrieved evidence $e_i \in E$ is labeled as either O (“completely irrelevant”), P (“partial relevance”), or T (“highly related”).
Evidence Reordering: Partition $E$ into $E_O$, $E_P$, and $E_T$, corresponding to the O, P, and T categories, and process them in that order.
Evidence Analysis: $E_O$ items are discarded. From $E_P$, only spans directly relevant to factual verification are considered. For $E_T$, claims are explicitly compared against the evidence to assess support or contradiction.
Aggregation and Critique Generation: The system synthesizes this analysis into a structured output containing a step-by-step critique and a factuality label. The classifier head predicts the hallucination label based on the final hidden state.

This structured approach enables more accurate and transparent hallucination detection, particularly in contexts where multiple competing or distracting evidence passages are present.
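
The reordering and filtering steps can be sketched in a few lines, assuming each passage already carries an O, P, or T tag. In Halu-J the model assigns these categories itself; here they are supplied by hand, and the `Evidence` class and function names are hypothetical.

```python
from dataclasses import dataclass

# Category tags from the text:
# O = completely irrelevant, P = partially relevant, T = highly related.
CATEGORY_ORDER = {"O": 0, "P": 1, "T": 2}


@dataclass
class Evidence:
    text: str
    category: str  # "O", "P", or "T" (hand-assigned in this sketch)


def reorder(evidence: list) -> list:
    """Stable sort so that O items come first, then P, then T."""
    return sorted(evidence, key=lambda e: CATEGORY_ORDER[e.category])


def select_for_analysis(evidence: list) -> list:
    """Discard O items; keep P and T items in processing order."""
    return [e for e in reorder(evidence) if e.category != "O"]
```

Because the sort is stable, passages within the same category keep their retrieval order.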

3. Critique Formulation and Training Objectives

HALU-J outputs a two-field Python dictionary comprising a detailed step-by-step critique ("reasoning") and a discrete factuality label ("factuality" ∈ {True, False, Neutral}). During SFT, model training maximizes the likelihood of generating the correct critique sequence while minimizing cross-entropy loss on the label:

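$\mathcal{L}_{\rm SFT} = -\sum_{t} \log P_\theta(r_t \mid c, E, r_{<t}) - \log P_\theta(\ell \mid c, E, r)$

where $c$ denotes the input claim and $E$ the evidence set, $r$ is the target critique with tokens $r_t$, and $\ell$ is the factuality label. The DPO stage then further trains the model to rank preferred critiques above rejected ones.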

Unlike most competing LLMs, HALU-J remains robust when a rigid output format is imposed. This is attributed to its explicit training on structured critique-label outputs, which allows reliable downstream parsing and interpretation.
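
Downstream parsing of such an output might look like the following sketch. It assumes the model returns the dictionary as a Python literal string with exactly the two keys named above; the function name and error messages are hypothetical.

```python
import ast

ALLOWED_LABELS = {"True", "False", "Neutral"}
EXPECTED_KEYS = {"reasoning", "factuality"}


def parse_judgement(raw: str) -> dict:
    """Parse a critique dictionary and validate its two fields."""
    try:
        # literal_eval only evaluates literals, so it never runs arbitrary code.
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"output is not a Python literal: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a dictionary")
    if set(parsed) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(map(str, parsed))}")
    # Accept both the bool True and the string "True" as labels.
    label = str(parsed["factuality"])
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown factuality label: {label}")
    return {"reasoning": str(parsed["reasoning"]), "factuality": label}
```

A caller can treat any `ValueError` as a formatting failure, which keeps format errors separate from wrong-label errors during evaluation.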

4. Dataset Design: ME-FEVER

ME-FEVER was constructed on top of FEVER (Thorne et al., 2018), using GPT-4-Turbo to synthesize the complex, diverse evidence sets required for multi-evidence hallucination detection. Each instance consists of:

2 “completely irrelevant” passages (random FEVER pages)
4 “partially irrelevant” passages (same topic, not addressing the claim)
1–3 “highly related” passages (including one ground-truth and 1–2 misleading, claim-consistent but confusing distractors)

The dataset includes 3,901 instances (2,663 train, 1,238 test). This design enables fine-grained evaluation of models' abilities to navigate noise and selectively utilize supporting or contradictory evidence.
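
The instance composition can be expressed as a small validation sketch. The class and field names are hypothetical and do not reflect the released dataset schema; only the passage counts and split sizes come from the description above.

```python
from dataclasses import dataclass

TRAIN_SIZE, TEST_SIZE = 2663, 1238  # split sizes stated in the text


@dataclass
class MEFeverInstance:
    claim: str
    irrelevant: list  # "completely irrelevant" passages
    partial: list     # same-topic passages that do not address the claim
    related: list     # ground-truth passage plus optional distractors

    def validate(self) -> None:
        """Check the passage counts against the stated composition."""
        if len(self.irrelevant) != 2:
            raise ValueError("expected 2 completely irrelevant passages")
        if len(self.partial) != 4:
            raise ValueError("expected 4 partially irrelevant passages")
        if not 1 <= len(self.related) <= 3:
            raise ValueError("expected 1 to 3 highly related passages")

    def num_passages(self) -> int:
        self.validate()
        return len(self.irrelevant) + len(self.partial) + len(self.related)
```

Under this composition every instance carries between 7 and 9 passages, and the two splits sum to the stated 3,901 instances.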

5. Benchmarking and Quantitative Evaluation

5.1 Label Prediction Accuracy

A comprehensive comparison (Table 1) demonstrates HALU-J’s state-of-the-art multi-evidence accuracy:

Model	ME-FEVER	FEVER	ANLI	WANLI	HaluEval	KBQA
GPT-3.5-Turbo	0.81	0.87	0.47	0.47	0.59	0.69
GPT-4o	0.83	0.88	0.74	0.60	0.81	0.84
Mistral-7B-Instruct-v0.2	0.78	0.82	0.62	0.54	0.57	0.68
Llama-2-13B-Chat	0.13	0.37	0.27	0.29	0.24	0.19
Llama-3-8B-Instruct	0.63	0.03	0.02	0.00	0.01	0.20
Qwen1.5-7B-Chat	0.49	0.79	0.68	0.53	0.61	0.69
HALU-J (w/o DPO)	0.90	0.90	0.69	0.54	0.65	0.76
HALU-J (with DPO)	0.91	0.90	0.70	0.54	0.65	0.76

5.2 Critique Quality and Evidence-Match

Model	Critique Score	Evidence-Match Rate
GPT-3.5-Turbo	72.35	59.29%
GPT-4o	85.85	61.43%
Mistral-7B-Instruct	61.30	51.22%
Llama-2-13B-Chat	45.20	40.86%
Llama-3-8B-Instruct	76.15	47.57%
Qwen1.5-7B-Chat	66.40	52.32%
HALU-J (w/o DPO)	82.60	66.89%
HALU-J (with DPO)	83.90	68.11%

5.3 DPO Ablation

Variant	ME-FEVER Acc	ANLI Acc	Critique Score	Evidence-Match
HALU-J (w/o DPO)	0.90	0.69	82.60	66.89%
HALU-J (with DPO)	0.91	0.70	83.90	68.11%
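
The gains attributable to DPO follow from simple differences between the two rows. The sketch below recomputes them; the dictionary keys are hypothetical names for the table columns.

```python
ABLATION = {
    "without_dpo": {
        "me_fever_acc": 0.90,
        "anli_acc": 0.69,
        "critique_score": 82.60,
        "evidence_match_pct": 66.89,
    },
    "with_dpo": {
        "me_fever_acc": 0.91,
        "anli_acc": 0.70,
        "critique_score": 83.90,
        "evidence_match_pct": 68.11,
    },
}


def dpo_deltas(table: dict) -> dict:
    """Return with-DPO minus without-DPO for each metric, rounded to 2 places."""
    base, dpo = table["without_dpo"], table["with_dpo"]
    return {name: round(dpo[name] - base[name], 2) for name in base}
```

Calling `dpo_deltas(ABLATION)` gives +0.01 accuracy on both ME-FEVER and ANLI, +1.3 critique-score points, and +1.22 percentage points of evidence-match rate.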

Imposing a rigid JSON format benefits some LLMs but degrades others; HALU-J is robust due to its format-focused SFT regimen.

6. Impact, Limitations, and Outlook

HALU-J establishes a new state of the art for multiple-evidence hallucination detection, achieving 0.91 accuracy on ME-FEVER against 0.83 for GPT-4o. The multi-stage evidence handling pipeline yields interpretable, high-quality critiques, while DPO fine-tuning brings small but consistent gains in accuracy (+0.01 on ME-FEVER and ANLI), critique quality (+1.3 points), and evidence-match rate (+1.2 percentage points).

Limitations include a primary focus on commonsense or information-seeking hallucinations, with computational or numerical error detection not addressed. Single-evidence performance offers room for future enhancement. The open-source release of HALU-J and ME-FEVER provides a valuable resource for advancing research in fact-checking, critique generation, and preference-based evaluation of LLMs (Wang et al., 2024).


References (1)

Halu-J: Critique-Based Hallucination Judge (2024)
