Halu-J: Advanced Hallucination Detection
- Halu-J is a critique-based hallucination judge that integrates evidence selection and structured critique generation for fine-grained detection and explanation of LLM hallucinations.
- It employs a multi-stage pipeline—including evidence categorization, reordering, and aggregation—to enhance fact-checking and improve label accuracy.
- The model leverages combined supervised fine-tuning and direct preference optimization to achieve state-of-the-art performance on benchmarks like ME-FEVER.
Halu-J (also written HALU-J) is a critique-based hallucination judge designed for fine-grained detection and explanation of hallucinations generated by LLMs. It integrates a structured evidence selection mechanism with detailed critique generation, which supports both robust label prediction and interpretability in multi-evidence fact-checking. HALU-J has 7 billion parameters and is trained to explicitly analyze, select, and critique supporting or contradicting evidence. It outperforms comparable LLM-based systems, including GPT-4o, on ME-FEVER and FEVER, with the largest margin where reasoning over multiple passages is required (Wang et al., 2024).
1. Architectural Foundations and Training
HALU-J is built upon the Mistral-7B-Instruct-v0.2 decoder-only transformer architecture, with no modifications to transformer blocks. Its improvements derive from domain-specific fine-tuning routines and a preference-based learning stage. The training pipeline comprises:
- Supervised Fine-Tuning (SFT): Training on 1,952 synthetic "golden" critiques from the ME-FEVER dataset for 20 epochs, using DeepSpeed ZeRO Stage 3, gradient checkpointing, FlashAttention, mixed bfloat16/TF32 precision, and the AdamW optimizer (β₁ = 0.9, β₂ = 0.95, weight_decay = 0.1) with a cosine-decay learning-rate schedule and 10 warmup steps, batch size 16, and a maximum sequence length of 8192.
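- Direct Preference Optimization (DPO): Post-SFT optimization uses DPO as formulated by Rafailov et al. (2023). For each triple $(x, y_w, y_l)$, where $x$ is the input (claim + evidence) and $y_w$ and $y_l$ are the chosen and rejected critiques, DPO maximizes the log-probability gap between the two critiques by minimizing:
$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)$$
where $\log \pi_\theta(y \mid x)$ denotes the model's log-probability of critique $y$, $\pi_{\mathrm{ref}}$ is the frozen reference model and $\beta$ the scaling coefficient of the Rafailov et al. formulation, and $\sigma$ is the logistic sigmoid. This stage, run for 3 epochs, increases both label accuracy and critique quality, as the DPO ablation in Section 5 shows.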
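As a minimal sketch of the preference objective described above, the per-triple DPO loss of Rafailov et al. (2023) can be computed from four sequence log-probabilities. The function names and the value `beta=0.1` are assumptions made for illustration; the source does not report the coefficient HALU-J used.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-triple DPO loss from summed sequence log-probabilities.

    policy_* are log-probs under the model being trained, ref_* under the
    frozen reference model. beta=0.1 is an assumed value, not from the source.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -log_sigmoid(chosen_reward - rejected_reward)


# No preference yet: the loss equals log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy already favors the chosen critique: the loss drops below log(2).
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss shrinks as the policy widens the gap between the chosen and rejected critique relative to the reference model, which is the behavior the DPO stage is meant to encourage.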
2. Evidence Selection and Reasoning Pipeline
Rather than concatenating all retrieved documents, HALU-J employs a four-step pipeline for evidence selection and utilization:
- Evidence Categorization: Each retrieved evidence passage is labeled as either O (“completely irrelevant”), P (“partial relevance”), or T (“highly related”).
- Evidence Reordering: Partition the evidence set $E$ into $E_O$, $E_P$, and $E_T$, corresponding to the O, P, and T categories, and process them in that order.
- Evidence Analysis: $E_O$ items are discarded. From $E_P$, only spans directly relevant to factual verification are considered. For $E_T$, claims are explicitly compared to assess support or contradiction.
- Aggregation and Critique Generation: The system synthesizes this analysis into a structured output consisting of a reasoning critique and a factuality label. The classifier head predicts the hallucination label based on the final hidden state.
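The reordering and filtering steps above can be sketched in a few lines. This is only an illustration of the data flow: in HALU-J the categorization is produced by the model itself, whereas here the O/P/T labels are assumed to be given, and the `Evidence` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    category: str  # "O" (irrelevant), "P" (partial), or "T" (highly related)


# Processing order described in the text: O first, then P, then T.
ORDER = {"O": 0, "P": 1, "T": 2}


def reorder(evidence: list[Evidence]) -> list[Evidence]:
    """Group evidence by category; the sort is stable, so the original
    retrieval order is preserved inside each group."""
    return sorted(evidence, key=lambda e: ORDER[e.category])


def select_for_analysis(evidence: list[Evidence]) -> tuple[list[Evidence], list[Evidence]]:
    """Discard O items and return the (P, T) groups for further analysis."""
    ordered = reorder(evidence)
    partial = [e for e in ordered if e.category == "P"]
    related = [e for e in ordered if e.category == "T"]
    return partial, related


retrieved = [
    Evidence("passage a", "T"),
    Evidence("passage b", "O"),
    Evidence("passage c", "P"),
    Evidence("passage d", "T"),
]
partial, related = select_for_analysis(retrieved)
```

Keeping the highly related group last means the passages that matter most for the verdict sit closest to the point where the critique is aggregated.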
This structured approach enables more accurate and transparent hallucination detection, particularly in contexts where multiple competing or distracting evidence passages are present.
3. Critique Formulation and Training Objectives
HALU-J outputs a two-field Python dictionary comprising a detailed step-by-step critique ("reasoning") and a discrete factuality label ("factuality" ∈ {True, False, Neutral}). During SFT, training maximizes the likelihood of the gold critique sequence while minimizing cross-entropy loss on the label:
$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T} \log p_\theta(r_t \mid r_{<t}, c, E) \;-\; \log p_\theta(\ell \mid r, c, E)$$
where $c$ denotes the input claim and $E$ the evidence set, $r = (r_1, \dots, r_T)$ is the gold critique, and $\ell$ is the gold label. The DPO stage further trains the model to rank chosen critiques above rejected ones.
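To make the two-part SFT objective concrete, the following sketch adds the negative log-likelihood of the gold critique tokens to the cross-entropy on the gold label. It assumes the per-token log-probabilities and the label distribution have already been read off the model; the function and argument names are hypothetical.

```python
import math


def sft_loss(critique_token_logprobs: list[float],
             label_probs: dict[str, float],
             gold_label: str) -> float:
    """Critique negative log-likelihood plus label cross-entropy.

    critique_token_logprobs: log-prob of each gold critique token given the
        claim, the evidence, and the preceding critique tokens.
    label_probs: predicted distribution over the factuality labels.
    """
    critique_nll = -sum(critique_token_logprobs)
    label_ce = -math.log(label_probs[gold_label])
    return critique_nll + label_ce


# Two critique tokens, each predicted with probability 0.5, and a gold
# label predicted with probability 0.25.
loss = sft_loss(
    [math.log(0.5), math.log(0.5)],
    {"True": 0.25, "False": 0.5, "Neutral": 0.25},
    "True",
)
```

Both terms are ordinary next-token losses when the label is emitted as part of the generated dictionary, so in practice they can be computed in a single pass over the target sequence.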
Unlike most competing LLMs, HALU-J stays robust when a rigid output format is imposed. This is attributed to its explicit training on structured critique-label outputs, which allows reliable downstream parsing and interpretation.
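Downstream parsing of the two-field dictionary might look like the sketch below. It assumes the raw model output is a Python dictionary literal with the keys "reasoning" and "factuality" named in the text; the helper `parse_judgment` is hypothetical and returns `None` for anything that does not match.

```python
import ast

VALID_LABELS = {"True", "False", "Neutral"}


def parse_judgment(raw: str):
    """Return (reasoning, label) or None if the output is malformed."""
    try:
        obj = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    reasoning = obj.get("reasoning")
    # The label may arrive as a bool (True/False) or a string ("Neutral");
    # normalize both to a string before validating.
    label = str(obj.get("factuality"))
    if not isinstance(reasoning, str) or label not in VALID_LABELS:
        return None
    return reasoning, label


ok = parse_judgment("{'reasoning': 'Evidence 3 contradicts the claim.', 'factuality': False}")
bad = parse_judgment("The claim is false because ...")
```

Using `ast.literal_eval` rather than `eval` keeps the parser from executing anything in the model output, and a `None` result gives the caller an explicit signal that the format was violated.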
4. Dataset Design: ME-FEVER
ME-FEVER was constructed atop FEVER (Thorne et al., 2018), using GPT-4-Turbo to synthesize the complex, diverse evidence sets required for multi-evidence hallucination detection. Each instance consists of:
- 2 “completely irrelevant” passages (random FEVER pages)
- 4 “partially irrelevant” passages (same topic, not addressing the claim)
- 1–3 “highly related” passages (including one ground-truth and 1–2 misleading, claim-consistent but confusing distractors)
The dataset includes 3,901 instances (2,663 train, 1,238 test). This design enables fine-grained evaluation of models' abilities to navigate noise and selectively utilize supporting or contradictory evidence.
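The instance composition and split sizes described above can be expressed as a small consistency check. The class and field names are hypothetical and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass

# Split sizes as stated in the text.
TRAIN_SIZE, TEST_SIZE, TOTAL_SIZE = 2663, 1238, 3901


@dataclass
class MEFeverInstance:
    claim: str
    label: str
    irrelevant: list[str]  # "completely irrelevant" passages
    partial: list[str]     # "partially irrelevant" passages
    related: list[str]     # "highly related" passages


def has_expected_composition(inst: MEFeverInstance) -> bool:
    """Check the per-instance passage counts given in the text."""
    return (
        len(inst.irrelevant) == 2
        and len(inst.partial) == 4
        and 1 <= len(inst.related) <= 3
    )


example = MEFeverInstance(
    claim="example claim",
    label="False",
    irrelevant=["i1", "i2"],
    partial=["p1", "p2", "p3", "p4"],
    related=["ground truth", "distractor"],
)
valid = has_expected_composition(example)
splits_add_up = TRAIN_SIZE + TEST_SIZE == TOTAL_SIZE
```

Under this composition each instance carries between seven and nine passages, most of which are noise.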
5. Benchmarking and Quantitative Evaluation
5.1 Label Prediction Accuracy
A comprehensive comparison (Table 1) demonstrates HALU-J’s state-of-the-art multi-evidence accuracy:
| Model | ME-FEVER | FEVER | ANLI | WANLI | HaluEval | KBQA |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 0.81 | 0.87 | 0.47 | 0.47 | 0.59 | 0.69 |
| GPT-4o | 0.83 | 0.88 | 0.74 | 0.60 | 0.81 | 0.84 |
| Mistral-7B-Instruct-v0.2 | 0.78 | 0.82 | 0.62 | 0.54 | 0.57 | 0.68 |
| Llama-2-13B-Chat | 0.13 | 0.37 | 0.27 | 0.29 | 0.24 | 0.19 |
| Llama-3-8B-Instruct | 0.63 | 0.03 | 0.02 | 0.00 | 0.01 | 0.20 |
| Qwen1.5-7B-Chat | 0.49 | 0.79 | 0.68 | 0.53 | 0.61 | 0.69 |
| HALU-J (w/o DPO) | 0.90 | 0.90 | 0.69 | 0.54 | 0.65 | 0.76 |
| HALU-J (with DPO) | 0.91 | 0.90 | 0.70 | 0.54 | 0.65 | 0.76 |
5.2 Critique Quality and Evidence-Match
| Model | Critique Score | Evidence-Match Rate |
|---|---|---|
| GPT-3.5-Turbo | 72.35 | 59.29% |
| GPT-4o | 85.85 | 61.43% |
| Mistral-7B-Instruct | 61.30 | 51.22% |
| Llama-2-13B-Chat | 45.20 | 40.86% |
| Llama-3-8B-Instruct | 76.15 | 47.57% |
| Qwen1.5-7B-Chat | 66.40 | 52.32% |
| HALU-J (w/o DPO) | 82.60 | 66.89% |
| HALU-J (with DPO) | 83.90 | 68.11% |
5.3 DPO Ablation
| Variant | ME-FEVER Acc | ANLI Acc | Critique Score | Evidence-Match |
|---|---|---|---|---|
| HALU-J (w/o DPO) | 0.90 | 0.69 | 82.60 | 66.89% |
| HALU-J (with DPO) | 0.91 | 0.70 | 83.90 | 68.11% |
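The gains attributable to DPO follow directly from the two rows of the ablation table. The short computation below recomputes them; the dictionary keys are illustrative names, and the values are copied from the table.

```python
without_dpo = {
    "me_fever_acc": 0.90,
    "anli_acc": 0.69,
    "critique_score": 82.60,
    "evidence_match_pct": 66.89,
}
with_dpo = {
    "me_fever_acc": 0.91,
    "anli_acc": 0.70,
    "critique_score": 83.90,
    "evidence_match_pct": 68.11,
}

# Round to two decimals to suppress floating-point noise in the differences.
deltas = {k: round(with_dpo[k] - without_dpo[k], 2) for k in without_dpo}
# deltas == {"me_fever_acc": 0.01, "anli_acc": 0.01,
#            "critique_score": 1.3, "evidence_match_pct": 1.22}
```

The accuracy gains are a single point on each benchmark, while the critique score and evidence-match rate move by a little over one point each.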
Imposing a rigid JSON format benefits some LLMs but degrades others; HALU-J is robust due to its format-focused SFT regimen.
6. Impact, Limitations, and Outlook
HALU-J establishes a new state of the art for multi-evidence hallucination detection, achieving 0.91 accuracy on ME-FEVER against 0.83 for GPT-4o. The multi-stage evidence handling pipeline yields interpretable, high-quality critiques, and DPO fine-tuning adds small but consistent gains in accuracy (+0.01 on ME-FEVER and ANLI), critique quality (+1.3 points), and evidence-match rate (+1.2 percentage points).
Limitations include a primary focus on commonsense or information-seeking hallucinations; detection of computational or numerical errors is not addressed. Single-evidence performance also leaves room for improvement: on ANLI, WANLI, HaluEval, and KBQA, HALU-J trails GPT-4o (Section 5.1). The open-source release of HALU-J and ME-FEVER provides a resource for research in fact-checking, critique generation, and preference-based evaluation of LLMs (Wang et al., 2024).