UniFact Framework: Unified Evaluation of LLM Factuality
- UniFact Framework is a unified evaluation paradigm that integrates hallucination detection and fact verification to assess LLM outputs.
- It employs dynamic instance generation and authoritative labeling to provide direct, instance-level comparisons between evaluation methods.
- Empirical results reveal complementary strengths of HD and FV approaches, with hybrid methods outperforming individual techniques in accuracy and AUC.
The UniFact framework provides a unified, dynamic evaluation paradigm for assessing factual consistency in outputs from LLMs. It systematically bridges two previously distinct methodologies: Hallucination Detection (HD), which leverages internal model signals to detect factual errors, and Fact Verification (FV), which treats generated claims as stand-alone statements to be verified against external evidence. UniFact enables direct, instance-level comparison between HD and FV approaches by dynamically generating model outputs with authoritative ground truth, applying automated binary factuality annotation, and enforcing a unified evaluation protocol. Empirical findings demonstrate that HD and FV address complementary failure modes, that hybridization delivers superior performance, and that unification addresses underlying schisms in factuality assessment research (Su et al., 2 Dec 2025).
1. Problem Statement and Formal Framework
Fact Verification (FV) is defined as evaluating whether a claim $c$ is consistent with retrieved evidence $E$. The scoring function takes the form $s_{\mathrm{FV}} = f_{\mathrm{FV}}(c, E)$, where higher values indicate greater likelihood of non-factuality (i.e., hallucination); the claim $c$ may originate from either humans or LLMs.
Hallucination Detection (HD) takes as input an LLM $M$ and its generated output $y$ under a specific prompt $x$, and captures both intrinsic signals $I$ (such as logits, token entropies, activations, and cross-sample consistency) and, optionally, extrinsic signals $E$. Its scoring function takes the form $s_{\mathrm{HD}} = f_{\mathrm{HD}}(M, x, y, I, E)$, with higher values denoting a likely hallucination.
Both paradigms reduce to binary classification: given an instance $i$ with score $s_i$ and ground-truth label $\ell_i \in \{0, 1\}$ ($1$ for non-factual/hallucinated), performance is measured via Accuracy and Area Under the ROC Curve (AUC) over all $N$ instances.
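As an illustration of this shared protocol, the following minimal sketch computes both metrics from a list of scores and binary labels. The function names, the 0.5 decision threshold, and the toy data are assumptions made for the example; the source does not specify how scores are thresholded for Accuracy.

```python
from typing import Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of instances whose thresholded score matches the label.

    Label 1 means non-factual/hallucinated. The 0.5 threshold is an assumption.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random hallucinated instance outscores a random factual one.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: four instances, two hallucinated (label 1) and two factual (label 0).
scores = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 0, 1]
print(accuracy(scores, labels), roc_auc(scores, labels))
```

Because AUC depends only on the ranking of scores, it needs no threshold, which makes it the more comparable of the two metrics when HD and FV scores live on different scales.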
2. UniFact Pipeline Architecture
UniFact implements a three-stage pipeline for unifying evaluation:
Stage 1: Dynamic Instance Generation
Each instance comprises an input triplet $(q, A, E)$, with $q$ a factual question, $A$ a set of authoritative answers, and $E$ a set of authoritative evidence passages. The target LLM ($M$) generates an answer $y$ from $q$, and simultaneously all desired intrinsic signals are captured, including token entropy, hidden states, and, if required, multiple stochastic samples for consistency-based methods.
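To make one of these intrinsic signals concrete, the sketch below computes token entropy from per-step next-token probability distributions and averages it over an answer. It assumes the distributions are already available as plain lists of probabilities; the function names and the choice of a simple mean are illustrative assumptions, not the aggregation used by any specific HD baseline in the source.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over all decoding steps of one answer."""
    entropies: List[float] = [token_entropy(d) for d in step_distributions]
    return sum(entropies) / len(entropies)


# Toy example with a two-token vocabulary: one uncertain step, one confident step.
steps = [[0.5, 0.5], [1.0, 0.0]]
print(mean_token_entropy(steps))
```

Higher average entropy indicates that the model was less certain while generating, which is the kind of signal that only a dynamic, generation-time pipeline can record.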
Stage 2: Reference-Based Automated Annotation
Ground-truth labels are generated by a separate “judge” LLM (specifically Qwen-2.5-32B) using the instance's authoritative references and a strict rubric to check the generated answer for omissions or contradictions of the authoritative answers. Validation by human annotators on 1,602 samples showed 97.4% agreement on hallucination labels and 99.0% on no-hallucination labels, supporting the quality of the automated labels.
Stage 3: Unified Evaluation Interface
Both HD and FV methods are applied to the same generated outputs with their respective inputs:
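- HD methods use the target LLM, the prompt, the generated answer, and the captured intrinsic signals.
- FV methods treat the generated answer as a claim, retrieve evidence from a Wikipedia corpus (using BM25 over DPR Wikipedia), and predict a factuality score.
Both outputs are evaluated via Accuracy and AUC against the ground-truth labels.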
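The retrieval step on the FV side can be illustrated with a small, self-contained BM25 ranker. This is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1 = 1.5` and `b = 0.75`, a Lucene-style IDF, and a four-passage toy corpus stand in for whatever tokenizer, parameters, and index the authors actually used over DPR Wikipedia.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


class BM25:
    def __init__(self, passages: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(p) for p in passages]
        self.tfs = [Counter(d) for d in self.docs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        # Document frequency: number of passages containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Lucene-style IDF, which stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, idx: int) -> float:
        """BM25 score of passage `idx` for the query."""
        tf, length = self.tfs[idx], len(self.docs[idx])
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query) if t in tf
        )

    def top_k(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        """Indices and scores of the k best passages, best first."""
        ranked = sorted(((i, self.score(query, i)) for i in range(len(self.docs))),
                        key=lambda pair: pair[1], reverse=True)
        return ranked[:k]


corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
    "Mount Everest is in Nepal",
]
print(BM25(corpus).top_k("capital of France"))
```

In an FV setting the query would be built from the question, optionally together with the generated answer, and the returned passages would be passed to the verifier as evidence.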
3. Mathematical Metrics and Hybridization
Evaluation within UniFact employs several core mathematical formulations:
- Binary Classification Metrics: Both paradigms are measured by Accuracy and AUC over the full set of evaluated instances.
- Hybrid Scoring (Score-Level Fusion): This approach combines HD and FV scores, both rescaled so that higher means “more likely hallucinated”.
- Evidence-Aware Pipeline (Hierarchical): The FV module outputs Supported, Contradicted, or Not Enough Information (NEI); Supported and Contradicted verdicts are accepted directly, otherwise the system falls back on the HD method’s score.
- Complementarity Metrics:
  - Average Complementarity Score (ACS): proportion of instances where the two methods’ successes and failures are complementary.
  - Average Synergy Gain (ASG): improvement of the union over the best individual accuracy.
  - Average Error Correction Rate (AECR): mean probability that one method’s correct prediction corrects the other’s error.
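The two hybridization strategies above can be sketched as follows. The exact fusion formula is not given in this text, so the convex combination with a `weight` parameter is an assumed, commonly used form rather than the authors' definition. Likewise, mapping a Supported verdict to score 0.0 and a Contradicted verdict to 1.0 is an assumption about how "accepted directly" translates into a hallucination score. Both functions expect scores already rescaled to [0, 1], with higher meaning more likely hallucinated.

```python
def fuse_scores(hd_score: float, fv_score: float, weight: float = 0.5) -> float:
    """Score-level fusion as a convex combination (assumed form).

    `weight` is the share given to the HD score; it is a hypothetical
    parameter that would normally be tuned on held-out data.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * hd_score + (1.0 - weight) * fv_score


def evidence_aware_score(fv_verdict: str, hd_score: float) -> float:
    """Hierarchical rule: trust a decisive FV verdict, otherwise defer to HD."""
    if fv_verdict == "Supported":
        return 0.0  # assumed mapping: evidence supports the claim
    if fv_verdict == "Contradicted":
        return 1.0  # assumed mapping: evidence contradicts the claim
    if fv_verdict == "NEI":
        return hd_score  # not enough information: fall back on the HD score
    raise ValueError(f"unknown FV verdict: {fv_verdict!r}")


print(fuse_scores(0.8, 0.4))
print(evidence_aware_score("NEI", 0.7))
```

The hierarchical variant only consults the HD signal when retrieval fails to settle the question, whereas score-level fusion always blends both sources.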
4. Experimental Protocol and Benchmarking
UniFact experimental evaluations span diverse datasets, model families, and detection baselines:
- Test Sets: TriviaQA (TQA), NaturalQuestions-Open (NQ), PopQA (PQA), 2WikiMultihopQA: Bridge & Comparison, and HotpotQA (Comparison only, HComp), each with 500 test examples.
- LLM Families: Meta-Llama-3.1-8B-Instruct, Qwen2.5-14B-Instruct.
- HD Baselines: Intrinsic-only (LNPP, LNPE, PTrue, SAPLMA, MIND, EUBHD), consistency-based (SelfCheckGPT variants; SE, SEU, SIndex).
- FV Baselines: LLM-based (LLM-Q, LLM-QA), NLI-based (BERT-Q, BERT-QA, fine-tuned on automated training labels).
- Retrieval: BM25 over DPR Wikipedia (21 million passages), top-3 passages per query.
- Statistical Testing: Paired t-tests according to SMJ07 for significance.
Both Accuracy and AUC are used to quantify each method’s ability to distinguish hallucinated/non-factual outputs.
5. Principal Empirical Findings
The UniFact benchmark demonstrates several central findings:
- No Universal Superiority: No single paradigm (HD or FV) dominates across all benchmarks or LLM families. HD methods may excel on certain datasets with LLaMA but underperform on Qwen, whereas FV is more consistent and benefits when the generated answer is included in evidence retrieval.
- Complementarity: Cross-paradigm complementarity is substantial. The cross-paradigm Average Complementarity Score (ACS) is 0.428, exceeding intra-HD (0.315) and intra-FV (0.379). Average Synergy Gain (ASG) and Average Error Correction Rate (AECR) are also highest for HD–FV hybrid pairs.
- Hybrid State-of-the-Art: Score-Level Fusion and hierarchical Evidence-Aware pipelines consistently outperform either paradigm individually (e.g., best AUC 0.825 on TQA with LLaMA, 0.817 with Qwen). Hybridization also reduces variance across LLM backbones.
| Metric | Cross-paradigm | Intra-HD | Intra-FV |
|---|---|---|---|
| ACS | 0.428 | 0.315 | 0.379 |
| ASG | 0.144 | 0.118 | 0.102 |
| AECR | 0.634 | 0.503 | 0.496 |
ACS: Average Complementarity Score; ASG: Average Synergy Gain; AECR: Average Error Correction Rate.
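For a single pair of methods, the three quantities in the table can be sketched from per-instance correctness vectors. The precise formulas are not stated in this text, so the sketch follows one plausible reading of the verbal definitions: complementarity as the share of instances where exactly one method is correct, synergy gain as the accuracy of "either method correct" minus the better individual accuracy, and error correction rate as the mean of the two conditional probabilities that one method is right when the other is wrong. Averaging these per-pair values over all method pairs would then give the reported averages; that step, and all names below, are assumptions.

```python
from typing import Dict, Sequence


def pair_complementarity(correct_a: Sequence[bool],
                         correct_b: Sequence[bool]) -> Dict[str, float]:
    """Per-pair complementarity statistics under the assumed definitions."""
    pairs = list(zip(correct_a, correct_b))
    n = len(pairs)
    acc_a = sum(a for a, _ in pairs) / n
    acc_b = sum(b for _, b in pairs) / n
    exactly_one = sum(a != b for a, b in pairs) / n
    either = sum(a or b for a, b in pairs) / n
    # How often each method is right on the other's errors.
    b_when_a_wrong = [b for a, b in pairs if not a]
    a_when_b_wrong = [a for a, b in pairs if not b]
    rates = [sum(r) / len(r) for r in (b_when_a_wrong, a_when_b_wrong) if r]
    return {
        "complementarity": exactly_one,
        "synergy_gain": either - max(acc_a, acc_b),
        "error_correction_rate": sum(rates) / len(rates) if rates else 0.0,
    }


# Toy example: five instances, each method correct on three of them.
hd_correct = [True, True, False, False, True]
fv_correct = [True, False, True, False, True]
print(pair_complementarity(hd_correct, fv_correct))
```

In this toy case the two methods disagree on two of five instances, so combining them can recover errors that either one makes alone, which is the effect the cross-paradigm column quantifies.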
6. Analytical Insights on the FV–HD Divide and Unification
The divergence between FV and HD research is rooted in distinct technical philosophies:
- FV methodology originates in information retrieval and NLI, treating each claim as independent of model internals.
- HD’s recent emergence is tied to LLMs, leveraging introspective signals such as uncertainty and hidden activations, which FV benchmarks using static outputs cannot utilize.
- Static LLM output corpora are insufficient for modern, real-time, and white-box evaluation needs.
UniFact’s architectural unification—on-the-fly generation, evidence-anchored automated labeling, and unified evaluation—enables direct comparison and hybridization. Empirically, neither paradigm subsumes the other; their complementarity is numerically quantified, and hybridization consistently advances performance and robustness.
7. Synthesis and Implication
UniFact constitutes a scalable, model-agnostic framework that unifies HD and FV approaches for LLM factuality assessment. By aligning input data (dynamically generated LLM outputs), ground-truth labeling (automated, reference-anchored), and evaluation criteria (binary metrics applied to both HD and FV), UniFact reveals that internal model uncertainty and external fact grounding are complementary, not competing, methodologies. Only through their integration can the diverse and subtle failure modes of modern LLMs be robustly detected and mitigated (Su et al., 2 Dec 2025).