
Vectara's Hallucination Leaderboard

Updated 2 February 2026
  • Vectara’s Hallucination Leaderboard is a framework designed to systematically measure hallucination and refusal rates in retrieval-augmented generation tasks.
  • It employs both the automated HHEM and the human-guided FaithJudge, enabling consistent evaluation of model faithfulness across closed- and open-source LLMs.
  • The leaderboard leverages fixed news article datasets to facilitate reproducible, apples-to-apples model comparisons and longitudinal tracking of summary integrity.

Vectara’s Hallucination Leaderboard is a publicly available benchmarking framework for systematic evaluation, comparison, and longitudinal tracking of hallucination rates of LLMs used in retrieval-augmented generation (RAG) pipelines. It enables practitioners and researchers to evaluate both closed-source and open-source models on consistent grounded summarization tasks, providing actionable diagnostics on model faithfulness and refusal behavior. The leaderboard has evolved from employing the Hughes Hallucination Evaluation Model (HHEM) as its principal automated detector to integrating FaithJudge, a few-shot LLM-as-a-judge approach guided by human annotations, which significantly improves the alignment of automated hallucination detection with reference human labels (Tamber et al., 7 May 2025).

1. Motivation, Design, and Task Scope

Retrieval-augmented generation (RAG) is intended to reduce hallucinations by conditioning LLM outputs on explicit evidentiary context. However, LLMs still frequently produce unsupported or contradictory information when summarizing documents, a phenomenon labeled “hallucination.” Vectara’s Hallucination Leaderboard was created to:

  • Quantify, compare, and track hallucination rates of numerous LLMs in a fixed, realistic RAG summarization setting.
  • Provide a continually updated benchmark supporting new model releases.
  • Offer practitioners direct diagnostics (hallucination rate, refusal rate) on model faithfulness.

Design choices include a fixed, curated set of approximately 130 diverse news articles from BBC, CNN, Wikipedia, and Daily Mail (median length ≈217 words, IQR [42, 424]). Each model is prompted to generate a concise, grounded summary per article; these are then automatically analyzed. The primary leaderboard metrics are hallucination rate and refusal rate, facilitating apples-to-apples ranking as new models and detectors develop (Tamber et al., 7 May 2025).

2. Hallucination Detection Engine: HHEM

The Hughes Hallucination Evaluation Model (HHEM) is a fine-tuned transformer trained on aggregated claims-level annotations, including data from RAGTruth. HHEM operates in two principal classification modes:

  • Claim-wise: Summaries are decomposed into atomic claims, each labeled Supported or Unsupported.
  • Summary-wise: The entire summary is classified as Consistent or Inconsistent.
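The two modes above are related: a claim-wise detector's outputs can be aggregated into a summary-wise label. A minimal sketch, assuming the natural aggregation rule that a summary is Consistent only when every atomic claim is Supported (the source does not specify HHEM's internal aggregation, so this rule is illustrative):

```python
def summary_label_from_claims(claim_labels):
    """Aggregate claim-wise labels into a summary-wise label.

    claim_labels: list of booleans, one per atomic claim,
                  True if the claim is Supported by the source article.
    A single unsupported claim marks the whole summary Inconsistent.
    """
    return "Consistent" if all(claim_labels) else "Inconsistent"
```

For example, a summary decomposed into three claims where one contradicts the article would be aggregated as `summary_label_from_claims([True, False, True])`, yielding "Inconsistent".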

HHEM outputs a binary consistency label $y \in \{0,1\}$ per instance. The hallucination rate for model $m$ is given by:

$$\mathrm{HallucinationRate}_m = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\text{HHEM flags summary } i \text{ as inconsistent})$$

where $N$ is the size of the article set. Refusal rate is defined as the proportion of outputs with token length $\leq 5$:

$$\mathrm{RefusalRate}_m = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(|\mathrm{output}_i| \leq 5)$$

Models are ranked by ascending hallucination rate, with side-by-side reporting of refusal rates.
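The two leaderboard metrics are straightforward averages over the article set. A minimal sketch, assuming each model output is available as a token list and each HHEM decision as a boolean flag (the function and variable names here are illustrative, not part of the leaderboard's codebase):

```python
def leaderboard_metrics(outputs, hhem_flags, refusal_token_threshold=5):
    """Compute hallucination rate and refusal rate for one model.

    outputs:    list of generated summaries, each a list of tokens.
    hhem_flags: parallel list of booleans, True when HHEM labels the
                summary inconsistent with its source article.
    """
    n = len(outputs)
    hallucination_rate = sum(hhem_flags) / n
    # An output of <= 5 tokens is counted as a refusal to summarize.
    refusal_rate = sum(len(out) <= refusal_token_threshold for out in outputs) / n
    return hallucination_rate, refusal_rate
```

With four articles where HHEM flags one summary and one output is a two-token refusal, this yields a 25% hallucination rate and a 25% refusal rate.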

3. Methodology for Benchmarking and Ranking

For each LLM, summaries are generated using a fixed prompt requesting concise, evidence-grounded responses. The HHEM-2.1-open detector (110M parameters) is automatically applied, yielding model-level metrics. The leaderboard ranks all evaluated models—stratified by family (GPT, Gemini, Llama, Mistral)—by increasing hallucination rate. The dataset is static but may be periodically augmented; all test examples are filtered for content appropriateness. This process provides a reproducible empirical foundation for model assessment, crucial for both academic and applied RAG use cases (Tamber et al., 7 May 2025).

4. Empirical Findings and HHEM Limitations

Several challenges with HHEM and similar detectors have emerged:

  • Detection ceiling: On adversarial benchmarks like FaithBench, HHEM-2.1 achieves only ≈67% balanced accuracy (claim-wise), and ≈66% (summary-wise).
  • Underdetection of nuanced hallucinations: "Benign" and "Questionable" cases tend to be missed.
  • Generalization failure: Accuracy decreases on model outputs from generations absent in HHEM’s training data.
  • LLM-as-judge (zero-shot) limits: Methods such as GPT-4o and FACTS-Grounding plateau below 78% balanced accuracy under similar adversarial conditions (Tamber et al., 7 May 2025).

These limitations motivated a move toward enhanced evaluation rooted in direct human supervision.

5. FaithJudge: Few-Shot LLM-as-a-Judge with Human Supervision

FaithJudge addresses HHEM’s shortcomings by guiding an LLM-judge via few-shot exemplars annotated by humans. Key components:

  • Human-annotated few-shot exemplars: For each article, 10 LLM-generated summaries are annotated at the span level with hallucination severities ("Unwanted," "Benign," "Questionable," "Consistent").
  • Prompt structure: FaithJudge supplies these exemplars and the candidate summary to be evaluated. The LLM identifies unsupported or contradictory spans, reasons about severity, then outputs a binary Consistent/Inconsistent label with justification.
  • Scoring: The FaithJudge hallucination rate is

$$\mathrm{FaithJudgeRate}_m = \frac{1}{N}\sum_{i=1}^{N} y_J(s_i)$$

where $y_J$ is the FaithJudge judgment per summary $s_i$.

  • Ensembling: When multiple judges (e.g., o3-mini-high, Gemini-2.0, Llama-4) are available, decisions may be aggregated by majority vote: $y_{\mathrm{Ensemble}} = \mathrm{mode}\{y_{J_1}, y_{J_2}, y_{J_3}\}$
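The majority-vote ensembling above is the mode of the individual binary decisions. A minimal sketch, assuming an odd number of judges so ties cannot arise:

```python
from statistics import mode

def ensemble_judgment(judgments):
    """Majority vote over an odd number of binary judge decisions.

    judgments: e.g. [1, 0, 1] from three independent LLM judges,
               where 1 marks a summary as Inconsistent.
    """
    return mode(judgments)
```

For three judges voting `[1, 0, 1]`, the ensemble marks the summary Inconsistent; using an odd judge count sidesteps the tie-breaking question entirely.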

FaithJudge replaces or supplements the HHEM hallucination rate, and task-level stratification is introduced (summarization, QA, data-to-text) (Tamber et al., 7 May 2025).

6. Impact of Enhanced Leaderboard and Empirical Results

FaithJudge achieves substantial empirical gains over HHEM and naive LLM-judge baselines:

  • FaithBench detection: HHEM-2.1 (summary-wise) reports balanced accuracy of 52.6%, F1 of 32.9%. The o3-mini-high LLM-judge reaches 68.8%/60.7% (bal acc/F1). FaithJudge itself (few-shot, o3-mini-high) achieves 84.0% bal acc and 82.1% F1.
  • Ranking stability: Using the ground-truth summary order on FaithBench, the original HHEM leaderboard produces 16 inversions; FaithJudge reduces this to 6.
  • Multi-task generality: In the RAGTruth-QA and Data2Txt tasks, FaithJudge attains ≥85% balanced accuracy.
  • Model differentiation: FaithJudge detects finer performance splits; for example, Gemini-2.0-Flash (≈7.6% hallucination rate) outperforms GPT-4.5-Preview (≈12%) in aggregate rankings.
  • Leaderboard layout: Results are now partitioned by task type and model group, improving interpretability (Tamber et al., 7 May 2025).
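The ranking-stability figures above (16 inversions for HHEM vs. 6 for FaithJudge) count pairs of models that the detector orders differently from the ground-truth ranking. A minimal sketch of such an inversion counter (the source does not publish its exact counting code, so this is an illustrative implementation of the standard pairwise-inversion metric):

```python
def count_inversions(predicted_order, true_order):
    """Count pairs of models ordered differently by the detector's
    ranking (predicted_order) and the ground-truth ranking (true_order).

    Both arguments are lists of the same model names, best-first.
    0 means identical rankings; n*(n-1)/2 means fully reversed.
    """
    pos = {model: i for i, model in enumerate(true_order)}
    ranks = [pos[m] for m in predicted_order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )
```

A fully reversed three-model ranking has all three pairs inverted, while an identical ranking has zero; fewer inversions indicates closer agreement with the reference human ordering.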

7. Significance and Broader Implications

Vectara’s Hallucination Leaderboard, with the FaithJudge enhancement, now defines a high-fidelity standard for scalable LLM faithfulness assessment in RAG and related settings. The combination of:

  • Strong human alignment (task-specific few-shot guidance),
  • Improved adversarial robustness (84% bal acc on FaithBench),
  • Tighter correspondence with reference human judgments (∼60% reduction in ranking errors vs. HHEM),
  • Multi-task and multi-family benchmarking,

positions FaithJudge as a benchmark for automated hallucination evaluation. By stratifying tasks and exposing detailed diagnostics, the leaderboard supports both academic comparison and practical model selection, and directly motivates further research in detection, mitigation, and grounding of LLM-generated outputs (Tamber et al., 7 May 2025).

References (1)
