TruthfulQA Scores

Updated 12 May 2026

TruthfulQA Scores are metrics that assess LLM factual accuracy under adversarial conditions using diverse scoring protocols such as human judgment, multiple-choice, and LLM-as-a-judge.
They reveal inverse scaling issues where larger models often mirror widespread misconceptions, underscoring challenges in mitigating hallucinations and symbolic encoding failures.
Methodological interventions including Direct Preference Optimization, truth forests, and pointwise re-ranking offer significant gains in model calibration and truthfulness.

TruthfulQA scores quantify the factual accuracy of LLMs under adversarial conditions that are explicitly designed to elicit imitative falsehoods—responses that closely mimic widespread but incorrect beliefs in human-written text. The benchmark was introduced to expose models’ proclivity for confidently generating plausible yet factually false content. Its scoring regime, empirical performance results, and subsequent influence on the LLM evaluation and alignment literature have established TruthfulQA as a central metric for measuring and improving model truthfulness.

1. TruthfulQA Benchmark Structure and Scoring Protocols

TruthfulQA consists of 817 natural language questions spanning 38 topical categories (e.g., health, law, finance, conspiracies, superstitions, fiction) and is adversarially filtered such that both humans and models often answer incorrectly due to entrenched misconceptions (Lin et al., 2021). For each question, there are multiple reference “true” and “false” answers with supporting sources. Several scoring methods are deployed for different experimental purposes:

Human-Judged Generation Score: For a model $M$ producing answers $a_i$ to all questions, the primary binary truthfulness score is

$\mathrm{TruthfulQA}(M) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[T(a_i) \ge 0.5]$

where $T(a_i) \in [0,1]$ is the human-assigned truthfulness label for answer $a_i$, thresholded at $0.5$, and $N$ is the number of questions.
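The thresholded average above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's official scoring code; the function name `truthful_fraction` and the example labels are hypothetical.

```python
from typing import Sequence


def truthful_fraction(labels: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of answers whose truthfulness label T(a_i) is >= threshold."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for t in labels if t >= threshold) / len(labels)


# Hypothetical human labels for five answers (illustrative values only).
example_labels = [1.0, 0.0, 0.5, 0.25, 0.75]
score = truthful_fraction(example_labels)  # three of five labels reach 0.5
```

Because the indicator uses $\ge 0.5$, a label of exactly 0.5 counts as truthful in this sketch.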

Multiple-Choice Accuracy (MC1/MC2):
- MC1: Exactly one correct of four choices. Score is the fraction of times the model’s top-probability answer matches a correct option:
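$\mathrm{MC1} = \frac{1}{N} \sum_{i=1}^N \mathbb{1} \left[ \arg\max_{a \in \mathcal{A}_i} P(a \mid q_i) \in \mathcal{C}_i \right]$
where $\mathcal{A}_i$ is the set of answer options for question $q_i$ and $\mathcal{C}_i \subseteq \mathcal{A}_i$ is the set of correct options.
- MC2: Multiple correct among four choices. The score is the average total probability assigned to true options, normalized over all options (Chen et al., 2024).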
LLM-as-a-Judge: A separately trained LLM $f_\theta$ classifies each generated answer as truthful or not. Score is the mean output of $f_\theta$ on all answers (Figueras et al., 13 Feb 2025).

These scoring approaches are used with both open-ended generations and multiple-choice (MC1/MC2) evaluation harnesses.
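The two multiple-choice scores can be illustrated with a short Python sketch. It assumes each question comes with one log-probability per answer option and a flag marking which options are true; how those log-probabilities are obtained (for example, summed or length-normalized token log-likelihoods) varies between evaluation harnesses and is not specified here. All function names and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Item = Tuple[Sequence[float], Sequence[bool]]  # (option log-probs, is_true flags)


def mc1_correct(logprobs: Sequence[float], is_true: Sequence[bool]) -> bool:
    """True if the highest-probability option is a correct one."""
    best = max(range(len(logprobs)), key=lambda j: logprobs[j])
    return is_true[best]


def mc2_mass(logprobs: Sequence[float], is_true: Sequence[bool]) -> float:
    """Probability mass on true options, normalized over all options."""
    probs = [math.exp(lp) for lp in logprobs]
    return sum(p for p, t in zip(probs, is_true) if t) / sum(probs)


def score_benchmark(items: List[Item]) -> Tuple[float, float]:
    """Average MC1 and MC2 over all questions."""
    n = len(items)
    mc1 = sum(mc1_correct(lp, t) for lp, t in items) / n
    mc2 = sum(mc2_mass(lp, t) for lp, t in items) / n
    return mc1, mc2


# Two illustrative questions with four options each (made-up probabilities).
items = [
    ([math.log(p) for p in (0.4, 0.3, 0.2, 0.1)], [True, False, False, False]),
    ([math.log(p) for p in (0.1, 0.2, 0.3, 0.4)], [True, True, False, False]),
]
mc1, mc2 = score_benchmark(items)
```

In the second question the top-ranked option is false, so it contributes nothing to MC1, yet the true options still hold 30% of the probability mass and contribute to MC2. This is why MC2 is usually the higher of the two numbers.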

2. Empirical Results: Model-Level Scores and Scaling Trends

2.1 Historical Baselines and Inverse Scaling

Human Baseline: 94% truthful (binary), 87% truthful and informative with internet access (Lin et al., 2021).
GPT-3-175B (helpful prompt): 58% truthful, 21% truthful and informative (human-judged, generative).
Earlier LLMs: Smaller models outperformed larger ones in truthfulness due to larger models' propensity to echo web-scale misconceptions—an instance of “inverse scaling.” For example, GPT-3-175B (QA-prompt) scored 21% versus GPT-3-350M’s 37% (Lin et al., 2021).

2.2 State-of-the-Art Results and Methodological Advances

Recent interventions have substantially improved TruthfulQA outcomes:

Model/Method	MC1 (%)	MC2 (%)	Generation True %†	Halluc. Rate (%)	Judge-LLM (%)	Source
LLaMA-2-7B (pretrained)	30.2	45.3	40.8	–	–	(Chen et al., 2024, Chen et al., 2023)
LLaMA-2-7B (+Truth Forest)	–	–	74.5	–	–	(Chen et al., 2023)
Llama2-Chat-7B (+GRATH)	54.7	69.1	–	–	–	(Chen et al., 2024)
Gemma-2-27B-It (en, Judge)	–	–	–	–	84.0	(Figueras et al., 13 Feb 2025)
GPT-4 (generation acc.)	–	–	46.0	–	–	(Tian et al., 2023)
Best current open "base" models	~44–48	–	–	–	~61	(Figueras et al., 13 Feb 2025)

†“Generation True %” is the fraction of generated answers labeled truthful by the GPT-judge.

2.3 Consensus and Calibration

Graph-attention consensus across multiple LLMs yields 50.1% accuracy and reduces hallucination rate to 26.1% (from 33.7% for the best single model) (Kallem, 12 Jan 2026).
RLHF models such as GPT-3.5-turbo and GPT-4 achieve 42–46% generation accuracy but are poorly calibrated by default. Verbalized confidences and temperature scaling yield expected calibration error (ECE) reductions up to 72%, with no change in accuracy (Tian et al., 2023).
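Expected calibration error can be made concrete with a small Python sketch. It uses the common equal-width binning estimator; the bin count, the function name, and the example confidences are assumptions for illustration and are not taken from Tian et al. (2023).

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Overconfident hypothetical model: high stated confidence, 50% accuracy.
ece_before = expected_calibration_error(
    [0.95, 0.90, 0.85, 0.80], [True, False, True, False]
)
# Same answers, confidences lowered to match the observed accuracy.
ece_after = expected_calibration_error(
    [0.5, 0.5, 0.5, 0.5], [True, False, True, False]
)
```

The answers, and therefore the accuracy, are identical in both calls; only the stated confidences differ. This mirrors the point that calibration can improve with no change in accuracy.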

3. Category-Level and Error-Type Analysis

TruthfulQA exposes particular weaknesses across semantic and symbolic categories:

GPT-3-175B (“helpful”): ~58% truthful overall, but only 20–30% in categories involving superstitions, myths, and folk logic.
LLaMA-2-7B: Category-wise True % rises sharply from the baseline to ~70–80% across all 38 categories under the Truth Forest intervention (Chen et al., 2023).
"Surprisingly likely" answer selection (pointwise mutual information scoring; Goel, 2023) increases scores by up to 24 percentage points overall and by up to 70 pp in categories such as “Confusion: People” and “Science.”
“Symbolic triggers”—modifiers, negation, numbers, exceptions, and named entities—pose persistent challenges, with near-100% hallucination rates for small Gemma and Llama models, improving only modestly with scaling (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).
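The "surprisingly likely" selection mentioned in this list can be sketched as a re-ranking step. The sketch below assumes the score is the log-ratio between an answer's probability given the question and its probability under a question-free reference context; that reading of the denominator, the function name, and the example probabilities are all assumptions for illustration, not details confirmed by Goel (2023).

```python
import math
from typing import Dict, Sequence


def surprisingly_likely(
    candidates: Sequence[str],
    cond_logp: Dict[str, float],
    ref_logp: Dict[str, float],
) -> str:
    """Pick the candidate with the largest log p(a | q) - log p(a | reference)."""
    return max(candidates, key=lambda a: cond_logp[a] - ref_logp[a])


candidates = ["You will have bad luck.", "Nothing in particular happens."]
# Hypothetical probabilities: the misconception is likely with or without the
# question, while the true answer is rare in general but boosted by the question.
cond_logp = {candidates[0]: math.log(0.6), candidates[1]: math.log(0.3)}
ref_logp = {candidates[0]: math.log(0.5), candidates[1]: math.log(0.05)}

by_likelihood = max(candidates, key=lambda a: cond_logp[a])
by_pmi = surprisingly_likely(candidates, cond_logp, ref_logp)
```

With these numbers, plain likelihood ranking picks the misconception, while the ratio-based ranking picks the answer whose probability rises most once the question is known.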

4. Failure Modes, Hallucination Analysis, and Structural Correlates

Hallucination: Defined as the fraction of factually incorrect, plausible outputs per symbolic property or in aggregate. For Gemma-2-2B, hallucination rates for symbolic triggers (negation, numbers, etc.) persist at 95–100%, decreasing to ~90% in Gemma-2-27B (Lamba et al., 18 Nov 2025, Lamba et al., 9 Sep 2025).
Internal Mechanisms: Early layers (2–4) in transformer models exhibit sharp attention-variance spikes on symbolic tokens, and these spikes precede near-100% hallucination on adversarial TruthfulQA inputs. This instability in early symbolic semantic encoding persists across model scales and architectures, implicating architectural weaknesses rather than sampling or decoding failures (Lamba et al., 18 Nov 2025).
Entropy and Truthfulness in SLMs: Models with low and stable output entropy and more concentrated attention over decoding steps (e.g., DeepSeek-1.5B) produce more truthful responses, while “exploratory” models (Gemma-1B) with increasing uncertainty yield the highest hallucination rates (Adeseye et al., 4 Apr 2026).
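The entropy contrast described in the last item can be illustrated by tracking Shannon entropy of the next-token distribution at each decoding step. This is a minimal sketch; the source does not specify how entropy was measured, so the per-step definition, the trend statistic, and the example distributions are assumptions.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_trajectory(step_distributions: Sequence[Sequence[float]]) -> List[float]:
    """Entropy of the next-token distribution at each decoding step."""
    return [shannon_entropy(d) for d in step_distributions]


def entropy_trend(trajectory: Sequence[float]) -> float:
    """Crude trend measure: last-step entropy minus first-step entropy."""
    return trajectory[-1] - trajectory[0]


# Hypothetical three-token vocabularies over three decoding steps.
stable_model = [[0.9, 0.05, 0.05]] * 3
exploratory_model = [[0.9, 0.05, 0.05], [0.6, 0.2, 0.2], [0.34, 0.33, 0.33]]

stable_trend = entropy_trend(entropy_trajectory(stable_model))
exploratory_trend = entropy_trend(entropy_trajectory(exploratory_model))
```

A flat trajectory corresponds to the "low and stable" pattern, while a rising one corresponds to the "exploratory" pattern with increasing uncertainty.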

5. Multilingual TruthfulQA: Cross-Lingual Robustness and Limitations

Evaluations on professionally translated TruthfulQA variants in Basque, Catalan, Galician, and Spanish, alongside the English original, show the highest truthfulness scores in English (e.g., model average 76.5%) and the lowest in Basque (55.1%). Gaps are smaller than anticipated, but persist, especially in low-resource settings (Figueras et al., 13 Feb 2025). Machine translation (MT) of the benchmark yields scores statistically equivalent to those from professional translation.
Universal factual questions are nearly solved (∼90% truthfulness) across languages. In contrast, time- or context-dependent questions remain challenging (∼60–70%) (Figueras et al., 13 Feb 2025). A plausible implication is that “universal fact” benchmarks rapidly saturate and do not represent a limiting case for multilingual LLM factuality.

6. Algorithmic and Methodological Interventions to Improve TruthfulQA

Direct Preference Optimization (DPO) and Iterative Pairwise Truthfulness Training
- The GRATH framework combines out-of-domain question augmentation, pairwise generation of valid/invalid answers, and a DPO loss to lift Llama2-Chat-7B’s MC1 from 30.2% to 54.7%, surpassing the 70B models it was compared against (Chen et al., 2024).
Orthogonal Probing and Random Peek (Truth Forest)
- Inference-time interventions using multi-dimensional, orthogonal “truth” probes and random sampling of internal representations boost pre-trained LLaMA-2-7B from 40.8% to 74.5% True % (GPT-judge) (Chen et al., 2023).
Pointwise Mutual Information Re-Ranking
- The “surprisingly likely” method, which selects the candidate with the highest ratio $p(\mathrm{answer} \mid q)/p(\mathrm{answer} \mid ?)$, produces gains of up to 24 percentage points in aggregate and up to 70 pp in key categories, and is robust to quantization and model scale (Goel, 2023).
Verbalized Calibration and Confidence Extraction
- Explicitly prompting RLHF-tuned models to output numeric or verbalized confidence systematically improves calibration (ECE, Brier score) by 50–72% without changing top-1 accuracy, which allows low-confidence answers to be deferred safely in deployment (Tian et al., 2023).
Consensus Reasoning Engines
- Multi-LLM graph-attention consensus raises accuracy by 5 points and reduces the hallucination rate by 22–28% in relative terms compared with the best single model or majority vote, with reliability gains confirmed by calibration curves (Kallem, 12 Jan 2026).
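For the DPO-based item at the top of this list, the pairwise objective can be sketched for a single truthful/untruthful answer pair. This is the standard DPO loss written in plain Python on scalar sequence log-probabilities; how GRATH constructs its pairs and which value of `beta` it uses are not given here, so the parameter value and the example numbers are assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference model: no preference expressed yet.
loss_at_init = dpo_loss(-12.0, -10.0, -12.0, -10.0)
# Policy has shifted probability toward the truthful (chosen) answer.
loss_after = dpo_loss(-9.0, -14.0, -12.0, -10.0)
```

The loss falls as the policy raises the truthful answer's likelihood relative to the reference model and lowers the untruthful one's. At initialization, when policy and reference coincide, it equals $\ln 2$.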

7. Implications, Open Problems, and Recommendations

TruthfulQA exposes inherent limits of next-token-prediction as a training paradigm: scaling alone may degrade truthfulness as larger models become more adept at reproducing human misconceptions (Lin et al., 2021). Persistent symbolic-encoding failures, especially for triggers like negation, modifiers, and numbers, point to structural deficiencies in current architectures that are not easily addressed by training on larger or more diverse corpora (Lamba et al., 18 Nov 2025). Algorithmic interventions using truth-focused pairwise training, internal probing, and external consensus offer substantial gains, but have yet to entirely close the gap with human reliability.

Emerging research shows that even with high overall truthfulness, category- and property-specific weaknesses remain, particularly for context-sensitive, symbolic, or adversarial inputs. Cross-lingual results suggest that comprehensive evaluation must move beyond universal facts toward context- and language-specific knowledge, with robust, scalable evaluation pipelines possible via machine translation (Figueras et al., 13 Feb 2025).

A plausible implication is that future progress on TruthfulQA—and, by extension, reliable factuality in open-domain LLMs—will depend on architectural innovations that enforce symbolic linguistic reasoning, targeted adversarial evaluation, calibrated uncertainty elicitation, and effective consensus aggregation across base and instruction-tuned models.