AVeriTeC Score: Automated Fact Verification Metric

Updated 20 March 2026

AVeriTeC Score is a metric for evaluating automated claim verification systems by jointly assessing QA evidence retrieval and label correctness.
It employs METEOR-based one-to-one matching with the Hungarian algorithm to compare predicted and gold QA pairs, enforcing strict evidence sufficiency via a threshold.
Empirical results show diverse score ranges across systems, motivating balanced retrieval and reasoning to surpass a 0.25 evidence alignment threshold.

The AVeriTeC score is a metric for evaluating automated claim verification systems, introduced with the AVeriTeC dataset and formalized in the AVeriTeC shared tasks (Schlichtkrull et al., 2023, Schlichtkrull et al., 2024, Malon, 2024, Singhal et al., 2024). It measures a system’s end-to-end ability to (1) retrieve sufficient evidence in the form of question–answer (QA) pairs that align semantically with annotated gold evidence, and (2) correctly predict the claim’s veracity, with credit awarded only if both evidence and label meet strict criteria. The AVeriTeC score has become the principal evaluation metric in open-web fact-checking tasks, directly incentivizing systems to demonstrate robust retrieval and faithful reasoning.

1. Formal Definition and Notation

For each claim $i$, a system produces:

Up to 10 predicted QA pairs $\hat{Y}_i$
A predicted veracity label $\hat{v}_i \in \{\text{Supported}, \text{Refuted}, \text{NEE}, \text{CE}\}$

Let $Y_i$ denote the gold set of QA pairs for claim $i$. The core matching mechanism scores predicted QA pairs against gold QA pairs with METEOR string similarity and uses the Hungarian algorithm to find the best one-to-one assignment: $u_f(\hat{Y}_i, Y_i) = \frac{1}{|Y_i|} \max_X \sum_{\hat{y} \in \hat{Y}_i}\sum_{y \in Y_i} \mathrm{METEOR}(\hat{y}, y)\, X(\hat{y}, y)$ where $X$ is a binary assignment matrix in which each predicted pair is matched to at most one gold pair and each gold pair to at most one predicted pair.
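The matching step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the official scorer: each QA pair is assumed to be a single string (question and answer concatenated), `token_f1` is a hypothetical stand-in for METEOR, and a brute-force search over assignments replaces the Hungarian algorithm. The brute-force search reaches the same optimum but is only practical for small sets; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice.

```python
from itertools import permutations


def token_f1(pred: str, gold: str) -> float:
    """Stand-in similarity (NOT METEOR): F1 over lowercased token sets."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def evidence_score(pred_pairs, gold_pairs, sim=token_f1):
    """u_f: best one-to-one matching total, normalised by the gold count."""
    if not gold_pairs:
        raise ValueError("gold_pairs must be non-empty")
    # scores[j][k] = similarity of predicted pair j to gold pair k
    scores = [[sim(p, g) for g in gold_pairs] for p in pred_pairs]
    n_pred, n_gold = len(pred_pairs), len(gold_pairs)
    best = 0.0
    if n_pred >= n_gold:
        # Give every gold pair its own distinct predicted pair.
        for perm in permutations(range(n_pred), n_gold):
            best = max(best, sum(scores[j][k] for k, j in enumerate(perm)))
    else:
        # Fewer predictions than gold pairs: match every predicted pair once.
        for perm in permutations(range(n_gold), n_pred):
            best = max(best, sum(scores[j][k] for j, k in enumerate(perm)))
    return best / n_gold


gold = ["where was the photo taken? in paris", "when was it taken? in 2019"]
pred = ["when was it taken? in 2019", "where was the photo taken? in paris"]
print(evidence_score(pred, gold))  # 1.0: both gold pairs are matched exactly
```

Because the sum is divided by $|Y_i|$ rather than by the number of matches, a system that returns too few pairs is penalised for every unmatched gold pair.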

A threshold $\tau$ (default 0.25) is applied to the matching score to define evidence sufficiency, which gates the per-claim indicator: $s_i = \mathbf{1}[u_f(\hat{Y}_i, Y_i) \geq \tau] \cdot \mathbf{1}[\hat{v}_i = v_i^*]$ with $v_i^*$ the ground-truth label.

The system-level AVeriTeC score is then the mean over all $N$ claims: $\text{AVeriTeC} = \frac{1}{N} \sum_{i=1}^{N} s_i$. This is typically reported as a percentage.
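The per-claim indicator $s_i$ and its mean over claims can be sketched as follows. The function names and the `(u_f, predicted, gold)` record layout are illustrative assumptions, and the evidence scores are hypothetical values rather than results from any system.

```python
def claim_credit(u_f: float, predicted: str, gold: str, tau: float = 0.25) -> int:
    """s_i: 1 only if the evidence clears tau AND the label is correct."""
    return int(u_f >= tau and predicted == gold)


def averitec_score(records, tau: float = 0.25) -> float:
    """Mean of s_i over claims; records are (u_f, predicted, gold) tuples."""
    if not records:
        raise ValueError("no claims to score")
    return sum(claim_credit(u, p, g, tau) for u, p, g in records) / len(records)


records = [
    (0.40, "Supported", "Supported"),  # sufficient evidence, correct label -> 1
    (0.40, "Refuted", "Supported"),    # sufficient evidence, wrong label   -> 0
    (0.10, "Refuted", "Refuted"),      # correct label, weak evidence       -> 0
    (0.25, "NEE", "NEE"),              # exactly at the threshold counts    -> 1
]
print(averitec_score(records))  # 0.5, i.e. 50% when reported as a percentage
```

The third record shows the gating effect: a correct label earns nothing when the evidence score falls below $\tau$.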

2. Evaluation Pipeline and Stepwise Calculation

The following workflow encapsulates the AVeriTeC scoring procedure:

QA Generation: For each claim, a system generates up to $K$ QA pairs ($K = 10$ by some shared task guidelines).
Label Prediction: A veracity label is predicted using the retrieved evidence.
Evidence Scoring: Gold and predicted QA pairs are compared via METEOR, and the best one-to-one matching is established using the Hungarian algorithm to compute the mean evidence alignment score.
Thresholding: If the evidence alignment $u_f(\hat{Y}_i, Y_i)$ achieves or exceeds threshold $\tau$, the claim can receive credit for a correct verdict; otherwise, it receives none.
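Aggregation: The final AVeriTeC score is the average of the per-claim indicator $s_i$ across the $N$ claims in the test set: $\frac{1}{N} \sum_{i=1}^{N} s_i$.
- Only claims with both sufficient evidence ($u_f(\hat{Y}_i, Y_i) \geq \tau$) and a correct label count as successful.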
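A small worked example may help make the steps above concrete. The similarity values are hypothetical, chosen only to show how the one-to-one constraint and the threshold interact; they are not taken from any reported system.

```latex
% Hypothetical similarities; rows = predicted pairs, columns = gold pairs.
M = \begin{pmatrix} 0.50 & 0.40 \\ 0.20 & 0.15 \\ 0.05 & 0.30 \end{pmatrix},
\qquad |Y_i| = 2

% Best one-to-one assignment: \hat{y}_1 \to y_1,\ \hat{y}_3 \to y_2
u_f(\hat{Y}_i, Y_i) = \frac{0.50 + 0.30}{2} = 0.40 \;\geq\; \tau = 0.25

% Evidence is sufficient, so the claim earns credit iff the label is correct.
s_i = \mathbf{1}[0.40 \geq 0.25] \cdot \mathbf{1}[\hat{v}_i = v_i^*]
```

Both gold pairs are most similar to $\hat{y}_1$, but it can be matched only once, so the second gold pair falls back to $\hat{y}_3$. Without the one-to-one constraint the score would be $(0.50 + 0.40)/2 = 0.45$.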

3. Rationale and Metric Design Choices

The AVeriTeC score enforces strict joint evaluation criteria to address core limitations in previous fact-checking metrics:

Evidence Precision: Systems cannot achieve high scores by label guessing; substantial and accurate evidence (as measured against human annotations) is a prerequisite.
Robustness to Paraphrasing and Reasoning: The use of METEOR promotes tolerance to paraphrases and supports systems that generate equivalent though not verbatim evidence. The QA formalism tests systems’ ability to decompose claims into subquestions and answers, targeting intermediate reasoning steps.
Thresholding Mechanism: The hard threshold at $\tau = 0.25$ prevents partial, low-quality, or spurious evidence from influencing the main score, focusing evaluation on robust retrieval.

Compared to prior metrics such as the binary FEVER score (which focused on evidence-sentence recall from a closed corpus), AVeriTeC's soft matching and gating mechanism are specifically adapted for open-web information retrieval and generative modeling (Malon, 2024).

4. Interpretation of Typical Score Ranges

Observed AVeriTeC scores across baseline and advanced systems serve as empirical anchors for interpretation:

Baseline systems: 0.11 (official FEVER-24 baseline, knowledge-store retrieval) (Schlichtkrull et al., 2024, Singhal et al., 2024)
Top leaderboard systems: 0.578 (HerO) (Yoon et al., 2024), 0.63 (TUDA_MAI) (Schlichtkrull et al., 2024)
Systems combining strong retrieval and reasoning: 0.477–0.510 (Papelo) (Malon, 2024)
Ablation studies: Gains in retrieval and QA generation translate directly to AVeriTeC improvement, but only insofar as both label prediction and evidence surpass the threshold.
Human agreement estimate: A human–human upper bound for Q+A matching yields METEOR ≈ 0.22, motivating the 0.25 threshold as a realistic competence cut-off (Schlichtkrull et al., 2023).
Interpretation: Scores above 0.3 generally indicate highly performant systems; scores below 0.10 signal systematic retrieval or reasoning deficits (Schlichtkrull et al., 2023, Singhal et al., 2024).

5. Core Variables and Thresholding

Symbol	Definition	Typical Value
$\hat{Y}_i$	System-predicted QA pairs (per claim)	up to $10$
$Y_i$	Gold QA pairs (per claim)	varies by claim
METEOR	String similarity for QA pair matching	$[0, 1]$
$u_f(\hat{Y}_i, Y_i)$	Hungarian-matched average METEOR for claim $i$	$0$–$1$
$\tau$	Evidence sufficiency threshold	$0.25$
$s_i$	Per-claim AVeriTeC indicator	$0$ or $1$
Overall Score	Mean of $s_i$ over test claims ($N$)	$[0, 1]$

Thresholding determines eligibility for label credit. If $u_f(\hat{Y}_i, Y_i) < \tau$, label correctness is ignored; if $u_f(\hat{Y}_i, Y_i) \geq \tau$, only correct labels are scored. This design creates a strong incentive for balanced, evidence-backed fact-checking.

6. Empirical Results and System Comparisons

Case studies from reported systems contextualize performance:

Baseline: 0.11 (knowledge store baseline; $u_f < \tau$ on most claims) (Schlichtkrull et al., 2024, Singhal et al., 2024)
RAG + ICL system: 0.33 (improved retrieval, question generation, and classification) (Singhal et al., 2024)
Leaderboard systems: 0.51/0.477 (Papelo) (Malon, 2024), 0.578 (HerO) (Yoon et al., 2024), 0.63 (TUDA_MAI) (Schlichtkrull et al., 2024)
Oracle upper bound (gold evidence): 0.49 (Schlichtkrull et al., 2023)

Table: Select AVeriTeC Scores from Recent Systems

System	AVeriTeC Score	Reference
Baseline	0.11	(Schlichtkrull et al., 2024)
RAG + ICL	0.33	(Singhal et al., 2024)
Papelo	0.51 / 0.477	(Malon, 2024)
HerO	0.578	(Yoon et al., 2024)
TUDA_MAI	0.63	(Schlichtkrull et al., 2024)

7. Extensions and Limitations

The AVeriTeC scoring approach has inspired related evaluation schemes in multimodal fact-checking (e.g., AVerImaTeC score), which generalize the conditional accuracy mechanism to evidence judged by multimodal similarity at a tunable threshold (Cao et al., 11 Feb 2026). Core limitations observed:

Threshold Sensitivity: The $\tau$ hyperparameter can significantly affect outcomes, with systems clustered just below or above receiving sharply different scores (Schlichtkrull et al., 2024, Cao et al., 11 Feb 2026).
Class Imbalance: Classes with low prevalence (e.g., “Not Enough Evidence,” “Conflicting”) lead to unstable system-level performance and under-represented abilities (Schlichtkrull et al., 2024, Cao et al., 11 Feb 2026).
Automatic Evaluation Coarseness: METEOR-based similarity, while robust to paraphrasing, may not fully capture fine-grained faithfulness, particularly in adversarial retrieval or when evidence is distributed over many low-similarity fragments (Malon, 2024).
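The threshold-sensitivity limitation can be illustrated with a short sweep over $\tau$. This is a sketch on hypothetical data: the `(u_f, label_correct)` records below are invented so that evidence scores cluster around 0.25, and `score_at` is an illustrative helper, not part of any official scorer.

```python
def score_at(records, tau):
    """AVeriTeC-style score for (u_f, label_correct) records at threshold tau."""
    return sum(1 for u, ok in records if ok and u >= tau) / len(records)


# Hypothetical claims: evidence scores clustered around 0.25, labels mostly right.
records = [
    (0.22, True), (0.24, True), (0.26, True), (0.27, True),
    (0.31, True), (0.18, False), (0.45, True), (0.23, True),
]

for tau in (0.20, 0.25, 0.30):
    print(f"tau={tau:.2f}  score={score_at(records, tau):.3f}")
# tau=0.20  score=0.875
# tau=0.25  score=0.500
# tau=0.30  score=0.250
```

On this toy data, moving $\tau$ by 0.05 in either direction changes the score substantially even though neither the evidence nor the labels change, which is the effect described under Threshold Sensitivity.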

References

M. Schlichtkrull et al. "AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web" (Schlichtkrull et al., 2023)
M. Schlichtkrull et al. "The Automated Verification of Textual Claims (AVeriTeC) Shared Task" (Schlichtkrull et al., 2024)
Papelo Team. "Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024" (Malon, 2024)
Ronit Singhal et al. "Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs" (Singhal et al., 2024)
Cao et al. "The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task" (Cao et al., 11 Feb 2026)
HerO Team/SSU. "HerO at AVeriTeC: The Herd of Open LLMs for Verifying Real-World Claims" (Yoon et al., 2024)