
AVeriTeC Score: Automated Fact Verification Metric

Updated 20 March 2026
  • AVeriTeC Score is a metric for evaluating automated claim verification systems by jointly assessing QA evidence retrieval and label correctness.
  • It employs METEOR-based one-to-one matching with the Hungarian algorithm to compare predicted and gold QA pairs, enforcing strict evidence sufficiency via a threshold.
  • Empirical results show diverse score ranges across systems, motivating balanced retrieval and reasoning to surpass a 0.25 evidence alignment threshold.

AVeriTeC Score

The AVeriTeC score is a metric for evaluating automated claim verification systems, introduced with the AVeriTeC dataset and formalized in the AVeriTeC shared tasks (Schlichtkrull et al., 2023, Schlichtkrull et al., 2024, Malon, 2024, Singhal et al., 2024). It measures a system’s end-to-end ability to (1) retrieve sufficient evidence in the form of question–answer (QA) pairs that align semantically with annotated gold evidence, and (2) correctly predict the claim’s veracity, with credit awarded only if both evidence and label meet strict criteria. The AVeriTeC score has become the principal evaluation metric in open-web fact-checking tasks, directly incentivizing systems to demonstrate robust retrieval and faithful reasoning.

1. Formal Definition and Notation

For each claim $i$, a system produces:

  • Up to 10 predicted QA pairs $\hat{Y}_i$
  • A predicted veracity label $\hat{v}_i \in \{\text{Supported}, \text{Refuted}, \text{NEE}, \text{CE}\}$, where NEE stands for Not Enough Evidence and CE for Conflicting Evidence/Cherrypicking

Let $Y_i$ denote the gold set of QA pairs for claim $i$. The core matching mechanism compares predicted and gold QA pairs using METEOR string similarity, with the Hungarian algorithm selecting the best one-to-one assignment:

$$u_f(\hat{Y}_i, Y_i) = \frac{1}{|Y_i|} \max_X \sum_{\hat{y} \in \hat{Y}_i}\sum_{y \in Y_i} \mathrm{METEOR}(\hat{y}, y)\, X(\hat{y}, y)$$

where $X$ is the assignment matrix, constrained so that no predicted or gold pair is matched more than once.
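The matching score can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the official scorer: each QA pair is represented as a single string, `token_f1` is a hypothetical stand-in for METEOR, and the optimal one-to-one assignment is found by brute-force enumeration rather than the Hungarian algorithm (both return the same optimum, but enumeration is only practical for small sets such as 10 predicted and a handful of gold pairs).

```python
from collections import Counter
from itertools import permutations


def token_f1(pred: str, gold: str) -> float:
    """Stand-in similarity (NOT METEOR): F1 over lowercased whitespace tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def evidence_score(pred_pairs, gold_pairs, sim=token_f1):
    """u_f: best one-to-one matching total, divided by the number of gold pairs."""
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")
    if not pred_pairs:
        return 0.0
    scores = [[sim(p, g) for g in gold_pairs] for p in pred_pairs]
    n_pred, n_gold = len(pred_pairs), len(gold_pairs)
    best = 0.0
    # Similarities are non-negative, so a matching of size min(n_pred, n_gold)
    # is always optimal; enumerate every such matching.
    if n_pred >= n_gold:
        # chosen[g] is the predicted pair assigned to gold pair g
        for chosen in permutations(range(n_pred), n_gold):
            best = max(best, sum(scores[p][g] for g, p in enumerate(chosen)))
    else:
        # chosen[p] is the gold pair assigned to predicted pair p
        for chosen in permutations(range(n_gold), n_pred):
            best = max(best, sum(scores[p][g] for p, g in enumerate(chosen)))
    return best / n_gold
```

Because the sum is divided by $|Y_i|$ rather than by the number of matches, a system that covers only one of two gold pairs perfectly still scores at most 0.5 on that claim.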

A threshold $\tau$ (default 0.25) is applied to the matching score to define evidence sufficiency:

$$s_i = \mathbf{1}[u_f(\hat{Y}_i, Y_i) \geq \tau] \cdot \mathbf{1}[\hat{v}_i = v_i^*]$$

with $v_i^*$ the ground-truth label.

The system-level AVeriTeC score is then the mean over all $N$ claims:

$$\mathrm{AVeriTeC} = \frac{1}{N}\sum_{i=1}^N s_i$$

This is typically reported as a percentage.
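The gating and averaging steps are simple enough to sketch directly. The function names below (`claim_indicator`, `averitec_score`) are illustrative, not taken from any official scorer, and the evidence scores are assumed to have been computed already.

```python
def claim_indicator(u_f: float, pred_label: str, gold_label: str, tau: float = 0.25) -> int:
    """s_i: 1 only if the evidence score reaches tau AND the label is correct."""
    return int(u_f >= tau and pred_label == gold_label)


def averitec_score(evidence_scores, pred_labels, gold_labels, tau: float = 0.25) -> float:
    """Mean of the per-claim indicators over all claims."""
    if not (len(evidence_scores) == len(pred_labels) == len(gold_labels)):
        raise ValueError("inputs must have the same length")
    if not gold_labels:
        raise ValueError("at least one claim is required")
    indicators = [
        claim_indicator(u, pred, gold, tau)
        for u, pred, gold in zip(evidence_scores, pred_labels, gold_labels)
    ]
    return sum(indicators) / len(indicators)


# Hypothetical values for four claims.
u = [0.40, 0.10, 0.30, 0.25]
pred = ["Supported", "Refuted", "NEE", "Refuted"]
gold = ["Supported", "Refuted", "Refuted", "Refuted"]
print(averitec_score(u, pred, gold))  # 0.5
```

In this example the second claim fails on evidence despite a correct label, the third fails on the label despite sufficient evidence, and the fourth passes because the comparison with $\tau$ is inclusive.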

2. Evaluation Pipeline and Stepwise Calculation

The following workflow encapsulates the AVeriTeC scoring procedure:

  1. QA Generation: For each claim, a system generates up to $k$ QA pairs ($k = 10$ in the shared task guidelines).
  2. Label Prediction: A veracity label is predicted using the retrieved evidence.
  3. Evidence Scoring: Gold and predicted QA pairs are compared via METEOR, and the best one-to-one matching is established using the Hungarian algorithm to compute the mean evidence alignment score.
  4. Thresholding: If the evidence alignment $u_f$ meets or exceeds the threshold $\tau$, the claim can receive credit for a correct verdict; otherwise, it receives none.
  5. Aggregation: The final AVeriTeC score is the average across the test set:
    • Only claims with both sufficient evidence ($u_f \geq \tau$) and a correct label count as successful.
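
As a worked illustration of steps 3 and 4, consider a single claim with two gold QA pairs $g_1, g_2$ and three predicted pairs $p_1, p_2, p_3$. The METEOR values below are invented for the example, not taken from any system.

$$\begin{array}{c|cc} & g_1 & g_2 \\ \hline p_1 & 0.40 & 0.38 \\ p_2 & 0.35 & 0.10 \\ p_3 & 0.05 & 0.05 \end{array}$$

The best one-to-one assignment matches $p_2$ to $g_1$ and $p_1$ to $g_2$, so

$$u_f = \frac{0.35 + 0.38}{2} = 0.365 \geq \tau = 0.25.$$

The evidence is therefore sufficient, and the claim scores 1 if the predicted label is correct and 0 otherwise. Note that $p_1$ is the best match for both gold pairs, but the one-to-one constraint prevents it from being counted twice: reusing it would give $(0.40 + 0.38)/2 = 0.39$, which is not a valid assignment.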

3. Rationale and Metric Design Choices

The AVeriTeC score enforces strict joint evaluation criteria to address core limitations in previous fact-checking metrics:

  • Evidence Precision: Systems cannot achieve high scores by label guessing; substantial and accurate evidence (as measured against human annotations) is prerequisite.
  • Robustness to Paraphrasing and Reasoning: The use of METEOR promotes tolerance to paraphrases and supports systems that generate equivalent though not verbatim evidence. The QA formalism tests systems’ ability to decompose claims into subquestions and answers, targeting intermediate reasoning steps.
  • Thresholding Mechanism: The hard threshold at $\tau = 0.25$ prevents partial, low-quality, or spurious evidence from influencing the main score, focusing evaluation on robust retrieval.

Compared to prior metrics such as the binary FEVER score (which focused on evidence-sentence recall from a closed corpus), AVeriTeC's soft matching and gating mechanism are specifically adapted for open-web information retrieval and generative modeling (Malon, 2024).

4. Interpretation of Typical Score Ranges

Observed AVeriTeC scores across baseline and advanced systems serve as empirical anchors for interpretation:

  • Baseline systems: 0.11 (official FEVER-24 baseline, knowledge-store retrieval) (Schlichtkrull et al., 2024, Singhal et al., 2024)
  • Top leaderboard systems: 0.578 (HerO) (Yoon et al., 2024), 0.63 (TUDA_MAI) (Schlichtkrull et al., 2024)
  • Systems combining strong retrieval and reasoning: 0.477–0.510 (Papelo) (Malon, 2024)
  • Ablation studies: Gains in retrieval and QA generation translate directly to AVeriTeC improvement, but only insofar as both label prediction and evidence surpass the threshold.
  • Human agreement estimate: A human–human upper bound for Q+A matching yields METEOR ≈ 0.22, motivating the 0.25 threshold as a realistic competence cut-off (Schlichtkrull et al., 2023).
  • Interpretation: Scores above 0.3 generally indicate highly performant systems; scores below 0.10 signal systematic retrieval or reasoning deficits (Schlichtkrull et al., 2023, Singhal et al., 2024).

5. Core Variables and Thresholding

| Symbol | Definition | Typical Value |
| --- | --- | --- |
| $\hat{Y}_i$ | System-predicted QA pairs (per claim) | $\leq 10$ |
| $Y_i$ | Gold QA pairs (per claim) | $1$–$5$ |
| METEOR | String similarity for QA pair matching | $[0, 1]$ |
| $u_i$ | Hungarian-matched average METEOR for claim $i$, i.e. $u_f(\hat{Y}_i, Y_i)$ | $\sim 0.2$–$0.35$ |
| $\tau$ | Evidence sufficiency threshold | $0.25$ |
| $s_i$ | Per-claim AVeriTeC indicator | $0$ or $1$ |
| Overall Score | Mean of $s_i$ over test claims ($N$) | $[0, 1]$ |

Thresholding determines eligibility for label credit. If $u_i < \tau$, the claim scores 0 regardless of whether the label is correct; if $u_i \geq \tau$, the claim scores 1 only when the predicted label is correct. This design creates a strong incentive for balanced, evidence-backed fact-checking.
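
To see how the gate behaves near the cut-off, the sketch below recomputes the score at several thresholds. The evidence scores and correctness flags are made up for illustration, and `score_at` is a hypothetical helper rather than part of any released evaluation code.

```python
def score_at(tau, evidence_scores, label_correct):
    """AVeriTeC-style score for a given threshold tau."""
    if len(evidence_scores) != len(label_correct) or not evidence_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    passed = sum(1 for u, ok in zip(evidence_scores, label_correct) if ok and u >= tau)
    return passed / len(evidence_scores)


# Hypothetical per-claim evidence scores clustered around 0.25.
u = [0.24, 0.26, 0.31, 0.18, 0.27]
correct = [True, True, True, False, True]

for tau in (0.20, 0.25, 0.30):
    print(tau, score_at(tau, u, correct))
# 0.2 0.8
# 0.25 0.6
# 0.3 0.2
```

With evidence scores clustered around the threshold, small changes in $\tau$ move the overall score substantially even though label accuracy is unchanged.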

6. Empirical Results and System Comparisons

Case studies from reported systems contextualize performance:

Table: Select AVeriTeC Scores from Recent Systems

| System | AVeriTeC Score | Reference |
| --- | --- | --- |
| Baseline | 0.11 | (Schlichtkrull et al., 2024) |
| RAG + ICL | 0.33 | (Singhal et al., 2024) |
| Papelo | 0.51 / 0.477 | (Malon, 2024) |
| HerO | 0.578 | (Yoon et al., 2024) |
| TUDA_MAI | 0.63 | (Schlichtkrull et al., 2024) |

7. Extensions and Limitations

The AVeriTeC scoring approach has inspired related evaluation schemes in multimodal fact-checking (e.g., AVerImaTeC score), which generalize the conditional accuracy mechanism to evidence judged by multimodal similarity at a tunable threshold (Cao et al., 11 Feb 2026). Core limitations observed:

  • Threshold Sensitivity: The $\tau$ hyperparameter can significantly affect outcomes, with claims whose evidence scores fall just below or just above the threshold receiving sharply different credit (Schlichtkrull et al., 2024, Cao et al., 11 Feb 2026).
  • Class Imbalance: Classes with low prevalence (e.g., “Not Enough Evidence,” “Conflicting”) lead to unstable system-level performance and under-represented abilities (Schlichtkrull et al., 2024, Cao et al., 11 Feb 2026).
  • Automatic Evaluation Coarseness: METEOR-based similarity, while robust to paraphrasing, may not fully capture fine-grained faithfulness, particularly in adversarial retrieval or when evidence is distributed over many low-similarity fragments (Malon, 2024).
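
One way to make the class-imbalance issue visible is to break the score down by gold label and report the claim count alongside each value. The sketch below is an illustrative diagnostic, not part of the official metric; `per_label_scores` and the sample data are assumptions.

```python
from collections import defaultdict


def per_label_scores(evidence_scores, pred_labels, gold_labels, tau=0.25):
    """Return {gold_label: (mean indicator, number of claims)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for u, pred, gold in zip(evidence_scores, pred_labels, gold_labels):
        totals[gold] += 1
        # Same gate as the overall score: sufficient evidence AND correct label.
        hits[gold] += int(u >= tau and pred == gold)
    return {label: (hits[label] / totals[label], totals[label]) for label in totals}


# Hypothetical data: the rare class is represented by a single claim.
u = [0.40, 0.30, 0.10, 0.35]
pred = ["Supported", "Refuted", "Refuted", "NEE"]
gold = ["Supported", "Refuted", "Refuted", "NEE"]
print(per_label_scores(u, pred, gold))
```

A per-label value computed from one or two claims swings between 0 and 1 on a single prediction, which is why the count is returned with it.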

References

  • M. Schlichtkrull et al. "AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web" (Schlichtkrull et al., 2023)
  • M. Schlichtkrull et al. "The Automated Verification of Textual Claims (AVeriTeC) Shared Task" (Schlichtkrull et al., 2024)
  • Papelo Team. "Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024" (Malon, 2024)
  • Ronit Singhal et al. "Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs" (Singhal et al., 2024)
  • Cao et al. "The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task" (Cao et al., 11 Feb 2026)
  • HerO Team/SSU. "HerO at AVeriTeC: The Herd of Open LLMs for Verifying Real-World Claims" (Yoon et al., 2024)
