AVeriTeC: Automated Verification of Textual Claims

Updated 20 March 2026

AVeriTeC is an automated framework for verifying textual claims by mapping them to veracity labels and supporting evidence using advanced retrieval and inference methods.
The system employs a modular pipeline with query expansion, dense retrieval, and LLM-driven summarization to achieve notable gains in accuracy and runtime efficiency.
It addresses challenges such as evidence recall, temporal validity, and multi-hop reasoning, driving innovations in interpretability and practical fact verification deployment.

The Automated Verification of Textual Claims (AVeriTeC) encompasses the systems, datasets, and methodologies developed to support automatic fact verification of free-form, real-world claims using textual evidence from the open web or curated document stores. This field unites advances in information retrieval, natural language inference, LLMs, and evidence attribution, aiming to match or augment the journalistic workflow of professional fact-checkers. Characteristic challenges include sourcing context-independent claims, reliably retrieving temporally valid and sufficient evidence, and producing fine-grained, justified veracity judgments under strict efficiency and explainability constraints.

1. Task Definition, Datasets, and Evaluation Protocols

AVeriTeC formalizes automated claim verification as a mapping from a given claim $c$ (often context-independent and normalized by professional annotators) to both a veracity label $y \in Y$ (e.g., SUPPORTED, REFUTED, NOT_ENOUGH_EVIDENCE, CONFLICTING) and a supporting set of evidence items $E = \{(q_i, a_i)\}$, with $q_i$ typically being a sub-question derived from $c$ and $a_i$ a supported textual answer grounded in the retrieval corpus (Schlichtkrull et al., 2023, Schlichtkrull et al., 2024).
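As a minimal sketch of this input/output contract, the mapping from $c$ to $(y, E)$ can be written as a small set of Python types. The class and field names below (`Verdict`, `EvidenceItem`, `VerificationResult`, `source_url`) are illustrative assumptions rather than names from any AVeriTeC codebase, and the example claim is invented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verdict(Enum):
    """Veracity labels y in Y, as listed in the task definition."""
    SUPPORTED = "SUPPORTED"
    REFUTED = "REFUTED"
    NOT_ENOUGH_EVIDENCE = "NOT_ENOUGH_EVIDENCE"
    CONFLICTING = "CONFLICTING"


@dataclass(frozen=True)
class EvidenceItem:
    """One (q_i, a_i) pair; source_url is a hypothetical provenance field."""
    question: str
    answer: str
    source_url: str = ""


@dataclass
class VerificationResult:
    """System output for one claim c: a label y and an evidence set E."""
    claim: str
    verdict: Verdict
    evidence: List[EvidenceItem] = field(default_factory=list)


# Invented example, for illustration only.
result = VerificationResult(
    claim="The bridge opened in 1932.",
    verdict=Verdict.REFUTED,
    evidence=[
        EvidenceItem(
            question="When did the bridge open?",
            answer="Records cited by the source give an opening year of 1936.",
        )
    ],
)
```

Keeping evidence as explicit question-answer pairs, rather than free text, is what lets the evaluation described below compare system evidence against gold reference pairs.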

Datasets such as AVeriTeC (Schlichtkrull et al., 2023), FEVER (Thorne et al., 2018), MultiFC (Augenstein et al., 2019), and WiCE (Kamoi et al., 2023) provide annotated claims for training and evaluation. AVeriTeC in particular pairs real-world claims with multi-step questions, temporally validated evidence, and justifications. Its claims are broadly categorized (e.g., event/property, numerical, causal, quote, or position statement) and split by publication time to avoid temporal leakage.
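The temporal constraints can be illustrated with a short sketch. It assumes each claim and document is a dictionary carrying a `published` date (a hypothetical schema), and it treats "temporally valid" evidence as evidence published strictly before the claim, which is an assumption about the exact rule.

```python
from datetime import date
from typing import Dict, List, Tuple

Record = Dict[str, object]


def temporal_split(claims: List[Record], cutoff: date) -> Tuple[List[Record], List[Record]]:
    """Split claims so that every test claim is dated on or after the cutoff."""
    train = [c for c in claims if c["published"] < cutoff]
    test = [c for c in claims if c["published"] >= cutoff]
    return train, test


def filter_evidence(documents: List[Record], claim_date: date) -> List[Record]:
    """Keep documents published strictly before the claim.

    Assumption: this strict-before rule stands in for whatever temporal
    validity check a given dataset or system actually applies.
    """
    return [d for d in documents if d["published"] < claim_date]


claims = [
    {"id": "a", "published": date(2021, 5, 1)},
    {"id": "b", "published": date(2022, 3, 1)},
]
train, test = temporal_split(claims, cutoff=date(2022, 1, 1))

documents = [
    {"url": "x", "published": date(2021, 1, 1)},
    {"url": "y", "published": date(2023, 1, 1)},
]
valid = filter_evidence(documents, claim_date=date(2022, 3, 1))
```

Splitting on a single cutoff date means no test-time claim predates a training claim, which is the leakage the split is meant to prevent.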

Evaluation relies on metrics that jointly assess evidence and veracity quality. The primary AVeriTeC score (Schlichtkrull et al., 2024) checks, for each claim, whether the submitted system provides (i) a correct verdict and (ii) retrieved evidence whose QA pairs reach a METEOR-based similarity threshold ($u_f \geq 0.25$) against the gold reference pairs, where predicted and gold pairs are aligned one-to-one via Hungarian matching. Only systems that deliver both a correct label and high-quality, focused evidence are therefore rewarded.
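The scoring logic can be sketched as follows, under several loudly stated assumptions. A token-level F1 (`token_f1`) stands in for METEOR; the optimal one-to-one assignment is found by brute force over permutations, which yields the same optimum as the Hungarian algorithm on small inputs but is not that algorithm; and the matched similarity is normalized by the number of gold pairs, which is one reading of $u_f$. Each evidence item is represented as a single string containing the question followed by the answer.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1; a simple stand-in for METEOR (assumption)."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def best_assignment_total(scores: List[List[float]]) -> float:
    """Maximum total similarity over one-to-one assignments (brute force)."""
    n_rows, n_cols = len(scores), len(scores[0])
    if n_rows > n_cols:  # transpose so rows are the shorter side
        scores = [list(col) for col in zip(*scores)]
        n_rows, n_cols = n_cols, n_rows
    return max(
        sum(scores[r][c] for r, c in enumerate(cols))
        for cols in permutations(range(n_cols), n_rows)
    )


def evidence_score(predicted: List[str], gold: List[str],
                   sim: Callable[[str, str], float] = token_f1) -> float:
    """Matched similarity normalized by the number of gold items (assumption)."""
    if not predicted or not gold:
        return 0.0
    scores = [[sim(p, g) for g in gold] for p in predicted]
    return best_assignment_total(scores) / len(gold)


def claim_score(pred_label: str, gold_label: str,
                predicted: List[str], gold: List[str],
                threshold: float = 0.25) -> float:
    """1.0 only if the label is correct AND evidence clears the threshold."""
    good_evidence = evidence_score(predicted, gold) >= threshold
    return 1.0 if (pred_label == gold_label and good_evidence) else 0.0


gold_ev = ["what year was the bridge opened 1932"]
pred_ev = ["what year was the bridge opened 1932",
           "who built the bridge a local firm"]
```

With these inputs, `claim_score("REFUTED", "REFUTED", pred_ev, gold_ev)` is credited, whereas the same evidence with a wrong label, or a correct label with unrelated evidence, scores zero. Brute force is exponential in the number of pairs, so a real implementation would use a polynomial-time assignment solver.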

2. Core System Architecture and Methodological Advances

AVeriTeC systems typically instantiate a multi-stage pipeline with the following canonical components:

Stage	Common Methods / Models	Representative Systems
Query Expansion	LLM-generated HyDE-FC prompts, QG	HerO 2 (Yoon et al., 15 Jul 2025), HerO (Yoon et al., 2024)
Evidence Retrieval	Dense (e.g., gte-base-en) and BM25; MMR	AIC CTU (Ullrich et al., 2024), TUDA_MAI (Schlichtkrull et al., 2024)
Document Summarization	LLM summarization for paragraph fusion	HerO 2 (Yoon et al., 15 Jul 2025)
Question Generation	Prompted/fine-tuned LLMs	HerO 2, HerO, AIC CTU, VILLAIN (Jung et al., 4 Feb 2026)
Answer Reformulation	LLM-based generation for evidence form	HerO 2
Veracity Prediction	Fine-tuned/flavored LLMs, BERT, NLI	HerO 2, HerO, AIC CTU, AMREx (Jayaweera et al., 2024)

A distinctive feature of recent systems is modularization, with each module (e.g., retrieval, summarization, question-gen) operating on minimal I/O to optimize both efficiency and interpretability (Yoon et al., 15 Jul 2025). State-of-the-art variants rely heavily on prompt-tuned or fine-tuned open or proprietary LLMs, often quantized (e.g., AWQ at 4-bit for Qwen3 32B) to fit within single-GPU VRAM budgets while retaining near-baseline accuracy (Yoon et al., 15 Jul 2025).
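The stage ordering in the table above can be sketched as a chain of plain callables, each taking and returning only strings or lists of strings. This is an illustrative skeleton, not the HerO 2 implementation: the function `run_pipeline`, its parameter names, and every `stub_*` function are hypothetical placeholders for the retrievers and LLM calls a real system would plug in.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]


def run_pipeline(
    claim: str,
    expand: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    summarize: Callable[[str, str], str],
    ask: Callable[[str, str], str],
    reformulate: Callable[[str, str, str], str],
    classify: Callable[[str, List[QAPair]], str],
) -> Tuple[str, List[QAPair]]:
    """Run the six stages in order and return (verdict, evidence)."""
    queries = expand(claim)                    # query expansion
    documents = retrieve(queries)              # evidence retrieval
    summaries = [summarize(claim, d) for d in documents]  # summarization
    evidence: List[QAPair] = []
    for summary in summaries:
        question = ask(claim, summary)                   # question generation
        answer = reformulate(claim, question, summary)   # answer reformulation
        evidence.append((question, answer))
    return classify(claim, evidence), evidence           # veracity prediction


# Placeholder stages (hypothetical); real systems would call models here.
def stub_expand(claim: str) -> List[str]:
    return [claim, claim + " fact check"]

def stub_retrieve(queries: List[str]) -> List[str]:
    return ["doc about " + q for q in queries]

def stub_summarize(claim: str, doc: str) -> str:
    return doc.upper()

def stub_ask(claim: str, summary: str) -> str:
    return "What does the source say about: " + claim

def stub_reformulate(claim: str, question: str, summary: str) -> str:
    return summary

def stub_classify(claim: str, evidence: List[QAPair]) -> str:
    return "NOT_ENOUGH_EVIDENCE"  # a stub never commits to a verdict


verdict, evidence = run_pipeline(
    "The bridge opened in 1932.",
    stub_expand, stub_retrieve, stub_summarize,
    stub_ask, stub_reformulate, stub_classify,
)
```

Because each stage sees only small, typed inputs, any one of them can be swapped or ablated without touching the others, which is the efficiency and interpretability argument made for modularization.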

3. Critical Techniques: Summarization, Reformulation, and Quantization

Prominent systems systematically incorporate LLM-driven document summarization and answer reformulation between retrieval and downstream processing. Summarization reduces retrieved document fragments to self-contained paragraphs, boosting evidence recall (e.g., ≈15 points Ev2R improvement) and minimizing irrelevant context (Yoon et al., 15 Jul 2025). Answer reformulation (claim-conditioned) further sharpens evidence, contributing measurable gains (≈4 points Ev2R) with negligible runtime cost.

For practical deployment, post-training quantization (notably AWQ) reduces 32B-parameter models to 4-bit weights, enabling inference on a single A10G GPU (23 GB VRAM). This incurs only a marginal accuracy loss ($0.692$ ACC quantized vs. unquantized) and makes high-capacity inference feasible in real-world settings (Yoon et al., 15 Jul 2025).
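A back-of-the-envelope calculation shows why 4-bit weights matter for this budget. The sketch below counts weight storage only, using decimal gigabytes; it deliberately ignores quantization metadata (such as per-group scales), activations, and the KV cache, so actual usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Weight storage in decimal GB; excludes scales, activations, KV cache."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 32_000_000_000   # 32B-parameter model
VRAM_BUDGET_GB = 23         # single-GPU budget quoted in the text

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit weights

fits_fp16 = fp16_gb <= VRAM_BUDGET_GB
fits_int4 = int4_gb <= VRAM_BUDGET_GB
```

Under these assumptions the 16-bit weights alone need about 64 GB, well over the budget, while the 4-bit weights need about 16 GB, leaving headroom for the overheads the estimate omits.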

4. System Performance and Leaderboard Analysis

Recent AVeriTeC shared tasks have seen steadily increasing upper bounds. The winning system TUDA_MAI achieved a 0.63 AVeriTeC score (Schlichtkrull et al., 2024), while HerO 2 reports 0.271 (Yoon et al., 15 Jul 2025), with AIC CTU and "yellow_flash" near parity but at markedly higher latency (> 50 s/claim for CTU); the two figures are cited from different reports and are not necessarily directly comparable. HerO 2 is notable for a sub-30 s mean runtime per claim, making it the most efficient among top-tier systems, largely due to its sequenced summarization and quantization design.

Pipeline augmentations in HerO 2 were included only when marginal verification or evidence-quality gains exceeded their computational cost. The system attains nearly top leaderboard rank while halving runtime against the best competitor (Yoon et al., 15 Jul 2025). Performance gaps between systems often trace to retrieval architecture, summarization capabilities, and choice of veracity classifier backbone.

5. Taxonomy of Justification and Evidence Attribution Methods

Justification generation is central to end-user trust and practical utility. Approaches span:

Separated and joint veracity-justification architectures.
Chain-of-thought multi-hop QA, LLM-based summarization, and knowledge-graph or AMR mapping (Eldifrawi et al., 2024, Jayaweera et al., 2024).
Output modalities such as natural-language rationales, token highlights, and structured (e.g., SPO triple) proofs.

Recent surveys (Eldifrawi et al., 2024) emphasize the explainability spectrum: self-explainable (e.g., multi-step CoT, agent debate in VILLAIN (Jung et al., 4 Feb 2026)) versus non-self-explainable (end-to-end abstractive summary). AMREx (Jayaweera et al., 2024) demonstrates partial explainability via AMR node alignment, which can be synthesized into faithful natural-language explanations by constraining LLM generations with explicit graph mappings.

6. Efficiency, Bottlenecks, and Open Challenges

Evidence recall remains the critical bottleneck; systems that optimize upstream retrieval and summarization exhibit the largest verification score improvements (Yoon et al., 15 Jul 2025, Schlichtkrull et al., 2024). Overly aggressive retrieval, however, can dilute downstream NLI performance, necessitating careful balancing of relevance and diversity (e.g., MMR reranking in AIC CTU (Ullrich et al., 2024)).
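The relevance/diversity trade-off behind MMR (maximal marginal relevance) reranking can be sketched in a few lines. This is a generic textbook-style MMR over toy 2-D vectors, not the AIC CTU implementation; the function names, the trade-off parameter `lam`, and the vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query: Sequence[float], docs: List[Sequence[float]],
        k: int, lam: float = 0.5) -> List[int]:
    """Greedily select k document indices.

    Each step picks the candidate maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    """
    selected: List[int] = []
    candidates = list(range(len(docs)))

    def score(i: int) -> float:
        relevance = cosine(query, docs[i])
        redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # doc 1 nearly duplicates doc 0

pure_relevance = mmr(query_vec, doc_vecs, k=2, lam=1.0)
diverse = mmr(query_vec, doc_vecs, k=2, lam=0.3)
```

With `lam=1.0` the near-duplicate document is selected second, while a lower `lam` prefers the less similar document, which is how reranking of this kind limits redundant context passed to the downstream classifier.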

Difficulties persist in handling numerical/categorical reasoning, coreference ambiguities, temporal entailment, and multi-evidence aggregation, as documented in classic and contemporary error analyses (Hanselowski et al., 2018, Yoon et al., 15 Jul 2025). Fine-grained error localization (e.g., unsupported-span detection (Kamoi et al., 2023)) and alignment of justification desiderata (completeness, faithfulness, coherence) remain largely unsolved.

Pragmatic constraints continue to shape research prioritization: the cost and compute of LLM-based, multi-hop, or debate architectures; model explainability; and dynamic reference drift in open-web evidence (Eldifrawi et al., 2024).

7. Future Directions and Impact

The trajectory of AVeriTeC research is toward flexible, high-recall, interpretable, and resource-efficient fact verification. Priority directions include:

Unified architectures integrating live web search and robust knowledge stores (Schlichtkrull et al., 2024).
Modular systems refining evidence through lightweight summarization and reformulation (Yoon et al., 15 Jul 2025).
Advanced retrieval strategies (multi-hop, dense hybrids) and cross-document and cross-modal reasoning (as in VILLAIN (Jung et al., 4 Feb 2026)).
Dynamic, human-aligned evaluation metrics to replace or augment METEOR-based QA matching.
End-to-end explainable pipelines linking claim decomposition, minimal evidence attribution, and rigorous veracity classification (Kamoi et al., 2023, Dammu et al., 2024).
Model compression and quantization for widespread, real-time deployment.

AVeriTeC has catalyzed a rapid evolution in real-world fact verification systems, bridging the gap between academic datasets and the operational standards of professional journalism and open-society information integrity (Schlichtkrull et al., 2023, Schlichtkrull et al., 2024).