Multimodal Claim Verification
- Multimodal claim verification is the computational task of assessing claim veracity using diverse evidence modalities such as text, images, tables, and videos.
- Benchmarks such as MuSciClaims and Factify 2 evaluate model performance using detailed label taxonomies and cross-modal evidence integration.
- State-of-the-art models utilize advanced vision-language architectures and multi-hop reasoning yet still fall short of human accuracy in nuanced evidence evaluation.
Multimodal claim verification is the computational task of determining the veracity of a natural-language claim with respect to evidence drawn from multiple modalities such as text, images, tables, charts, and, more recently, video or audio. Unlike text-only fact-checking, multimodal claim verification evaluates the support, contradiction, or irrelevance of evidence that may be partially or fully realized in non-textual forms. This task has become prominent due to the increasing proliferation of multimodal misinformation in scientific literature, news, and social media, as well as the recognized need for robust tools that reason over heterogeneous evidence sources.
1. Formal Task Definition and Taxonomy
Multimodal claim verification takes as input a claim $c$ and an associated evidence set $E = \{E_{\text{text}}, E_{\text{img}}, E_{\text{tab}}, \dots\}$ comprising text, images, tables, and potentially additional modalities. The aim is to assign a discrete label $y$, typically drawn from the set $\{\text{Support}, \text{Neutral}, \text{Contradict}\}$ or from finer-grained schemas such as the five-way Factify taxonomy (Support_Text, Support_Multimodal, Insufficient_Text, Insufficient_Multimodal, Refute).
The mapping function $f : (c, E) \rightarrow y$ is typically optimized via cross-entropy loss. Diverse instantiations exist: classification over paired (claim, figure, caption) triplets as in scientific literature (Lal et al., 5 Jun 2025), multi-hop reasoning across several evidence items (Wang et al., 14 Nov 2024), and multitask pipelines involving auxiliary tasks such as evidence retrieval and explanation generation (Yao et al., 2022).
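As a concrete illustration of this formulation, the following minimal sketch (assuming PyTorch and generic pretrained text/image encoders; all module and variable names are illustrative rather than taken from any cited system) encodes a claim and one image evidence item, fuses them, and trains with cross-entropy:

```python
import torch
import torch.nn as nn

class MultimodalClaimVerifier(nn.Module):
    """Toy verifier: encode claim text and visual evidence, fuse, classify."""

    def __init__(self, text_encoder, image_encoder, text_dim, image_dim, num_labels=3):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., a pretrained transformer
        self.image_encoder = image_encoder  # e.g., a CLIP/ViT image tower
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_labels),     # Support / Neutral / Contradict
        )

    def forward(self, claim_tokens, evidence_image):
        t = self.text_encoder(claim_tokens)      # (batch, text_dim)
        v = self.image_encoder(evidence_image)   # (batch, image_dim)
        return self.classifier(torch.cat([t, v], dim=-1))  # logits over labels

# Training objective: standard cross-entropy over the label set.
loss_fn = nn.CrossEntropyLoss()
# loss = loss_fn(model(claim_tokens, evidence_image), gold_labels)
```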
2. Benchmarks and Dataset Construction
Recent years have seen the emergence of several multimodal claim verification benchmarks, each differing in evidence composition, domain, linguistic diversity, and annotation protocol.
Representative Datasets
| Benchmark | Samples | Modalities | Labels | Domain |
|---|---|---|---|---|
| MuSciClaims | 918 | Figures (multi-panel images) + captions | Support, Neutral, Contradict | Life Sciences |
| SciVer | 3,000 | Text, figures, tables | Entailed, Refuted | CS Papers |
| MMCV | 15,569 | Text, images, tables | SUPPORT, REFUTE | Wikipedia |
| M4FC | 6,980 | Images, claim text, evidence, geolocation | True, False (+5 subtasks) | Real World |
| Factify 2 (Kishore et al., 7 Aug 2025) | 42,500 | Claim text, image, OCR/retrieved text | 5-way (see above) | Web/News |
| FACTIFY 3M | 3M | Text, images (original/generated), 5W QA | 5-way | Social Media |
| MIVA | 1,362 | Video, text, structured game metadata | TRUE, FALSE, NEUTRAL | Social Game |
Dataset construction strategies include:
- Automatic extraction of “supported” claims via reference/figure detection in paper PDFs (Lal et al., 5 Jun 2025), news mining (Chakraborty et al., 2023), or LLM-based claim generation (Wang et al., 14 Nov 2024);
- Manual or semi-automated perturbation of claims to synthesize contradiction or neutral cases (e.g., controlled editing of adjectives, relationships, or numeric values; a toy perturbation sketch follows this list);
- Diagnostic and provenance meta-annotation, such as fine-grained evidence highlighting, panel localization, and 5W QA (Lal et al., 5 Jun 2025, Chakraborty et al., 2023);
- Human validation and expert annotation for claim–evidence relationship and explanation (Yao et al., 2022, Wang et al., 18 Jun 2025).
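As a loose illustration of the numeric-perturbation strategy above, the following toy helper (a hypothetical function, not the pipeline of any cited dataset) rescales the first number in a claim to produce a likely-contradicting variant:

```python
import random
import re

def perturb_numeric_claim(claim: str, scale_range=(1.5, 3.0)) -> str:
    """Create a likely-contradicting variant of a claim by rescaling its first number.

    Toy heuristic only; real benchmarks pair such edits with human validation
    to confirm the perturbed claim is actually contradicted by the evidence.
    """
    match = re.search(r"\d+(\.\d+)?", claim)
    if match is None:
        return claim  # nothing to perturb
    value = float(match.group())
    perturbed = value * random.uniform(*scale_range)
    return claim[:match.start()] + f"{perturbed:.1f}" + claim[match.end():]

print(perturb_numeric_claim("Treatment reduced tumor volume by 42% after 14 days."))
```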
Balanced class design and adversarial instance injection (e.g., synthetic fake news or visual paraphrases) are employed to mitigate class bias and dataset-shortcut exploitation (Chakraborty et al., 2023).
3. Model Architectures and Learning Paradigms
Architectures for multimodal claim verification integrate vision-language modules, retrieval components, structured fusion, and (in some cases) reasoning augmentation.
End-to-End Vision-Language Models
- Encode claims and textual evidence (often with a pretrained transformer, e.g., RoBERTa, DeBERTa, SBERT).
- Encode images and figures with visual backbones (ResNet, Swin, ViT) or CLIP-like models.
- Fuse modalities via element-wise interactions (difference, product, concatenation) or graph-based mechanisms (Cao et al., 15 Jul 2024), optionally extending to global guided graph attention (KGF) to leverage entity/object knowledge.
- Classification head predicts the relationship label; contrastive objectives (InfoNCE) may be added for paired claim–evidence semantic alignment (Kishore et al., 7 Aug 2025). A minimal sketch of element-wise fusion and the contrastive objective follows this list.
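The sketch below (PyTorch; the fusion design and loss weighting are assumptions for illustration, not a specific published architecture) shows element-wise interaction fusion and an in-batch InfoNCE alignment term:

```python
import torch
import torch.nn.functional as F

def fuse(claim_emb: torch.Tensor, evidence_emb: torch.Tensor) -> torch.Tensor:
    """Element-wise interaction fusion: [u; v; |u - v|; u * v]."""
    return torch.cat(
        [claim_emb, evidence_emb,
         torch.abs(claim_emb - evidence_emb),
         claim_emb * evidence_emb],
        dim=-1,
    )

def info_nce(claim_emb: torch.Tensor, evidence_emb: torch.Tensor, temperature: float = 0.07):
    """In-batch InfoNCE: matched claim-evidence pairs are positives,
    all other pairings in the batch act as negatives."""
    c = F.normalize(claim_emb, dim=-1)
    e = F.normalize(evidence_emb, dim=-1)
    logits = c @ e.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(c.size(0), device=c.device)  # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# Illustrative total loss: relation classification plus a weighted alignment term.
# loss = F.cross_entropy(classifier(fuse(c_emb, e_emb)), labels) + 0.1 * info_nce(c_emb, e_emb)
```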
Multi-Hop and Multitask Settings
- Multi-hop models ingest sets of evidence items and utilize cross-modal attention, graph reasoning, or prompting scaffolds (chain-of-thought, self-ask, symbolic-guided) to perform multi-step aggregation (Wang et al., 14 Nov 2024); a generic prompting sketch follows this list.
- Pipeline approaches decompose the task into sequential stages of extraction, intent detection, evidence retrieval, claim verification, and context/question answering (Geng et al., 27 Oct 2025, Yao et al., 2022).
- Diagram-centric and chart-heavy tasks require models to parse tabular structures, align visual chart/table data with captions, and handle diagnostic visual reasoning (Ho et al., 13 Nov 2025, Wang et al., 18 Jun 2025).
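As an illustration of the kind of prompting scaffold referenced above, the following sketch builds a generic chain-of-thought verification prompt over multiple evidence items (the wording, field names, and label set are assumptions, not the prompts of any cited work; visual evidence is assumed to be pre-converted to text or handled separately by a vision-language model):

```python
def build_multihop_verification_prompt(claim: str, evidence_items: list[dict]) -> str:
    """Assemble a chain-of-thought style prompt over multiple evidence items.

    Each evidence item is a dict like {"modality": "table", "content": "..."}.
    """
    lines = [f"Claim: {claim}", "", "Evidence:"]
    for i, item in enumerate(evidence_items, start=1):
        lines.append(f"[E{i} | {item['modality']}] {item['content']}")
    lines += [
        "",
        "Reason step by step: state which evidence items are relevant,",
        "what each one implies about the claim, and how they combine.",
        "Then output exactly one label: SUPPORTED, REFUTED, or NOT ENOUGH INFO.",
    ]
    return "\n".join(lines)
```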
4. Evaluation Protocols and Quantitative Findings
Evaluation employs accuracy, per-class and macro F1, precision/recall, and sometimes ROC/AUC metrics, tailored to class balance and diagnostic subtasks.
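For reference, a minimal evaluation sketch using scikit-learn (label names and predictions are illustrative):

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

labels = ["Support", "Neutral", "Contradict"]
y_true = ["Support", "Contradict", "Neutral", "Support", "Contradict"]
y_pred = ["Support", "Neutral", "Neutral", "Support", "Contradict"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro-F1:", f1_score(y_true, y_pred, labels=labels, average="macro"))
print(classification_report(y_true, y_pred, labels=labels))  # per-class precision/recall/F1
```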
Observed Performance
- State-of-the-art macro-F1 remains significantly below human performance: the best models on scientific-figure claims still fall well short of the $0.94$ reached by domain-expert annotators (Lal et al., 5 Jun 2025, Wang et al., 18 Jun 2025).
- Multi-hop reasoning causes substantial degradation: claims requiring four or more hops produce a 20–27 percentage-point gap between the best models (F1 60–73%) and human annotators (F1 80–90%) (Wang et al., 14 Nov 2024).
- Evidence modality and format strongly affect results: models are far more robust to tables than charts, with table–chart macro-F1 gaps up to 23.3 points; humans are robust to such transpositions (Ho et al., 13 Nov 2025).
- Pipeline improvements: integrating intermediate tasks (like context extraction) and high-quality evidence retrieval can yield +17 F1 points over single-task verdict prediction (Geng et al., 27 Oct 2025).
Diagnostic tasks such as panel localization and basic visual understanding reveal that current models are weak at fine-grained evidence identification and fail to integrate visual and textual cues effectively (Lal et al., 5 Jun 2025).
5. Model Limitations and Failure Modes
The principal failure modes include:
- Inadequate evidence localization: inability to select relevant figure panels or regions (Lal et al., 5 Jun 2025).
- Cross-modal integration weaknesses: models often rely on a single modality (text or image), neglecting complementary visual or textual information (Lal et al., 5 Jun 2025, Ho et al., 13 Nov 2025).
- Multi-hop/reasoning breakdown: performance drops sharply as requisite evidence-item hops increase, with overconfidence and hallucination in complex inference chains (Wang et al., 14 Nov 2024).
- Visual semantic misinterpretation: errors in chart reading, object detection, or scene understanding; table data is easier, but charts confound even large models (Ho et al., 13 Nov 2025).
- Handling of nuanced contradictions: subtle epistemic or logical perturbations in claim text are often missed, and models demonstrate over-bias towards assigning “support” labels (Lal et al., 5 Jun 2025).
- Biases in data/shortcuts: models exploit word- or image-similarity artifacts, length heuristics, or publisher-domain correlations that do not reflect genuine evidentiary reasoning (Gao et al., 2021, Chakraborty et al., 2023).
Qualitative error analysis also documents failures in temporal reasoning, aggregation across evidence, and calibration (high model confidence despite low actual accuracy in difficult conditions) (Wang et al., 14 Nov 2024).
6. Current Directions and Prospective Advances
Research directions proposed to address current bottlenecks include:
- Enhanced Retrieval-Augmented Generation (RAG): Jointly retrain multimodal retrievers capable of handling texts, charts, tables, and images, and use LLM-based filters and rankers to select relevant evidence for the verifier (Wang et al., 18 Jun 2025); a generic similarity-ranking sketch appears after this list.
- Modality-Aligned Pretraining: Pretrain on combined chart/table/caption datasets to endow models with integrated visual–textual alignment, particularly targeting chart understanding (Ho et al., 13 Nov 2025).
- Explicit Reasoning Scaffolds: Incorporate symbolic-neural hybrid systems, programmatic reasoning (ProgramFC), and chain-of-thought prompting to improve multi-hop and compositional inference (Wang et al., 14 Nov 2024).
- Explainability and Diagnostics: Continue development of explainable frameworks leveraging fine-grained rationales (e.g., 5W QA, pixel-level heatmaps, diagnostic probe tasks) (Chakraborty et al., 2023, Lal et al., 5 Jun 2025).
- Scalable, Multilingual, and Multicultural Benchmarks: Expand datasets to cover more languages, world regions, and evidence types (e.g., audio, video) (Geng et al., 27 Oct 2025).
- Human-in-the-Loop and Robustness Paradigms: Integrate expert oversight into benchmarking and model correction, and systematically test under adversarial and domain-shifted conditions (Ho et al., 13 Nov 2025, Geng et al., 27 Oct 2025).
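As a loose sketch of the retrieval step in such a RAG setup (all names and the embedding source are assumptions; in practice the claim and candidate evidence would be embedded into a shared space by a multimodal encoder such as a CLIP-style model), candidates can be ranked by cosine similarity before being passed to an LLM-based filter and finally to the verifier:

```python
import numpy as np

def rank_evidence(claim_emb: np.ndarray, evidence_embs: np.ndarray, top_k: int = 5):
    """Rank candidate evidence items (text, table, chart, image embeddings in a
    shared space) by cosine similarity to the claim embedding."""
    claim = claim_emb / np.linalg.norm(claim_emb)
    evid = evidence_embs / np.linalg.norm(evidence_embs, axis=1, keepdims=True)
    scores = evid @ claim                  # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]    # indices of the top-k candidates
    return order, scores[order]
```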
These directions are expected to move the field substantially beyond present results, which remain well below human expert performance in all but constrained settings.
7. Significance and Impact Across Domains
Multimodal claim verification underpins applications in scientific peer review, misinformation detection, news analysis, and social interaction analysis. In scientific contexts, models are now directly evaluated for their ability to judge claims against figures, captions, and text, intersecting with broader efforts to build “AI reviewers” (Lal et al., 5 Jun 2025, Wang et al., 18 Jun 2025, Ho et al., 13 Nov 2025). In societal settings, scalable tools are needed to counter complex multimodal disinformation and manipulation as evidenced by the FACTIFY 3M dataset and real-world, multilingual fact-checking pipelines (Chakraborty et al., 2023, Geng et al., 27 Oct 2025). Emerging directions like MIVA (Kang et al., 31 Oct 2025) open the frontier of integrating multimodal truth assessment into social reasoning and interactional intelligence.
In summary, multimodal claim verification is an intrinsically challenging, rapidly evolving benchmark for evaluating and advancing AI systems’ ability to perform fine-grained, reliable, and explainable reasoning over heterogeneous evidence, directly relevant to trustworthy information processing in scientific, journalistic, and social contexts (Lal et al., 5 Jun 2025, Kishore et al., 7 Aug 2025, Wang et al., 14 Nov 2024, Ho et al., 13 Nov 2025, Wang et al., 18 Jun 2025, Geng et al., 27 Oct 2025, Yao et al., 2022, Chakraborty et al., 2023, Kang et al., 31 Oct 2025).