Factify 2: Multimodal Fact Verification Dataset

Updated 10 August 2025
  • Factify 2 is a comprehensive dataset containing claim–document pairs with textual, visual, and OCR evidence for automated fact verification.
  • The dataset employs a detailed annotation scheme with five classes to capture support, insufficient evidence, and refutation across modalities.
  • Baseline models using transformer and vision architectures achieve competitive macro-F1 scores, setting a benchmark for multimodal fake news detection.

Factify 2 is a large-scale multimodal fact verification dataset and benchmark designed for the detection of fake news and automated claim verification using both textual and visual evidence. Factify 2 advances the state of multimodal fact-checking through its carefully structured annotation scheme, diverse source inclusion—including satire—and the provision of strong, openly available baselines. It underpins recent advances in multimodal fake news detection and has catalyzed a series of competitive shared tasks, driving algorithmic and systems-level research in automated veracity assessment.

1. Dataset Structure and Label Taxonomy

Factify 2 is constructed using claim–document pairs, each containing both text and corresponding images, along with associated OCR data extracted from images when present. Each sample consists of:

  • A claim, typically mimicking a social media post, with textual content and an associated image.
  • A supporting document (news article or context-providing post) containing its own text and image.
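Concretely, a single pair can be pictured as the following record. This is an illustrative sketch only; the field names are hypothetical and do not reflect the dataset's actual release format.

```python
# Hypothetical representation of one Factify 2 claim–document pair.
# Field names are illustrative; consult the dataset release for the real schema.
sample = {
    "claim_text": "City mayor announces free public transit starting Monday.",
    "claim_image": "images/claim_00123.jpg",   # image attached to the claim
    "claim_ocr": "FREE TRANSIT MONDAY",        # OCR text extracted from the claim image
    "document_text": "The mayor's office confirmed on Friday that ...",
    "document_image": "images/doc_00123.jpg",
    "document_ocr": "",                        # empty when the image contains no text
    "label": "Support_Multimodal",             # one of the five annotation classes
}
```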

The labeling schema comprises three principal classes—Support, No-Evidence (termed "Insufficient"), and Refute—each further subdivided based on modality:

  • Support
      ◦ Support_Text: the document text supports the claim, regardless of image evidence.
      ◦ Support_Multimodal: both text and image in the document support the claim.
  • No-Evidence
      ◦ Insufficient_Text: the document fails to provide sufficient textual support or contradiction.
      ◦ Insufficient_Multimodal: neither text nor image provides decisive support, though images may exhibit similarity.
  • Refute
      ◦ Refute: text and/or images in the document explicitly contradict the claim.

This results in five distinct annotation classes for the task, capturing nuanced cross-modal relationships essential for real-world misinformation scenarios (Suryavardan et al., 2023).

2. Data Sources, Diversity, and Expansion

Factify 2 includes 50,000 annotated claim–document pairs, a significant expansion from its predecessor. The data acquisition pipeline covers:

  • Real news from reputable Twitter handles of major outlets (spanning India, the USA, etc.).
  • Fact-checking websites such as Snopes, Factly, and Boom.
  • Satirical articles, specifically introduced to increase challenge complexity and realism by mimicking authentic journalistic style in fake contexts.

The inclusion of satirical content is novel in comparison to prior datasets, compelling models to discern factual intent beyond superficial text-image congruence. The data is balanced across all five classes (train:validation:test ratio 70:15:15) (Suryavardan et al., 2023).
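The split sizes implied by the stated totals can be checked directly: 50,000 pairs at a 70:15:15 ratio, balanced over five classes.

```python
total_pairs = 50_000

# Train/validation/test counts from the 70:15:15 ratio stated in the paper.
train, val, test = (round(total_pairs * r) for r in (0.70, 0.15, 0.15))

# Balanced labels imply an equal share of pairs per class.
per_class = total_pairs // 5
```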

During collection, visual entailment between claim and document images is determined using a combination of cosine similarity on ResNet50 image embeddings and histogram similarity with carefully selected thresholds (Suryavardan et al., 2023).
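A minimal sketch of this kind of two-signal check, assuming precomputed ResNet50 embeddings and normalized image histograms are already available; the threshold values here are placeholders, not the authors' tuned values.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (e.g. ResNet50 features)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def histogram_sim(h1: np.ndarray, h2: np.ndarray) -> float:
    """Histogram intersection similarity for two normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def visually_entails(emb_claim, emb_doc, hist_claim, hist_doc,
                     emb_thresh=0.85, hist_thresh=0.5):
    """Combine both similarity signals; thresholds are illustrative placeholders."""
    return (cosine_sim(emb_claim, emb_doc) >= emb_thresh
            and histogram_sim(hist_claim, hist_doc) >= hist_thresh)
```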

3. Baseline Models and Technical Implementations

The Factify 2 release provides openly available strong baselines rooted in state-of-the-art architectures:

  • Text Encoding: Sentence BERT (SBERT) using the stsb-mpnet-base-v2 variant is applied to both claim and document textual information to obtain dense text embeddings.
  • Image Encoding: Vision Transformer (ViT) models, and alternatively ResNet50, extract high-level visual features.
  • Fusion and Classification: The concatenated text and image embeddings for each sample,

F = [T; I]

are processed via a multilayer perceptron (MLP), whose softmax output yields probability distributions over the five classes:

y = softmax(MLP(F))

The baseline model using ViT + SBERT-MPNet achieves a macro-F1 of approximately 0.65, surpassing alternatives that pair ResNet50 with SBERT-RoBERTa. These results establish a challenging reference point for subsequent methods (Suryavardan et al., 2023).
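The fusion-and-classification step above can be sketched as follows. This is not the released baseline code; the embedding and hidden dimensions are illustrative, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_classify(text_emb, image_emb, W1, b1, W2, b2):
    """Concatenate modality embeddings (F = [T; I]) and classify into 5 classes."""
    f = np.concatenate([text_emb, image_emb])   # fused representation F
    h = np.maximum(0.0, W1 @ f + b1)            # ReLU hidden layer
    return softmax(W2 @ h + b2)                 # distribution over the 5 labels

# Illustrative dimensions: 768-d SBERT text, 768-d ViT image, 256 hidden units.
d_text, d_img, d_hidden, n_classes = 768, 768, 256, 5
W1 = rng.normal(scale=0.02, size=(d_hidden, d_text + d_img)); b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.02, size=(n_classes, d_hidden));      b2 = np.zeros(n_classes)

probs = mlp_classify(rng.normal(size=d_text), rng.normal(size=d_img), W1, b1, W2, b2)
```

The output is a valid probability vector over the five annotation classes; in the actual baseline the MLP weights are learned end-to-end on the training split.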

Cutting-edge submissions to the associated shared task demonstrate rapid progress: the best performing system achieved a weighted F1 of 81.82% by combining DeBERTa (text) and SwinV2/CLIP (images) with sophisticated fusion and ensemble techniques (Du et al., 2023, Suryavardan et al., 2023).

4. Methodological Innovations in Benchmarking

Factify 2's formulation requires joint reasoning over text and images, enabling multimodal entailment as the central verification paradigm:

  • Systems must model not only local alignment (e.g., lexical or visual similarity) but also global narrative coherence across modalities.
  • Benchmark participants have leveraged transformer-based LLMs, parameter-efficient vision transformers, and cross-modal attention mechanisms to capture subtle interactions between modalities (Du et al., 2023, Verschuuren et al., 2023, Zhang et al., 2023, Kishore et al., 2025).
  • Explicit structural features (such as sentence length, ROUGE overlap, and cosine similarities of deep representations) combined with classical classifiers (e.g., random forests) have been shown to provide complementary accuracy gains (Zhang et al., 2023).
  • Ablation studies confirm that cross-modal co-attention and ensemble aggregation across model variants deliver significant improvements over unimodal or naïvely fused approaches (Du et al., 2023).
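The structural features mentioned above can be sketched as a simple feature extractor. This is a hypothetical illustration, not the cited systems' code: ROUGE overlap is approximated here with unigram recall, and the classifier itself is omitted.

```python
import numpy as np

def unigram_overlap(claim: str, doc: str) -> float:
    """Rough ROUGE-1-recall-style score: fraction of claim unigrams in the document."""
    c, d = set(claim.lower().split()), set(doc.lower().split())
    return len(c & d) / len(c) if c else 0.0

def structural_features(claim: str, doc: str,
                        claim_emb: np.ndarray, doc_emb: np.ndarray) -> np.ndarray:
    """Build an explicit feature vector from surface statistics and embeddings."""
    cos = float(np.dot(claim_emb, doc_emb)
                / (np.linalg.norm(claim_emb) * np.linalg.norm(doc_emb)))
    return np.array([
        len(claim.split()),            # claim sentence length
        len(doc.split()),              # document length
        unigram_overlap(claim, doc),   # lexical overlap proxy for ROUGE
        cos,                           # cosine similarity of deep representations
    ])
```

Feature vectors of this kind can then be fed to a classical classifier such as a random forest, complementing the purely neural pipeline.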

5. Societal and Algorithmic Impact

Factify 2 facilitates algorithmic advances with high applied relevance:

  • Multimodal Fact Verification: The dataset formalizes multimodal entailment, now a central target in fake news detection research facing increasingly sophisticated misinformation involving manipulated imagery and complex cross-domain cues.
  • Automatic Satire Detection: The inclusion of satire presents a heightened challenge and reflects deployment realities, where intent is not always immediately recoverable by surface-level analysis.
  • Model Evaluation: The weighted F1 score across all five classes is adopted as the central metric, emphasizing balanced precision and recall—crucial given subtle inter-class ambiguities, especially between insufficient and support classes.
  • Open Science: Both data and baseline code are available, accelerating reproducibility and comparative evaluation (Suryavardan et al., 2023).
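The weighted F1 metric used above averages per-class F1 scores weighted by class frequency. A minimal hand-rolled version (equivalent in spirit to scikit-learn's `f1_score` with `average="weighted"`):

```python
import numpy as np

def weighted_f1(y_true, y_pred, labels) -> float:
    """Support-weighted F1: per-class F1 averaged by each class's frequency."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, n = 0.0, len(y_true)
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += f1 * np.sum(y_true == c) / n   # weight by class support
    return total
```

Weighting by class support keeps the metric meaningful even when predictions skew toward easily confusable classes, such as the insufficient versus support labels noted above.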

6. Relation to Broader Fact-Checking Benchmarks

Factify 2 represents an evolution from primarily text-based datasets such as FEVER and LIAR to deeply multimodal, context-rich resources. Unlike FEVER, which focuses on textual entailment with Wikipedia as the evidence base, Factify 2 demands modeling information from both text and visuals, as found in contemporary social/news media (Suryavardan et al., 2023). More recent datasets such as FACTIFY 3M extend this direction to even larger scales (3 million samples) and integrate explainability features such as 5W QA pairs and pixel-level heatmaps (Chakraborty et al., 2023).

Comparatively, Factify 2’s challenge is marked by balanced class structure, explicit incorporation of satire, strong baselines, and integration of both modern transformer and vision architectures. It forms the basis for a series of annual shared tasks with rapidly rising upper-bound performance—driven largely by innovations in cross-modal reasoning and fusion.

7. Availability and Continuing Relevance

The Factify 2 dataset and codebase are accessible at:

https://github.com/surya1701/Factify-2.0

This ensures continued uptake, broad benchmarking, and integration into emerging research on scalable, interpretable, and robust multimodal fact verification in a setting that closely mimics real-world misinformation challenges (Suryavardan et al., 2023).