
MMFakeBench: Multimodal Misinformation Detection Benchmark

Updated 9 February 2026
  • MMFakeBench is a comprehensive benchmark of 11k curated mixed-source image–text pairs capturing textual, visual, and cross-modal distortions for misinformation detection.
  • It partitions data into four categories (TVD, VVD, CCD, REAL) with both binary and 12-class annotations, supporting evaluation under realistic mixed-source scenarios.
  • Evaluation protocols using macro-Precision, Recall, F1, and accuracy reveal both the challenges for open-source detectors and improvements via agentic, retrieval-augmented methods.

MMFakeBench is a comprehensive multimodal misinformation detection benchmark developed to advance evaluation and research on mixed-source, real-world scenarios where claims pair text and image modalities. MMFakeBench is distinctive in explicitly modeling, curating, and annotating misinformation across three primary sources—textual, visual, and cross-modal consistency distortions—complemented by a taxonomy of 12 sub-categories. Designed for rigorous benchmarking of Large Vision–Language Models (LVLMs), single-source detectors, and agentic fact-checking systems, MMFakeBench has become a reference point in the study of semantic and artifact-driven misinformation detection in both zero-shot and modular inference regimes.

1. Benchmark Construction and Dataset Design

MMFakeBench comprises 11,000 image–text pairs, drawn from heterogeneous sources and processed via a protocol that ensures mixed-source coverage and factual grounding. Each sample consists of an image (natural, AI-generated, or manipulated using state-of-the-art diffusion models or Photoshop) and a textual claim (news headlines, captions, or tweet-style statements). The design reflects realistic digital misinformation scenarios, with four partitions: Textual Veracity Distortion (TVD, 30%), Visual Veracity Distortion (VVD, 10%), Cross-Modal Consistency Distortion (CCD, 30%), and Real pairs (30%) (Liu et al., 2024).

Annotation is performed via a combination of manual expert verification, external fact-checking sources, and protocolized QA for the source of non-authentic claims. Labels are provided at both the binary (FAKE/REAL) and fine-grained (12-class) levels; the latter captures not only whether a sample is fake, but the mechanism of the deception.

The three primary distortion sources are:

  • Textual Veracity Distortion (TVD): The text is false, supported by either an authentic (repurposed) or AI-generated image.
  • Visual Veracity Distortion (VVD): The text is true, but the accompanying image is manipulated, AI-generated, or contradicts the claim.
  • Cross-Modal Consistency Distortion (CCD): Both modalities are individually true, but their combination creates an inconsistent or misleading claim.

Twelve sub-categories encompass natural, artificial, and GPT-generated rumors in TVD; manual and generative manipulations in VVD; and nuances of repurposing and editing in CCD—ensuring coverage of a broad spectrum of contemporary fake creation pipelines (Liu et al., 2024).
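The two-level annotation scheme described above can be sketched as a simple mapping. Note that the sub-category names below are illustrative placeholders, not the benchmark's official identifiers:

```python
# Sketch of MMFakeBench's two-level label scheme: each sample carries a coarse
# source label (REAL, TVD, VVD, CCD) plus a binary FAKE/REAL view. The
# fine-grained keys here are illustrative stand-ins for the 12 sub-categories.
COARSE_OF = {
    "natural_rumor": "TVD", "artificial_rumor": "TVD", "gpt_rumor": "TVD",
    "manual_edit": "VVD", "generative_edit": "VVD",
    "image_repurposing": "CCD", "caption_edit": "CCD",
    "real": "REAL",
}

def binary_label(sub_category: str) -> str:
    """Collapse a fine-grained sub-category to the binary FAKE/REAL view."""
    return "REAL" if COARSE_OF[sub_category] == "REAL" else "FAKE"
```

The binary view thus derives deterministically from the fine-grained one, which is why the benchmark can report both regimes over the same annotations.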

2. Evaluation Protocols and Metrics

MMFakeBench evaluations are conducted in both four-way (REAL, TVD, VVD, CCD) and binary (REAL/FAKE) regimes. The primary metrics are macro-averaged Precision, Recall, F1 score, and Accuracy:

$$P = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c}, \qquad R = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FN_c}, \qquad F_1 = \frac{2PR}{P + R}, \qquad \mathrm{Acc} = \frac{\sum_{c=1}^{C} TP_c}{N}$$

where $C$ is the number of classes, $TP_c$, $FP_c$, and $FN_c$ denote the per-class true-positive, false-positive, and false-negative counts, and $N$ is the total number of samples (Liu et al., 2024).
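A minimal implementation of these macro-averaged metrics, written from the formulas above (not taken from the benchmark's evaluation code), looks like:

```python
from collections import Counter

def macro_metrics(y_true, y_pred, classes):
    """Macro-averaged precision/recall/F1 plus overall accuracy: per-class
    rates are averaged with equal weight regardless of class frequency."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1  # missed an instance of t
    prec = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
               for c in classes) / len(classes)
    rec = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in classes) / len(classes)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    acc = sum(tp.values()) / len(y_true)
    return prec, rec, f1, acc
```

Because macro averaging weights all four classes equally, a detector cannot inflate its score by exploiting the larger TVD/CCD/REAL partitions at the expense of the smaller VVD one.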

For binary (FAKE/REAL) classification, the dataset enables standard accuracy, specificity, recall, precision, and rejection rate calculations as applied in LMM and agentic system studies (Kheddache et al., 26 Sep 2025, Shopnil et al., 20 Oct 2025).

3. Model Baselines and System Evaluations

MMFakeBench has become a critical testbed for comparing the core families of misinformation detection approaches:

A. Single-Source Detectors:

These include text-only (FakingFakeNews, RoBERTa on PROPANEWS), image-only (CNNSpot, UnivFD, LNP), and consistency checkers (HAMMER, FakeNewsGPT4). Pipeline-chained “Mixed Detection” of the top detectors yields only 47.6% binary accuracy and 22.5% macro-F1, highlighting their poor generalization to mixed-source setups (Liu et al., 2024).

B. Vision–Language Models (LVLMs):

Open-source models at the 7B–34B parameter scale (Otter-Image, InstructBLIP, LLaVA, BLIP-2, etc.) typically produce macro-F1 scores between 5% and 50%. Proprietary models show higher performance; GPT-4V (with standard prompting) achieves 74.0% binary accuracy and 61.6% macro-F1 (Liu et al., 2024). Sub-source analysis reveals that TVD (textual distortion) remains the most challenging for both open-source and closed-source LVLMs.

C. Agentic and Hybrid Approaches:

Systems such as MMD-Agent and MIRAGE introduce hierarchical reasoning and explicit module composition:

  • MMD-Agent uses sequential decomposed evaluation (text, visual, consistency) with intermediate rationales and structured action outputs. MMD-Agent on GPT-4V improves macro-F1 from 48.8% to 61.5%, and LLaVA-34B sees an increase from 25.4% to 47.7% (Liu et al., 2024).
  • MIRAGE adds retrieval-augmented verification with multi-hop web reasoning, outperforming all zero-shot baselines (e.g., MIRAGE achieves 81.65% F1 and 75.1% accuracy on 1,000-sample validation; 81.44% F1 and 75.08% accuracy on a 5,000-sample test subset) (Shopnil et al., 20 Oct 2025).
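The sequential decomposed evaluation that MMD-Agent performs can be sketched as a short cascade. This is a hedged simplification of the published method: `ask_lvlm` is a placeholder for a real LVLM call, and the exact sub-questions and early-exit order are assumptions:

```python
# Sketch of MMD-Agent-style hierarchical decomposition: each stage asks one
# focused sub-question, and the first detected distortion fixes the source
# label. `ask_lvlm(question, image) -> bool` is a hypothetical stand-in.
def mmd_agent_verdict(claim, image, ask_lvlm):
    if ask_lvlm(f"Is this claim factually false? {claim}", image):
        return "TVD"   # textual veracity distortion
    if ask_lvlm("Is this image manipulated or AI-generated?", image):
        return "VVD"   # visual veracity distortion
    if ask_lvlm(f"Does the image contradict or mislead relative to: {claim}?",
                image):
        return "CCD"   # cross-modal consistency distortion
    return "REAL"
```

Decomposing the decision this way gives each sub-judgment a narrow scope, which is the intuition behind the macro-F1 gains over monolithic prompting reported above.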

D. GPT-4o Prompt-Engineering Approach:

A structured six-question, chain-of-thought prompt yields moderate performance (67% accuracy, 83% recall, 51% specificity, 2% rejection rate on a 200-sample subset) with confidence reporting showing high confidence for “real” calls and medium confidence when labeling content as “fake” (Kheddache et al., 26 Sep 2025).
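The shape of such a structured multi-question chain-of-thought prompt can be illustrated as follows; the six questions below are invented stand-ins, not the exact wording used in the study:

```python
# Illustrative six-question chain-of-thought prompt template. The question
# texts are placeholders for the schema's actual questions, which are not
# reproduced here.
QUESTIONS = [
    "What does the image literally depict?",
    "What does the text claim?",
    "Is the claim independently plausible?",
    "Does the image show signs of generation or editing?",
    "Do the image and text refer to the same event or entity?",
    "Overall verdict: REAL or FAKE, with confidence (high/medium/low)?",
]

def build_prompt(claim: str) -> str:
    """Assemble the numbered question list into a single CoT prompt."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(QUESTIONS, 1))
    return ("Answer each question step by step before the final verdict.\n"
            f"Claim: {claim}\n{numbered}")
```

Forcing the model to answer intermediate questions before committing to a verdict is what yields the explicit confidence reporting described above.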

4. Analysis of Manipulation Types and Error Modes

Analysis across competing systems reveals that semantic/contextual manipulations (i.e., mismatched, out-of-context, or claim-distorting pairings) dominate MMFakeBench. Pixel-level artifact-based forgeries—face swaps, GAN/diffusion-generated objects—form a minority. Image-only deepfake detectors underperform (F1 ≤ 0.53), often classifying semantically misaligned but artifact-free pairs as “REAL,” misleading downstream inferences (Sagar et al., 2 Feb 2026).

Agentic and retrieval-based models demonstrate that semantic understanding and provenance, rather than low-level visual forensics, are the primary drivers of claim verification performance (Shopnil et al., 20 Oct 2025, Sagar et al., 2 Feb 2026). Explicit web search, question generation, and deliberative alignment checks are essential for disambiguating subtle, claim-level misinformation. The implication is that artifact cues should be treated as non-binding auxiliary signals; end-to-end semantic-plus-retrieval reasoning is necessary for robust performance.

5. Prompting Strategies, Ablations, and System Architecture

Prompting and agentic decomposition greatly impact MMFakeBench performance:

  • Structured Prompts: Hierarchical, multitask prompts (as in the six-question GPT-4o schema or MMD-Agent) substantially outperform monolithic classification.
  • Module Ablations: MIRAGE ablation studies show that removing visual verification reduces F1 by 5.18 points; removing retrieval-augmented grounding reduces F1 by 2.97 points. Pure judge-only systems (“text headline only”) exhibit class-imbalance exploitation (F1 = 82.74%, Acc = 70.8%, FP Rate = 97.3%) but are not robust (Shopnil et al., 20 Oct 2025).
  • Confidence and Variability: Confidence reporting reveals a bias toward “safe” real labels. No explicit MMFakeBench variability scores are reported, though elsewhere in the literature instance-level prediction variability between runs reaches 11–12% (Kheddache et al., 26 Sep 2025).
  • Fusion Strategies: Late fusion approaches (unimodal model ensembling) are outperformed by cross-modal or retrieval-augmented designs (Liu et al., 2024, Shopnil et al., 20 Oct 2025).
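For contrast, the late-fusion baseline that cross-modal and retrieval-augmented designs outperform amounts to averaging unimodal scores; the equal weights and 0.5 threshold here are assumptions for illustration, not reported settings:

```python
# Minimal late-fusion sketch: each unimodal detector emits a fake-probability
# score in [0, 1], and the fused verdict is a thresholded weighted average.
def late_fusion(text_score: float, image_score: float,
                thresh: float = 0.5) -> str:
    fused = 0.5 * text_score + 0.5 * image_score  # equal-weight average
    return "FAKE" if fused >= thresh else "REAL"
```

Because the two scores never interact before averaging, this design cannot represent cross-modal inconsistency (CCD), where each modality is individually truthful; that structural blind spot is why late fusion trails cross-modal approaches on MMFakeBench.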

6. Research Directions and Limitations

Ongoing limitations and unresolved challenges include:

  • External Knowledge Limitations: Retrieval-augmented models are limited by source recall, coverage, and context collapse in web search.
  • Sample Scope: MMFakeBench currently addresses only paired image–text claims; there is an explicit need to extend to video, audio, and social media threads (Liu et al., 2024).
  • Annotation Granularity & Robustness: Future work calls for annotating not only deception labels but also the type and region of manipulation, to support model explainability, combined with crowdsourced scaling for real-time verification (Liu et al., 2024).
  • Generalization: Open-source models, even at scale (34B), lag behind closed-source counterparts, especially on complex textual or compositional manipulations.
  • Agentic Reasoning Calibration: Future research is directed toward improved fusion of subtask outputs, confidence calibration, multistep reasoning, and incorporation of web-browsing toolkits.

The corpus continues to evolve, with recommendations for adversarial and continual learning protocols, region-level explanation benchmarks, and expansion to richer, multi-turn evidence flows (Liu et al., 2024).

7. Significance and Positioning Amongst Benchmarks

MMFakeBench is distinct from pure artifact-centric (e.g., DeepfakeBench-MM (Zhao et al., 26 Oct 2025)) or explainability-driven (e.g., FakeBench (Li et al., 2024)) corpora. It occupies the intersection of claim-level semantic reasoning, mixed-source misinformation, and multimodal entailment, and—by virtue of its taxonomy, size, and heterogeneity—acts as a de facto standard for evaluating the next generation of LVLM and AFC frameworks in the misinformation detection domain (Liu et al., 2024, Shopnil et al., 20 Oct 2025, Kheddache et al., 26 Sep 2025).

A plausible implication is that advances in MMFakeBench generalization and source-aware reasoning architectures will accelerate progress across downstream tasks, including explainable fake detection, external knowledge integration, and cross-modal misinformation beyond static image–text scenarios.
