VERITE Benchmark for Multimodal Misinformation

Updated 13 July 2025
  • VERITE Benchmark is a rigorously engineered evaluation suite that balances image-text data to eliminate unimodal shortcuts in misinformation detection.
  • It integrates the CHASMA synthetic data generation process to simulate real-world misleading associations and drive performance gains.
  • A separate, unrelated benchmark of the same name serves as a profit-centric fuzzing suite for smart contracts, unveiling vulnerabilities with advanced exploitation techniques.

The VERITE Benchmark is a rigorously engineered suite for the evaluation and advancement of multimodal misinformation detection (MMD), addressing fundamental issues in the empirical assessment of multimodal reasoning. Developed to overcome the limitations of unimodal bias found in prior datasets, VERITE offers real-world grounded, modality-balanced test cases, a sophisticated synthetic data generation framework (CHASMA), and has catalyzed methodological and architectural innovations in MMD. Additionally, in an orthogonal research context, VERITE appears as a profit-centric fuzzing benchmark and exploitation suite for smart contracts, though this usage is distinct from the multimodal misinformation domain. The following sections provide an in-depth exploration of VERITE’s conception, construction, experimental protocols, impact, and technical details.

1. Motivation and Problem Formulation

The primary impetus for VERITE stems from pervasive unimodal bias observed in existing multimodal misinformation detection benchmarks such as VMU-Twitter and COSMOS. In these datasets, models relying on a single modality (either images or text) can often match or exceed the performance of fully multimodal models, undermining the core crossmodal reasoning objective inherent to MMD. This issue arises from patterns where misinformation is transmitted predominantly through only one channel, or from data construction artifacts that inadvertently expose shortcuts. Consequently, progress measured on such benchmarks may not reflect genuine advances in multimodal inference.

VERITE (Verification of Image-TExt pairs) is designed to address these shortcomings by:

  • Excluding “asymmetric multimodal misinformation,” where only one modality contributes to meaning or deception.
  • Enforcing “modality balancing,” ensuring each image and each caption appear in both truthful and misleading contexts.
  • Drawing exclusively from real-world, fact-checked incidents, thus aligning experimental evaluation with practical deployed systems (Papadopoulos et al., 2023).

2. Methodology and Construction

The creation of VERITE follows a multi-step, manually curated process grounded in real-world, fact-checked material:

  • Data Collection: Fact-checked articles from sources such as Snopes and Reuters provide genuine image-caption pairs $(I^t_i, C^t_i)$. For each case, a misleading variant $C^f_i$ is extracted directly from the associated fact-check entry. To construct out-of-context (OOC) image pairs $I^x_i$, entity keywords from the golden caption are used as search queries to retrieve plausible but unrelated images.
  • Exclusion of Asymmetric-MM: Each candidate pair is subject to an explicit check; misleading captions must materially misrepresent their paired images, ruling out cases where the image or caption alone suffices for detection.
  • Modality Balancing: Every image and every caption is guaranteed to appear twice: once in a truthful context, once as misleading. Formally, each $I^t_i$ is used once with $C^t_i$ (“True”) and once with $C^f_i$ (“Miscaptioned”), mirroring this process for captions and ensuring balance across the dataset.
  • OOC Pairs: OOC pairings further enhance the challenge by incorporating plausible but misleading associations that neutralize superficial unimodal cues.

This meticulous methodology ensures that MMD models must leverage the alignment or incongruity between image and text to succeed, eliminating spurious correlations exploitable by unimodal approaches (Papadopoulos et al., 2023).
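The balancing invariant described above can be checked mechanically. The sketch below assumes a simple (hypothetical) encoding of the dataset as `(image_id, caption_id, label)` triples; it is illustrative only and does not reflect the official release format.

```python
from collections import Counter

def check_modality_balance(pairs):
    """Verify the VERITE balancing invariant: every image and every
    caption appears exactly once in a truthful pair and exactly once
    in a misleading one.

    `pairs` is a list of (image_id, caption_id, label) tuples with
    label in {'true', 'misleading'} -- a hypothetical encoding used
    here for illustration.
    """
    image_counts = Counter((img, label) for img, _, label in pairs)
    caption_counts = Counter((cap, label) for _, cap, label in pairs)
    images = {img for img, _, _ in pairs}
    captions = {cap for _, cap, _ in pairs}
    balanced_images = all(
        image_counts[(i, "true")] == 1 and image_counts[(i, "misleading")] == 1
        for i in images
    )
    balanced_captions = all(
        caption_counts[(c, "true")] == 1 and caption_counts[(c, "misleading")] == 1
        for c in captions
    )
    return balanced_images and balanced_captions

# Toy balanced set: I1 is truthful with C1 and misleading with C2, and vice versa.
toy = [
    ("I1", "C1", "true"),
    ("I1", "C2", "misleading"),
    ("I2", "C2", "true"),
    ("I2", "C1", "misleading"),
]
```

Dropping any single triple from `toy` breaks the invariant, which is exactly what prevents a unimodal model from associating an image or caption with a fixed label.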

3. Comparative Evaluation and Performance Benchmarks

A robust comparative analysis was conducted using Transformer-based detectors built atop CLIP ViT-L/14 encoders. The paper demonstrates:

  • On VMU-Twitter, image-only models significantly outperform multimodal ones (e.g., $D^{-}(I)$: 83.7% vs $D(I,C)$: 79.7%), revealing visual-side bias.
  • On COSMOS, text-only models are best, revealing text-side bias.
  • On VERITE, fully multimodal models outperform both unimodal variants by margins of 27–43%. For instance, the inclusion of both modalities forces the detector to reason jointly, as illustrated by substantial gains in multiclass accuracy (a 9.22% improvement when training with CHASMA-augmented data).

These results underscore that VERITE eliminates the possibility of unimodal shortcuts and provides a stringent testbed for real progress in MMD (Papadopoulos et al., 2023).

4. Synthetic Data Generation with CHASMA

The CHASMA (Crossmodal HArd Synthetic MisAlignment) process augments the original dataset with challenging synthetic examples:

  • Using pretrained CLIP, embeddings for authentic pairs $(V^I, T^C)$ are computed.
  • For a given image, a plausible but misleading caption $C_j^f$ is selected from a pool of human-written claims using a dual similarity criterion: a uniform random draw $p$ determines the measure, with CLIP text-to-text similarity used when $p \leq 0.5$ and image-to-text similarity otherwise:

$$C_j^f = \arg\max_{C_j^f \in \mathcal{C}_F} \begin{cases} \text{sim}(T_{C^t_i}, T_{C^f_j}) & p \leq 0.5 \\ \text{sim}(V_{I^t_i}, T_{C^f_j}) & p > 0.5 \end{cases}$$

where $\text{sim}(\cdot, \cdot)$ is cosine similarity.
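A minimal NumPy sketch of this selection rule is given below. The embedding vectors are stand-ins for real CLIP outputs, and the function names (`cosine_sim`, `chasma_select`) are hypothetical, not part of the released codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    """Cosine similarity between vector `a` and each row of matrix `b`."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def chasma_select(v_img, t_true_cap, t_false_caps, p=None):
    """Pick the hardest misleading caption for one authentic pair.

    v_img        : CLIP image embedding of I^t_i (stand-in vector here)
    t_true_cap   : CLIP text embedding of the truthful caption C^t_i
    t_false_caps : matrix of embeddings for the candidate pool C_F
    With probability 0.5 the candidate most similar to the truthful
    caption (text-to-text) is chosen; otherwise the one most similar
    to the image (image-to-text).
    """
    if p is None:
        p = rng.random()
    if p <= 0.5:
        scores = cosine_sim(t_true_cap, t_false_caps)  # sim(T_{C^t_i}, T_{C^f_j})
    else:
        scores = cosine_sim(v_img, t_false_caps)       # sim(V_{I^t_i}, T_{C^f_j})
    return int(np.argmax(scores))
```

Because the argmax is taken over a pool of genuine human-written claims, the selected caption is fluent and topically close to the authentic pair, which is what makes the resulting misalignment "hard."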

Empirical incorporation of CHASMA yields up to a 9.2% accuracy improvement over NEI-based baselines by fostering the deeper cross-modal associations critical to robust MMD (Papadopoulos et al., 2023).

5. Technical Implementation and Architectural Overview

VERITE’s recommended implementation employs a two-stage architecture:

  • Encoders: CLIP ViT-L/14 serves as both image and text encoder ($E_I(\cdot)$ and $E_C(\cdot)$), each producing $m = 768$-dimensional embeddings.
  • Fusion and Classification: Embeddings are concatenated and passed through a Transformer encoder; classification utilizes:

$$y = W_1 \cdot \operatorname{GELU}(W_0 \cdot \operatorname{LN}(D([V_I, T_C])))$$

where $W_0 \in \mathbb{R}^{m \times 2}$ (multimodal) and $W_1 \in \mathbb{R}^{n \times (m/2)}$ for $n=1$ (binary) or $n=3$ (multiclass).

  • Training Protocols: Hyperparameters (number of layers $L \in \{1,4\}$, attention heads $h \in \{2,8\}$, feed-forward dimension $f \in \{128,1024\}$) are tuned per dataset size, with scripts provided for full replication.
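The classification head above can be sketched in a few lines of NumPy. This is a toy illustration: the Transformer fusion encoder $D$ is replaced by an identity stand-in, the embedding size is shrunk from 768 to 4, and the weight shapes are illustrative rather than those of the released checkpoint.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fusion_head(v_img, t_cap, W0, W1, D=lambda z: z):
    """Classification head y = W1 . GELU(W0 . LN(D([V_I, T_C]))).

    `D` stands in for the Transformer fusion encoder (identity here);
    weight shapes below are illustrative toy choices.
    """
    z = np.concatenate([v_img, t_cap])  # [V_I, T_C]: concatenated embeddings
    h = gelu(W0 @ layer_norm(D(z)))     # project the fused representation
    return W1 @ h                       # logits: n=1 (binary) or n=3 (multiclass)

m = 4                                        # toy embedding size (real model: m = 768)
rng = np.random.default_rng(0)
W0 = rng.standard_normal((m // 2, 2 * m))    # hidden projection
W1 = rng.standard_normal((3, m // 2))        # n = 3 multiclass logits
logits = fusion_head(rng.standard_normal(m), rng.standard_normal(m), W0, W1)
```

In the multiclass setting the three logits correspond to the "True", "Miscaptioned", and "Out-of-Context" labels used throughout VERITE.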

The complete implementation and data-processing scripts are available at https://github.com/stevejpapad/image-text-verification. Auxiliary datasets (VisualNews, Fakeddit, NC-t2t, R-NESt, CLIP-NESt) are referenced for broader experimentation (Papadopoulos et al., 2023).

6. Influence, Usage, and Methodological Impact

VERITE has influenced both evaluation protocols and model architectures in the study of multimodal misinformation:

  • It is used as the principal benchmark in subsequent fact-checking systems including RED-DOT (“Relevant Evidence Detection Directed Transformer”) (Papadopoulos et al., 2023), which achieves up to a 33.7% accuracy boost over prior state-of-the-art using evidence re-ranking and elementwise modality fusion.
  • Simpler approaches, such as MUSE-based classifiers, match or exceed sophisticated architectures on VERITE by exploiting similarity features, raising pivotal questions about reliance on retrieval shortcuts versus true factual inference (Papadopoulos et al., 18 Jul 2024).
  • In the domain of smart contract security, an unrelated but homonymous benchmark and fuzzer named VERITE targets profit-centric vulnerability detection, employing DeFi action-based mutations, anomaly-driven candidate selection, and stochastic gradient descent maximization to find and exploit vulnerabilities in on-chain contracts (Kong et al., 15 Jan 2025). Its efficacy is demonstrated by extracting more than $18M in total profits across 61 real-world DeFi exploits and outperforming ItyFuzz in both detection (29 vs 9) and exploitation (up to 134-fold higher profits).

7. Limitations and Future Directions

While VERITE addresses key flaws in previous MMD evaluation, several challenges and opportunities remain:

  • The modality-balancing approach may under-represent naturally occurring asymmetric cases, which could matter in some real-world misuse detection settings.
  • Current methods (including approaches evaluated on VERITE) can still be vulnerable to shortcut-based strategies if evidence retrieval introduces strong bias or “leakage.” This suggests a need for further improvements in evidence collection and negative sample construction (Papadopoulos et al., 18 Jul 2024).
  • Detailed annotation and scenario expansion, as well as integration with external knowledge or factuality checking, are cited as critical future directions.
  • In the fuzzing context, future VERITE improvements are envisioned to include automated action template extraction (possibly with LLMs), tighter integration with symbolic reasoning, and proactive on-chain defense mechanisms (Kong et al., 15 Jan 2025).

Table: Key Characteristics and Applications of the VERITE Benchmark

| Aspect | Multimodal Misinformation Detection | Smart Contract Vulnerability Fuzzing |
|---|---|---|
| Application domain | Image-text misinformation (social media) | DeFi smart contracts (on-chain finance) |
| Modality-bias solution | Asymmetric-MM exclusion + modality balancing | N/A (profit-centric candidate selection) |
| Data source | Real-world, fact-checked articles | Real-world exploited projects (61 DeFi) |
| Public code | Yes: CLIP-based Transformer + CHASMA | Yes: profit-maximizing fuzzer + optimizer |
| Notable downstream use | RED-DOT, MUSE, AITR, NewsCLIPpings+ | Agentic LLM exploit generation (Gervais et al., 8 Jul 2025) |

Summary

The VERITE Benchmark establishes a new empirical standard for evaluating multimodal misinformation detection by rigorously eliminating unimodal shortcuts, integrating realistic and challenging data, and inspiring advances in both architectural design and dataset methodology. In the field of smart contract security, VERITE also denotes a separate, profit-centric fuzzing benchmark. Across these applications, VERITE’s technical rigor and empirical impact provide researchers with a robust testbed for evaluating and advancing the true state-of-the-art in crossmodal inference and security-critical vulnerability detection.
