OpenEvents V1 Benchmark
- OpenEvents V1 is a comprehensive benchmark dataset engineered for event-centric vision–language tasks, pairing images with narrative-style captions that link them to rich contextual events.
- It employs dual tasks—event-enriched image captioning and event-relevant image retrieval—to capture temporal, causal, and semantic aspects of news events.
- Standardized evaluation protocols, rigorous annotation pipelines, and extensive baselines make it a robust testbed for advancing multimodal research.
OpenEvents V1 is a large-scale benchmark dataset designed to advance event-centric vision–language (V–L) understanding by focusing on contextual and temporal event grounding. Unlike traditional V–L benchmarks that emphasize surface-level captions or keyword-based retrieval, OpenEvents V1 supports two primary tasks: event-enriched image captioning and event-relevant image retrieval. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian between 2011 and early 2025, spanning diverse domains. It provides standardized evaluation protocols, extensive baselines, and rigorous annotation pipelines, aiming to establish a robust foundation for research in multimodal models capable of deep reasoning over real-world events (Nguyen et al., 23 Jun 2025).
1. Motivation and Positioning
Conventional V–L benchmarks such as MS COCO, Flickr30K, and WIT focus primarily on direct visual description (e.g., “a man riding a bicycle”) or shallow retrieval strategies centered on keyword matching. These benchmarks do not address models’ ability to reason about event participants, causality, temporal context, or outcomes, which are crucial for real-world applications such as journalism, historical archives, disaster response, and media monitoring.
OpenEvents V1 specifically addresses this gap by encouraging systems to perform “event grounding”—that is, the linking of images to their narrative context, named entities, temporal markers, causes, and consequences. Its narrative-style captioning and event-targeted retrieval tasks are designed to directly reward multimodal, context-aware reasoning that goes beyond visual appearance and toward a deeper semantic and temporal understanding of events.
2. Dataset Composition and Annotation Methodology
Scale and Source Distribution
OpenEvents V1 aggregates a total of 202,803 news articles and 415,324 images across two major news outlets:
| Source | Articles | Images | Time Range |
|---|---|---|---|
| CNN | 24,200 | 89,596 | 2011–2022 |
| The Guardian | 178,603 | 325,728 | 2019–2025 |
The dataset covers a broad range of domains including news, sports, politics, health, lifestyle, arts, and culture. The temporal span extends continuously from 2011 through early 2025, resulting in wide event and entity coverage.
Annotation Schema and Caption Structure
Each image is paired with a narrative-style caption, typically 25–40 words, constructed to answer the classic journalistic questions: who, what, when, where, why, and how. Captions explicitly incorporate:
- Named entities (persons, organizations)
- Temporal markers (dates, periods)
- Spatial references (locations, venues)
- Causality and outcomes
The annotation process utilizes a multi-stage “Human-Agentic Framework”:
- Dense visual description via the open-source Molmo model.
- Contextual question generation (covering who, what, when, where, why).
- Evidence extraction from article text.
- Narrative caption synthesis integrating visual and textual cues.
- Human review and refinement for factual consistency and contextual depth.
Quality control includes a final human review stage with editing or removal of captions containing factual or naming errors.
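As an illustration of how these stages compose, the sketch below arranges them as a simple pipeline. The helper functions are toy placeholders (the real framework relies on a VLM such as Molmo, LLM-based question and caption generation, and human editors) and are not released OpenEvents tooling.

```python
from dataclasses import dataclass

@dataclass
class CaptionDraft:
    image_id: str
    visual_description: str   # stage 1: dense VLM description of the image
    questions: list[str]      # stage 2: contextual who/what/when/where/why questions
    evidence: list[str]       # stage 3: supporting sentences extracted from the article
    caption: str = ""         # stage 4: synthesized narrative caption
    approved: bool = False    # stage 5: human review outcome

# Toy placeholders standing in for the model/human steps of the real framework.
def describe_image(image_path: str) -> str:
    return f"dense visual description of {image_path}"   # would call a VLM such as Molmo

def extract_evidence(article: str, questions: list[str]) -> list[str]:
    # Toy: keep the leading sentences; the real pipeline extracts answer-bearing spans.
    return article.split(". ")[:3]

def human_review(draft: CaptionDraft) -> bool:
    return bool(draft.caption)                            # stand-in for editorial approval

def annotate(image_id: str, image_path: str, article: str) -> CaptionDraft | None:
    """Toy sketch of the five-stage Human-Agentic annotation flow."""
    desc = describe_image(image_path)                                          # stage 1
    questions = ["who is shown", "what happened", "when", "where", "why"]      # stage 2
    evidence = extract_evidence(article, questions)                            # stage 3
    draft = CaptionDraft(image_id, desc, questions, evidence)
    draft.caption = (desc + " Context: " + " ".join(evidence[:2])).strip()     # stage 4
    draft.approved = human_review(draft)                                       # stage 5
    return draft if draft.approved else None   # drop drafts with factual or naming errors
```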
3. Task Structure
3.1 Event-Enriched Image Captioning
Given an image $I$ (and, optionally, its paired article text $A$), the goal is to generate a caption $\hat{c}$ that richly describes the depicted event:

$$\hat{c} = \arg\max_{c}\; p_{\theta}(c \mid I, A),$$

where $\theta$ denotes model parameters. Two regimes are defined: image-only (no article context) and image + article (full contextual grounding).
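To make the two regimes concrete, the snippet below sketches how they differ purely at the prompt level; the final `vlm.generate(...)` call is a hypothetical stand-in for whichever captioning model is used, not a specific API.

```python
def build_caption_prompt(article_text: str | None = None) -> str:
    """Builds the instruction for the two OpenEvents captioning regimes.

    Image-only: the model sees just the image plus this instruction.
    Image + article: the article supplies entities, dates, and causal context.
    """
    instruction = (
        "Write a 25-40 word narrative caption for the image, answering who, what, "
        "when, where, why, and how, with named entities and temporal markers."
    )
    if article_text is None:
        return instruction                                        # image-only regime
    return f"{instruction}\n\nArticle context:\n{article_text}"   # image + article regime

# Hypothetical usage with an instruction-tuned VLM (placeholder call, not a real API):
# caption = vlm.generate(image=img, prompt=build_caption_prompt(article_text=article))
```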
3.2 Event-Relevant Image Retrieval
Given a narrative-style textual query $q$ (mirroring an event description), systems must retrieve a ranked list of images from the full database $\mathcal{D}$:

$$\mathcal{R}(q) = \operatorname{top-}\!K_{\,I \in \mathcal{D}}\; s_{\theta}(q, I),$$

where $s_{\theta}(q, I)$ is a learned query–image relevance score.
A two-stage retrieval option exists: first retrieving the most relevant article, then ranking its associated images. This approach allows more precise alignment between narrative and visual evidence.
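A minimal sketch of this two-stage idea, assuming a sentence-transformers encoder for article retrieval and a Hugging Face CLIP model for image ranking (illustrative model choices, not the benchmark's prescribed pipeline):

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import CLIPModel, CLIPProcessor

# Stage 1: retrieve the most relevant article(s) for the narrative query.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative SBERT checkpoint

def retrieve_articles(query: str, articles: list[str], top_k: int = 1) -> list[int]:
    query_emb = text_encoder.encode(query, convert_to_tensor=True)
    article_embs = text_encoder.encode(articles, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, article_embs)[0]      # cosine similarity per article
    return torch.topk(scores, k=top_k).indices.tolist()

# Stage 2: rank the retrieved article's images against the query with CLIP.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(query: str, images) -> list[int]:
    # Note: CLIP truncates text at 77 tokens, so long narrative queries are clipped.
    inputs = processor(text=[query], images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_image.squeeze(1)  # one score per candidate image
    return sims.argsort(descending=True).tolist()
```

The hybrid baselines in Section 6 insert BART or Pegasus between the SBERT and CLIP stages, presumably to condense the retrieved article into a CLIP-compatible query; the sketch above omits that step for brevity.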
4. Data Splits and Evaluation Protocols
The benchmark prescribes the following splits and protocol:
| Split | Size | Notes |
|---|---|---|
| Train | 21,904 image–caption pairs | Model fitting |
| Public Test | 6,000 image–caption pairs | Fully public for evaluation |
| Private Test | 4,000 image–caption pairs | Held out for the EVENTA 2025 competition |
| Retrieval Database | 415,324 images | All images, plus 202,803 article texts |
No cross-validation protocol is defined: models are expected to train on the official split and report results on the public/private tests. The full retrieval scenario leverages the entire dataset for candidate selection.
5. Evaluation Metrics
5.1 Captioning
Metrics assess both n-gram overlap and semantic fidelity, including:
- BLEU-$n$ ($n = 1, \dots, 4$): modified $n$-gram precision with a brevity penalty
- METEOR: unigram precision/recall with stemming and synonym matching
- ROUGE-L: F-score over the longest common subsequence (LCS)
- CIDEr: TF–IDF-weighted $n$-gram consensus across references
- SPICE: scene-graph tuple alignment score
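As a worked example, BLEU-4 and ROUGE-L can be computed with common open-source packages; the snippet below uses nltk and rouge-score on a made-up caption pair (illustrative tooling, not the benchmark's official scorer).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Made-up reference/hypothesis pair, purely for illustration.
reference = ("President X addresses delegates at the 2023 climate summit in Dubai "
             "after negotiators reached a draft agreement on fossil fuel phase-out.")
hypothesis = ("President X speaks at the 2023 climate summit in Dubai following a "
              "draft agreement on phasing out fossil fuels.")

# BLEU-4: modified 4-gram precision with a brevity penalty (smoothed for short texts).
bleu4 = sentence_bleu([reference.split()], hypothesis.split(),
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# ROUGE-L: F-score over the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
              .score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}")
```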
5.2 Retrieval
Retrieval efficacy is measured via:
- Recall@K: fraction of queries whose ground-truth image appears in the top-$K$ results
- Mean Average Precision (mAP): mean over queries of the average precision of the ranked list
- NDCG@K: normalized discounted cumulative gain at rank $K$
- Nearest-Neighbor Accuracy (NN): accuracy of the top-ranked result (approximately Recall@1)
- AUC: area under the precision–recall curve
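These ranking metrics are easy to compute from per-query ranked image lists. Below is a small illustrative scorer for Recall@K and mAP, assuming (consistent with the image–caption pairing above) a single relevant image per query; it is not the official evaluation code.

```python
def recall_at_k(rankings: list[list[str]], ground_truth: list[str], k: int) -> float:
    """Fraction of queries whose ground-truth image appears in the top-k results."""
    hits = sum(gt in ranked[:k] for ranked, gt in zip(rankings, ground_truth))
    return hits / len(ground_truth)

def mean_average_precision(rankings: list[list[str]], ground_truth: list[str]) -> float:
    """mAP with a single relevant image per query: AP reduces to 1 / rank of the hit."""
    ap_sum = 0.0
    for ranked, gt in zip(rankings, ground_truth):
        if gt in ranked:
            ap_sum += 1.0 / (ranked.index(gt) + 1)
    return ap_sum / len(ground_truth)

# Toy example: two queries ranked over a small candidate pool.
rankings = [["img_3", "img_7", "img_1"], ["img_5", "img_2", "img_9"]]
ground_truth = ["img_7", "img_9"]
print(recall_at_k(rankings, ground_truth, k=1))        # 0.0
print(recall_at_k(rankings, ground_truth, k=3))        # 1.0
print(mean_average_precision(rankings, ground_truth))  # (1/2 + 1/3) / 2 ≈ 0.417
```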
6. Baseline Systems and Results
Captioning Baselines
Three models were evaluated: SmolVLM, Qwen2.5-3B, and Gemma-3-4B. Two pipelines (image-only and image+article) were compared. The “+Article” regime consistently outperforms image-only counterparts, as shown below for the public test split:
| Model + Context | CLIPScore | CIDEr | BLEU-4 | METEOR |
|---|---|---|---|---|
| Gemma + Article | 0.6634 | 0.0184 | 0.0341 | 0.1453 |
| Qwen + Article | 0.5855 | 0.0565 | 0.0419 | 0.1383 |
Adding article context yields a 2–4× improvement in CIDEr and BLEU, highlighting the necessity of external grounding.
Retrieval Baselines
Direct V–L alignment models (CLIP, OpenCLIP) are compared against article-guided pipelines and hybrid reranking methods:
| Retrieval Pipeline | mAP | NDCG | NN | AUC |
|---|---|---|---|---|
| CLIP (direct) | 0.2467 | — | — | — |
| OpenCLIP (direct) | 0.1845 | — | — | — |
| SBERT + BART + CLIP (hybrid) | 0.3232 | 0.3978 | 0.2226 | 0.0436 |
| SBERT + Pegasus + CLIP (hybrid) | 0.3216 | 0.3986 | 0.2173 | 0.0450 |
The strongest alignment is achieved via two-stage article retrieval followed by CLIP reranking of top images.
7. Observations, Error Modes, and Research Opportunities
Empirical results underscore the importance of contextual grounding: systems relying solely on image inputs underperform substantially, while article conditioning yields marked performance gains. News event descriptions contain critical details (dates, entities, causal relations) absent from visual data, necessitating retrieval-augmented or multimodal LLM architectures.
Notably, even the best-performing baselines yield low absolute CIDEr (< 0.06) and modest mAP (< 0.33), indicating considerable headroom for innovation. Common failure modes include hallucinated entities, misaligned temporal markers, and insufficient linkage of textual referents to visual evidence.
Directions for further research include pretraining multimodal transformers on event-centric corpora, improved vision–text fusion for long context, joint event extraction and grounding (spanning QA and timeline construction), and fine-grained scene-graph alignment to support SPICE-based evaluation (Nguyen et al., 23 Jun 2025).
OpenEvents V1 provides a large-scale, richly annotated testbed for research advancing from visual description toward event understanding—emphasizing not just “what it looks like,” but “what happened, why, and who was involved.”