
Cohere-Bench: Visual Narrative Benchmark

Updated 25 January 2026
  • Cohere-Bench is a benchmark framework that evaluates generative multimodal language models on long-context, multi-entity visual narratives.
  • It employs automated tasks for story generation and continuation to measure semantic alignment, background consistency, style, and instance coherence.
  • Leveraging fine-grained annotations from Openstory++, the framework enables precise, quantitative evaluation of narrative and visual consistency.

Cohere-Bench is a benchmark framework developed for evaluating the capacity of generative models—specifically multimodal large language models (MLLMs)—to perform long-context, instance-aware image-text generation. Unlike classical multimodal evaluation sets focused on either single-turn image understanding or generation, Cohere-Bench systematically quantifies the ability of models to preserve entity consistency, background and style coherence, and textual alignment across temporally extended visual stories. Developed within the context of the Openstory++ dataset, Cohere-Bench provides a rigorous, fully automated pipeline for model comparison in the domain of open-domain visual storytelling, with carefully designed tasks and metrics targeting narrative and visual coherence at multiple granularities (Ye et al., 2024).

1. Motivation and Novelty

Prevailing multimodal benchmarks, such as OwlEval, LLaVA-Bench, MMMU, and Seed-bench-2, primarily assess performance on visual question answering, grounding, or single-step image generation. They lack provisions for evaluating:

  • Persistence of multiple entities (“instances”) through sequential context.
  • Consistency of visual background and artistic style across generated frames.
  • Maintenance of complex multi-instance interactions over stories of variable length.

Cohere-Bench uniquely addresses these limitations by providing an evaluation set sampled from Openstory++: a large-scale dataset with explicit instance-level annotations, high-resolution keyframes, and meticulously curated captions. The benchmark measures not only image-text alignment but also continuity and integrity of entities, rendering it the first comprehensive framework for long-context, multi-entity visual narrative evaluation in open-domain settings.

2. Task Structure

Cohere-Bench comprises two principal tasks, each evaluated under single-instance (“s”) and multi-instance (“m”) regimes:

  • A. Story Generation:

The model is conditioned on a text prompt for scene 1 (L^1). For subsequent scenes, at turn t, it receives all previously generated images (I^1, \dots, I^{t-1}) plus a prompt L^t, and generates frame I^t. The goal is to systematically preserve subject identities, background, and style across all M outputs.

  • B. Story Continuation:

The sequence is initialized with a ground-truth image I^1 and prompt L^2. The model generates I^2 and continues as above for scenes 3, \dots, M.

All tasks are evaluated for both:

  • Single-instance: exactly one subject recurs across frames.
  • Multi-instance: two or more entities must be tracked through the sequence.
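In pseudocode, the two task protocols reduce to a simple conditioning loop. A minimal sketch, assuming a hypothetical `model(history, prompt)` callable standing in for the MLLM under test (the real benchmark plugs an actual generative model into this slot):

```python
def story_generation(model, prompts):
    """Task A: generate every frame I^t from prompt L^t and all
    previously *generated* images I^1..I^{t-1}."""
    images = []
    for prompt in prompts:                 # prompts for scenes 1..M
        images.append(model(images[:], prompt))
    return images

def story_continuation(model, first_image, prompts):
    """Task B: frame I^1 is ground truth; generation starts at scene 2."""
    images = [first_image]
    for prompt in prompts:                 # prompts for scenes 2..M
        images.append(model(images[:], prompt))
    return images
```

Both tasks share the same autoregressive structure; they differ only in whether the first frame is generated or given.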

3. Dataset Design and Composition

The evaluation corpus for Cohere-Bench draws from a 1M-sequence subset of Openstory++. Key attributes are as follows:

  • Scale: 1600 total sequences (800 single-instance, 800 multi-instance), with sequence lengths M distributed in \{2, 3, 4, 5\}.
  • Annotations: Each frame has a high-resolution image, human-refined caption, and per-instance bounding boxes and pixel masks.
  • Splitting: Cohere-Bench is exclusively an evaluation resource; no further train/validation/test partitioning is applied.
  • Instance Tracking: Recurring entities are explicitly annotated, supporting rigorous quantitative assessment of model ability to maintain identity and integrity over time.

Ground-truth data preparation includes both the original and post-processed forms (resized for model input, refined captions, etc.).
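For concreteness, one evaluation record described above might be shaped as in the sketch below; the field names and types are illustrative assumptions, not the released Openstory++ schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class InstanceAnnotation:
    entity_id: str                     # stable id tying the entity across frames
    bbox: Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixel coordinates
    mask_rle: str                      # run-length-encoded pixel mask

@dataclass
class Frame:
    image_path: str                    # high-resolution keyframe
    caption: str                       # human-refined caption
    instances: List[InstanceAnnotation] = field(default_factory=list)

@dataclass
class Sequence:
    frames: List[Frame]                # length M in {2, 3, 4, 5}
    multi_instance: bool               # single- vs multi-instance regime
```

The shared `entity_id` across frames is what makes the instance-consistency and integrity metrics computable without human matching.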

4. Evaluation Metrics and Methodology

Cohere-Bench introduces a fine-grained metric suite, evaluating both semantic and visual coherence:

  • Semantic Alignment (CLIPScore):

Measures similarity between the model’s output image and the prompt using CLIP-ViT-B-32 embeddings:

\text{SemanticAlignment} = \mathrm{Sim}\bigl(\mathrm{CLIP}_\mathrm{img}(I^t),\,\mathrm{CLIP}_\mathrm{txt}(L^t)\bigr)

  • Background Consistency:

Isolates and inpaints instance regions using YOLO-World, then computes CLIP-based similarity between scene backgrounds:

\text{BgConsistency} = \frac{1}{M-1}\sum_{t=2}^{M} \mathrm{Sim}\bigl(\mathrm{CLIP}(\bar{I}^1),\,\mathrm{CLIP}(\bar{I}^t)\bigr)

  • Style Consistency:

Employs DINOv2-ViT-s16 to assess the similarity of stylistic features between consecutive frames:

\text{StyleConsistency} = \frac{1}{M-1}\sum_{t=2}^{M} \mathrm{Sim}\bigl(\phi(I^{t-1}),\,\phi(I^{t})\bigr)

  • Instance Consistency (single and multi):

For single-instance, the metric computes feature similarity for the single recurring entity; for multi-instance, the Hungarian algorithm matches corresponding entities:

\text{InstCons}_m = \frac{1}{M-1}\sum_{t=2}^{M}\frac{1}{K}\sum_{k=1}^{K} \mathrm{Sim}\bigl(f^1_{k},\,f^t_{\pi_t(k)}\bigr)

  • Instance Integrity:

Quantifies the fraction of an instance remaining visible and recognizable across frames:

\text{InstIntegrity} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Sim}\bigl(f^1_k,\,f^t_{\pi_t(k)}\bigr)

  • Textual Quality (BLEU4):

Generated frames are re-captioned with BLIP2, and corpus BLEU-4 is computed against the human-annotated captions.

All of the metrics above share a common similarity kernel:

\mathrm{Sim}(\mathcal{F},\mathcal{G}) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{f}_i \cdot \mathbf{g}_i

where \mathbf{f}_i and \mathbf{g}_i are feature vectors derived from the respective encoders.
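A minimal numpy sketch of this shared kernel and of one aggregate metric (style consistency), assuming the feature vectors are already L2-normalized so that the dot product equals cosine similarity; function names are illustrative, not the benchmark's API:

```python
import numpy as np

def sim(F, G):
    """Sim(F, G) = (1/N) * sum_i f_i . g_i over paired feature vectors.
    Accepts single vectors or stacked rows of paired features."""
    F, G = np.asarray(F), np.asarray(G)
    return float(np.mean(np.sum(F * G, axis=-1)))

def style_consistency(features):
    """(1/(M-1)) * sum_{t=2}^{M} Sim(phi(I^{t-1}), phi(I^t)) over a list
    of per-frame style features, e.g. DINOv2 embeddings."""
    M = len(features)
    return sum(sim(features[t - 1], features[t]) for t in range(1, M)) / (M - 1)
```

Background consistency follows the same aggregation, but compares every inpainted frame background against frame 1 rather than against the previous frame.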

5. Baseline Performance and Comparative Results

Results on Cohere-Bench reveal substantial gaps in instance-level coherence among current methods:

Model                   SemAlign↑  BgCons↑  StyleCons↑  InstCons(s)↑  InstCons(m)↑  InstInt↑  BLEU4↑
DreamLLM                0.270      0.615    0.615       0.271         0.292         0.144     0.055
MiniGPT-5               0.209      0.634    0.214       0.214         0.219         0.115     0.011
SEED-X                  0.272      0.775    0.762       0.744         0.774         0.421     0.057
Emu2                    0.258      0.788    0.762       0.818         0.787         0.351     0.058
GPT-4V                  0.286      0.762    0.781       0.753         0.761         0.424     0.062
MiniGemini              0.271      0.710    0.577       0.602         0.610         0.203     0.052
Ours (no visual anno)   0.254      0.748    0.766       0.693         0.696         0.383     0.054
Ours (w/ visual anno)   0.279      0.791    0.784       0.821         0.782         0.429     0.064

Instance-level visual supervision (“w/ visual anno”) consistently increases background, style, and entity coherence scores, illustrating the importance of granular visual annotation.

Ablation studies further show that refining captions with an LLM (BLIP2 raw vs LLM-refined) improves semantic alignment (0.228→0.262) and reduces perplexity (38.0→29.0). Incorporating visual annotations yields marked gains in instance consistency (0.693→0.821 single; 0.696→0.782 multi). All models degrade on style and instance metrics as sequence length increases, with significantly lower scores in multi-instance settings, revealing the increased challenge posed by multiple persisting entities (Ye et al., 2024).

6. Implementation and Automated Evaluation Pipeline

Cohere-Bench employs a fully automated process to ensure efficiency and objectivity:

  • Pretrained detectors (YOLO-World) and segmentation models construct instance and background masks for both ground-truth and generated frames.
  • Feature extraction is performed using CLIP and DINOv2 for semantic and style evaluation.
  • Entity matching across frames leverages the Hungarian algorithm to ensure one-to-one correspondence, central for instance integrity calculation.
  • Captioning (BLIP2) and subsequent LLM-based refinement operationalize text-image semantic assessment.
  • All metrics are computed over all 1600 evaluation stories, enabling robust per-model aggregate statistics.

This design permits repeatable, large-scale model comparison without human-in-the-loop evaluation.
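The entity-matching step of this pipeline can be sketched with SciPy's `linear_sum_assignment`, the standard Hungarian-style solver; the sketch assumes L2-normalized per-instance feature rows, and the function name is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_score(feats_ref, feats_t):
    """Match K instances in frame t to the K reference instances of frame 1
    by maximizing total cosine similarity, then return the averaged
    Sim(f^1_k, f^t_{pi_t(k)}) together with the assignment pi_t."""
    sims = feats_ref @ feats_t.T                 # K x K similarity matrix
    rows, cols = linear_sum_assignment(-sims)    # negate: solver minimizes cost
    score = float(sims[rows, cols].mean())
    return score, dict(zip(rows.tolist(), cols.tolist()))
```

Enforcing a one-to-one assignment prevents a single well-rendered entity from being credited multiple times when several entities recur.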

7. Impact and Future Prospects

Cohere-Bench fills a substantial methodological gap in multimodal evaluation, providing the first rigorous suite for quantifying the capacity of generative models to maintain narrative and visual instance integrity in long-context, multi-entity settings. The experimental evidence demonstrates that instance-level supervision—enabled by the fine-grained labels within Openstory++—substantially improves model performance on background, style, and entity tracking metrics.

A plausible implication is that future models for open-domain visual storytelling will increasingly leverage similar benchmarks for both development and evaluation. As generative systems progress toward more sophisticated narrative capabilities, adoption of Cohere-Bench is expected to become standard for the community.
