LLaVA-Interleave Bench: Multi-Modal Evaluation
- LLaVA-Interleave Bench is a unified framework that assesses large multimodal models across multi-image, video, and 3D scenarios with interleaved image-text inputs.
- It simulates real-world use cases by integrating static, temporal, and spatial modalities, offering both in-domain and out-domain evaluations with rigorous quantitative metrics.
- The benchmark drives innovations in model architecture and training strategies, highlighting trade-offs and emergent cross-modal transfer capabilities.
The LLaVA-Interleave Bench is a comprehensive evaluation framework developed to assess the multi-image, multi-frame (video), and multi-view (3D) capabilities of Large Multimodal Models (LMMs) in interleaved image–text scenarios. It was introduced as part of the LLaVA-NeXT-Interleave initiative, extending prior single-image LMM testing to support richer, real-world settings where models must jointly process and reason over complex sequences of visual and textual data (Li et al., 10 Jul 2024). The benchmark spans a diverse suite of tasks, domains, and modalities, providing a standardized ground for comparative analysis and driving advancements in multi-image multimodal AI.
1. Design Principles and Objectives
The LLaVA-Interleave Bench was constructed to measure instruction-following and in-context multi-image reasoning. Its primary goals are:
- To emulate practical use-cases where multiple images and accompanying textual instructions must be processed together.
- To challenge LMMs with both in-domain (familiar from training) and out-domain (unseen scenarios) tasks, testing both competence and generalization.
- To encompass not only static images but also temporal (multi-frame video), spatial (multi-view 3D), and single-image multi-patch modalities, thereby building a cross-modal evaluation template.
The interleaved format arranges blocks of image tokens intertwined with text tokens, simulating dialogues, visual stories, search and comparison, captioning, and scientific reasoning, so that every task presents the model with context-rich, cross-modal input.
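As a rough illustration of this interleaved format, the sketch below assembles a prompt in which image placeholders and text segments alternate; the ImageRef type and build_interleaved_prompt helper are hypothetical stand-ins, not the official LLaVA-NeXT-Interleave API.

```python
# Minimal sketch of interleaved prompt construction. The "<image>" marker is
# assumed to be expanded into each image's visual tokens further down the
# model's input pipeline; ImageRef is a hypothetical placeholder type.
from dataclasses import dataclass
from typing import List, Tuple, Union

IMAGE_TOKEN = "<image>"  # marker later replaced by an image's visual tokens

@dataclass
class ImageRef:
    path: str  # stand-in for a loaded image

def build_interleaved_prompt(
    segments: List[Union[str, ImageRef]]
) -> Tuple[str, List[ImageRef]]:
    """Interleave text and image placeholders, preserving their order."""
    parts: List[str] = []
    images: List[ImageRef] = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            parts.append(IMAGE_TOKEN)
            images.append(seg)
        else:
            parts.append(seg)
    return " ".join(parts), images

# Example: a two-image "spot the difference" style instruction
prompt, imgs = build_interleaved_prompt([
    "Image A:", ImageRef("a.jpg"), "Image B:", ImageRef("b.jpg"),
    "List every visual difference between the two images.",
])
print(prompt)  # "Image A: <image> Image B: <image> List every visual difference ..."
```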
2. Task Taxonomy and Data Composition
The suite is organized into a dual-split evaluation protocol:
- In-domain Evaluation (≈12.9K samples): Includes datasets and tasks present during model training. Scenarios comprise Spot the Difference, Visual Storytelling, OCR-based Q&A, multi-image VQA (NLVR2, MIT-States, recipes), image editing instructions (HQ-Edit, MagicBrush, IEdit), and puzzle tasks (Raven).
- Out-domain Evaluation (≈4.1K samples): Challenges models with novel unseen tasks such as MathVerse-mv, SciVerse-mv, Mantis-Eval, BLINK, and MMMU-mv, focusing on multi-image scientific reasoning and mathematical problem-solving.
Additionally, the LLaVA-Interleave Bench includes:
- Multi-frame (Video): Tasks on NExT-QA, STAR, ShareGPTVideo, probing temporal comprehension, consistency, and detailed captioning.
- Multi-view (3D): Uses datasets such as nuScenes (outdoor autonomous-driving scenes), ScanQA, and 3D-LLM for spatial and perspective-based evaluation.
- Single-image Tasks: A holdout split preserves single-image evaluation to track whether multi-image training degrades baseline single-image performance.
All tasks are annotated with quantitative metrics and quality criteria. Example table (abbreviated):
| Split | Scenario | Example Datasets |
|---|---|---|
| In-domain | Spot the Difference | RealDiff, SynthDiff, Birds, Solids |
| Out-domain | Math/Science Reasoning | MathVerse-mv, SciVerse-mv, Mantis-Eval |
| Multi-frame | Video VQA/Captioning | NExT-QA, STAR, ShareGPTVideo |
| Multi-view | 3D Understanding | nuScenes, ScanQA, 3D-LLM |
| Single-image | Baseline tasks | Random 40% of the single-image fine-tuning data |
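For concreteness, one way to organize this dual-split protocol in code is a simple registry keyed by split and scenario; the dataset names below come from the table, while the registry layout and the datasets_for helper are purely illustrative, not the benchmark's actual configuration format.

```python
# Illustrative registry mirroring the splits and example datasets above.
INTERLEAVE_BENCH_SPLITS = {
    "in_domain": {"spot_the_difference": ["RealDiff", "SynthDiff", "Birds", "Solids"]},
    "out_domain": {"math_science_reasoning": ["MathVerse-mv", "SciVerse-mv", "Mantis-Eval"]},
    "multi_frame": {"video_vqa_captioning": ["NExT-QA", "STAR", "ShareGPTVideo"]},
    "multi_view": {"3d_understanding": ["nuScenes", "ScanQA", "3D-LLM"]},
    "single_image": {"baseline_tasks": ["random 40% of single-image fine-tuning data"]},
}

def datasets_for(split: str) -> list:
    """Flatten all example datasets registered under a split."""
    return [d for datasets in INTERLEAVE_BENCH_SPLITS[split].values() for d in datasets]

print(datasets_for("out_domain"))  # ['MathVerse-mv', 'SciVerse-mv', 'Mantis-Eval']
```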
3. Metrics, Scoring, and Analytical Protocols
Performance in LLaVA-Interleave Bench is quantified using a mixture of accuracy, precision, recall, F₁, ROUGE-L, and composite metrics per scenario.
- For Spot the Difference and comparison tasks, average correctness and semantic understanding are measured.
- Video benchmarks add dimensions for Detail Orientation, Temporal Understanding, Consistency, and Context.
- Tasks such as multi-image VQA and document understanding use respective specialized metrics (e.g., NLVR2 accuracy, OCR-VQA accuracy, SlideVQA score).
- Composite metrics are sometimes aggregated over multiple criteria, including object localization, reasoning, dialogue fluency, and perceptual accuracy.
- For video frame pooling, the per-video token count can be expressed as $T \times (H/p) \times (W/p)$, where $T$ is the number of sampled frames, $H \times W$ the per-frame visual-token grid, and $p$ the spatial pooling stride; ablation studies examine the trade-off between spatial granularity and computational cost.
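For concreteness, the sketch below implements two of the metric families named above (exact-match accuracy and LCS-based ROUGE-L F1) in plain Python; the benchmark's actual scoring scripts may differ in tokenization, normalization, and aggregation.

```python
# Self-contained sketch of two metric families used in the benchmark; details
# such as answer normalization are illustrative assumptions.

def accuracy(preds, golds):
    """Exact-match accuracy for closed-ended answers (e.g., NLVR2, OCR-VQA)."""
    assert len(preds) == len(golds)
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)

def rouge_l_f1(pred: str, gold: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    p, g = pred.split(), gold.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p[i - 1] == g[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(g)
    return 2 * prec * rec / (prec + rec)

print(accuracy(["True", "False"], ["true", "True"]))                              # 0.5
print(round(rouge_l_f1("a red bird on a branch", "red bird sitting on a branch"), 3))  # 0.833
```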
4. Underlying Model Architectures and Training Regimes
The principal model developed for this benchmark, LLaVA-NeXT-Interleave (Li et al., 10 Jul 2024), leverages:
- A state-of-the-art vision backbone (e.g., SigLIP-400M or CLIP-ViT-L/336px) with high-resolution multi-aspect support (672×672, 336×1344, 1344×336).
- An intermediate projection module (typically a two-layer MLP) aligning visual tokens to the LLM’s embedding space.
- An LLM backbone (Qwen 1.5, Vicuna, or equivalent) adaptable to multi-modal conditioning.
- Mixed data formats for interleaved input (images “in-the-front” or token interleaving within text).
- A “pooling” strategy for video tokens, aggregating spatial features to reduce input length while preserving temporal context, e.g., downsampling each frame's H × W token grid with a stride-p spatial pooling before concatenating frames in temporal order (cf. the token count in Section 3).
- Optional integration of plug-and-play modules such as Dense Channel Integration (DCI) (Cuong et al., 13 Jun 2025), aggregating features across backbone layers to deepen semantic coherence and support structured scene understanding.
Ablations in the benchmark probe the effects of checkpoint warm-starting (from single-image LLaVA-NeXT), input format (token placement), and frame sampling/aggregation.
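A minimal PyTorch sketch of the encode, project, and pool path described above follows; the feature widths, the 2×2 average pooling, and the module names are illustrative assumptions rather than the official LLaVA-NeXT-Interleave implementation.

```python
# Sketch of the vision-encoder -> MLP projector -> (optional video pooling) path.
# Dimensions are hypothetical (SigLIP-like width 1152, LLM width 4096).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding space."""
    def __init__(self, d_vision: int = 1152, d_llm: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_vision, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # [*, N, d_vision]
        return self.mlp(x)

def pool_video_tokens(frame_feats: torch.Tensor, grid: int, stride: int = 2) -> torch.Tensor:
    """Average-pool each frame's grid of visual tokens to shorten the sequence.

    frame_feats: [T, grid*grid, D]  ->  [T * (grid//stride)**2, D]
    """
    T, _, D = frame_feats.shape
    x = frame_feats.view(T, grid, grid, D).permute(0, 3, 1, 2)  # [T, D, grid, grid]
    x = F.avg_pool2d(x, kernel_size=stride)                     # [T, D, grid/stride, grid/stride]
    return x.flatten(2).permute(0, 2, 1).reshape(T * (grid // stride) ** 2, D)

# Toy example: 8 sampled frames, a 24x24 token grid per frame
frames = torch.randn(8, 24 * 24, 1152)
video_tokens = pool_video_tokens(frames, grid=24, stride=2)     # [8 * 144, 1152]
llm_ready = VisualProjector()(video_tokens)                     # [1152, 4096]
print(video_tokens.shape, llm_ready.shape)
```

Pooling each frame before projection keeps the LLM context roughly linear in the number of frames while preserving their temporal order in the concatenated sequence.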
5. Emerging Capabilities: Cross-Task and Cross-Modal Transfer
It has been observed that instruction tuning on M4-Instruct (1,177.6k samples across 41 datasets) and interleaved formats enables surprising emergent skills:
- Transfer of single-image reasoning expertise to multi-image tasks without direct instruction.
- Cross-modal generalization, e.g., text generation skills learned in Twitter multi-image tasks transferring to video captioning.
- Successful handling of previously untrained real-world applications such as painting style recognition, slide summarization (PPTs), multi-document VQA, and multi-view spatial reasoning.
- Retention of single-image performance following multi-image fine-tuning, with ablation studies confirming that the model does not sacrifice classic VQA and scene understanding capabilities.
6. Comparative Evaluations and Trade-offs
Comparative analysis between standard LLaVA-NeXT-Interleave and DCI-enhanced variants (Cuong et al., 13 Jun 2025) highlights several trade-offs:
- Standard fine-tuned models achieve higher accuracy on vision-dominated tasks (VISION, NLVR2, Fashion200K).
- DCI-equipped models demonstrate gains in document understanding (SlideVQA) and semantic change detection (MIT-States_PropertyCoherence), but may show greater training variance and less robustness in dialogue-heavy, open-response settings.
- Both approaches outperform previous single-image baselines (e.g., average multi-image reasoning accuracy improves from 77.71% to 86.06% post-fine-tuning).
- DCI fusion speeds early convergence, but susceptibility to optimization fluctuations increases in late training stages.
7. Implications and Outlook
The LLaVA-Interleave Bench stands as a pivotal resource for unified multi-modal learning research, guiding:
- Analysis and optimization of model architectures suited for handling interleaved, multi-modal input—informing choices on backbone, projection, pooling, and fusion mechanisms.
- Evaluation of plug-and-play enhancements such as attention-based fusion modules, text-guided visual enhancement (TG-LLaVA (Yan et al., 15 Sep 2024)), and knowledge distillation strategies (LLaVA-MoD (Shu et al., 28 Aug 2024)) for efficient, robust, and scalable deployment.
- Exploration of cross-domain and cross-modal curriculum learning, dataset diversity, and progressive sampling strategies to maximize model generalization and real-world applicability.
- Continual investigation into balancing computational cost against the expressive power and generalization capabilities demanded by emerging multi-image tasks.
By providing a multi-modality benchmark unified in data format, protocol, and evaluation, the LLaVA-Interleave Bench underpins comparative analysis and advances in state-of-the-art LMM development, setting reference standards and illuminating future directions for research in interleaved multimodal reasoning (Li et al., 10 Jul 2024, Cuong et al., 13 Jun 2025).