LLaVA-NeXT-Interleave: Unified Multimodal Model
- The paper introduces a unified architecture that interleaves visual and text tokens, eliminating the need for scenario-specific tuning.
- It leverages a modular pipeline—combining a vision encoder, intermediate projector, and instruction-tuned LLM—to achieve leading performance on diverse benchmarks.
- Comprehensive evaluations using the M4-Instruct dataset and LLaVA-Interleave Bench demonstrate robust cross-domain generalization and emergent zero-shot adaptation.
LLaVA-NeXT-Interleave is an open-source large multimodal model (LMM) designed to natively handle complex visual and multimodal reasoning across multi-image, video (multi-frame), 3D (multi-view), and multi-patch (single high-resolution image) scenarios within a unified, interleaved token framework. It builds upon the LLaVA-NeXT family by introducing architectural and data innovations that enable generalization, zero-shot adaptation, and cross-domain instruction following beyond single-image tasks. The model is supported by a purpose-built multi-domain dataset (M4-Instruct) and a comprehensive evaluation suite (LLaVA-Interleave Bench), and demonstrates leading performance on a wide range of multimodal benchmarks.
1. Modular Architecture and Interleaved Tokenization
LLaVA-NeXT-Interleave generalizes the LLaVA-NeXT single-image foundation by adopting a modular, three-part pipeline:
- Vision encoder: Converts each image, frame, or high-resolution patch into fixed-length feature embeddings. SigLIP-400M is commonly used, leveraging “any resolution” designs in which high-resolution images are subdivided into smaller patches.
- Intermediate projector: Typically a two-layer MLP that aligns vision embeddings with the LLM’s input embedding space (a minimal sketch follows this list).
- Instruction-tuned LLM: Receives a sequence of “interleaved” visual and text tokens, enabling seamless reasoning over multimodal contexts.
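The projector can be pictured as a small module of this shape. The block below is a minimal sketch, with the hidden sizes (1152 for a SigLIP-style encoder, 4096 for the LLM) chosen purely for illustration rather than taken from the released configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding space.

    The dimensions are illustrative assumptions, not the released configuration.
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (num_visual_units, tokens_per_unit, vision_dim)
        return self.proj(vision_feats)
```

Each image, frame, view, or patch then contributes a fixed-length block of projected tokens to the LLM input.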
The central method involves arranging multimodal input as a linear sequence:
⟨image₁⟩ Text₁ ⟨image₂⟩ Text₂ … ⟨image_N⟩ Text_N

Here, ⟨image_i⟩ signifies an embedded visual unit (patch, frame, or view), and intervening text tokens provide context, questions, or instructions. This interleaved representation eliminates the need for scenario-specific architectures—in contrast to prior approaches, which treat video, 3D, or multi-image tasks separately—allowing all modalities to be processed within one unified Transformer-based paradigm (Li et al., 10 Jul 2024).
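A minimal sketch of how such a sequence can be assembled is given below; the placeholder marker, tokenizer, and embedding calls are generic stand-ins for illustration, not the model’s actual implementation.

```python
import torch

IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder marker, for illustration only

def build_interleaved_embeddings(segments, image_feats, tokenizer, embed_tokens):
    """Interleave projected visual tokens with embedded text tokens.

    segments    : list alternating text strings and IMAGE_PLACEHOLDER markers
    image_feats : list of projected visual features, one (tokens, llm_dim) tensor per image
    tokenizer   : any tokenizer whose encode() returns a list of token ids
    embed_tokens: the LLM's token-embedding layer
    """
    parts, img_idx = [], 0
    for seg in segments:
        if seg == IMAGE_PLACEHOLDER:
            parts.append(image_feats[img_idx])          # projected visual tokens
            img_idx += 1
        else:
            ids = torch.tensor(tokenizer.encode(seg))   # text token ids
            parts.append(embed_tokens(ids))             # text token embeddings
    return torch.cat(parts, dim=0)                      # one linear multimodal sequence
```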
2. Interleaved Data Format: Design and Implications
The interleaved data format merges image and text information into a contiguous token stream, supporting:
- Flexible modality ordering: Images and text can be alternated in their natural order or grouped (e.g., all images placed at the front of the sequence), facilitating both training and inference strategies.
- Unified treatment of scenarios: The same pipeline handles single-image, multi-image, video (multi-frame), and multi-view (3D) inputs.
- In-context multimodal learning: Demonstrative examples (interleaved images and texts) can be presented directly in the input, fostering multimodal in-context learning and compositional reasoning.
By adopting a single, scenario-agnostic template, LLaVA-NeXT-Interleave both simplifies system design (removing scenario-aware tokenization) and enables broad generalization and transfer.
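As an illustration, an interleaved instruction sample in this scenario-agnostic template might look like the following; the field names and file paths are hypothetical and do not reproduce the literal M4-Instruct schema.

```python
# Hypothetical interleaved sample: two images discussed in one conversation turn.
sample = {
    "images": ["scene_before.jpg", "scene_after.jpg"],  # illustrative paths
    "conversations": [
        {"role": "user",
         "content": "<image> <image> Describe what changed between the two photos."},
        {"role": "assistant",
         "content": "A red car is now parked in front of the building, and the banner has been removed."},
    ],
}
```

The same template covers video (frames as the image list), 3D (views), and multi-patch (crops of one high-resolution image).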
3. M4-Instruct: Multi-Domain Instructional Corpus
The capabilities of LLaVA-NeXT-Interleave derive from the scale and heterogeneity of the M4-Instruct dataset, comprising:
- 1,177,600 samples collected from 41 datasets, spanning 14 canonical multimodal tasks.
- Task selection includes:
- Multi-image: Tasks such as “spot the difference,” visual storytelling, edit instruction generation, multi-image puzzles, and image-dialogue.
- Multi-frame (video): Video detailed captioning and VQA, using sequential frame decomposition.
- Multi-view (3D): Embodied VQA and 3D scene QA leveraging spatial viewpoint variation.
- Multi-patch: Single high-resolution images split into spatial patches, preserving fine detail for single-image tasks (see the patching sketch at the end of this section).
- Domain variation: Samples range from real-world scenes and cartoons to synthetic imagery and surveillance video, promoting cross-domain generalization and robustness.
The dataset’s breadth is designed to elicit emerging behaviors by presenting diverse yet unified instruction-following multimodal signals during training.
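The multi-patch scenario treats one high-resolution image as a grid of crops fed through the same interleaved pipeline. A minimal patching sketch is shown below; the fixed 2×2 grid is an assumption (real “any resolution” schemes select the grid from the image’s aspect ratio and also keep a downscaled overview image).

```python
from PIL import Image

def split_into_patches(image: Image.Image, grid: tuple[int, int] = (2, 2)):
    """Split a high-resolution image into a simple grid of patches.

    The 2x2 grid is an illustrative default, not the released "any resolution" logic.
    """
    w, h = image.size
    cols, rows = grid
    pw, ph = w // cols, h // rows
    patches = []
    for r in range(rows):
        for c in range(cols):
            box = (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
            patches.append(image.crop(box))
    return patches
```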
4. Benchmarks, Evaluation, and Comparative Performance
LLaVA-NeXT-Interleave is systematically evaluated using the LLaVA-Interleave Bench and auxiliary cross-domain leaderboards:
- Multi-image: Outperforms open-source competitors on NLVR2, Q-Bench, and BLINK (in-domain) and on MathVerse-mv and SciVerse-mv (out-of-domain).
- Video tasks: Achieves leading accuracy on video QA and open-ended generation tasks (e.g., NExT-QA, STAR), using either pooled vision tokens or video-DPO fine-tuning (see the pooling sketch at the end of this section).
- 3D/multi-view: Delivers state-of-the-art results on ScanQA, nuScenes VQA, and 3D-LLM benchmarks—operating solely on multi-view images without point clouds.
- Single-image integrity: Maintains or surpasses baseline performance on classic tasks, demonstrating that expanded modalities do not degrade single-image reasoning.
Experimental findings confirm that a unified architecture and interleaved format can concurrently improve multimodal and classic benchmarks, contrasting with prior modular approaches that require domain-specific tuning (Li et al., 10 Jul 2024).
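In the video setting, “pooled vision tokens” refers to spatially downsampling each frame’s token grid before it enters the LLM. The sketch below illustrates the idea; the 2×2 pooling factor and the square, raster-ordered token grid are assumptions.

```python
import torch
import torch.nn.functional as F

def pool_frame_tokens(frame_tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average-pool each frame's visual tokens over a square spatial grid.

    frame_tokens: (num_frames, tokens_per_frame, dim), tokens_per_frame a perfect square
    Returns     : (num_frames, tokens_per_frame // factor**2, dim)
    """
    f, n, d = frame_tokens.shape
    side = int(n ** 0.5)
    x = frame_tokens.view(f, side, side, d).permute(0, 3, 1, 2)  # (f, d, side, side)
    x = F.avg_pool2d(x, kernel_size=factor)                      # spatial pooling
    return x.flatten(2).permute(0, 2, 1)                         # back to (f, tokens, d)
```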
5. Emergent Properties and Transfer Capabilities
Training on a mixed, interleaved objective yields novel emergent behaviors:
- Task and modality transfer: Capable of recontextualizing single-image skills (e.g., humor analysis) to multi-image narratives, or generating Twitter-style posts from both images and videos.
- Generalized cross-document reasoning: Able to summarize or compose answers over multiple documents (e.g., slide deck or multi-document VQA) without explicit training for such transfer.
- Zero-shot/few-shot adaptation: Compositional training enables unanticipated generalization to unseen instruction types and domains.
These effects suggest internal acquisition of latent, cross-modal compositional mechanisms, supported by the diversity and format of the training data.
6. Practical Applications and Implementation
LLaVA-NeXT-Interleave supports a spectrum of real-world deployments:
- Multimedia content analysis, including visual stories, surveillance review, and creative visual editing.
- Temporal and spatial reasoning in video and 3D robotics, via detailed captioning and VQA over sequences of frames or views.
- Document and slide summarization by sequentially interleaving page or slide images and analyzing the combined context.
Deployment is facilitated by an open-source release with detailed instructions and code for both training and inference (an illustrative inference sketch follows the list below), including:
- Guidelines for assembling interleaved datasets.
- Tunable input arrangement (order, patching).
- Extension templates for new vision encoders or LLMs (Li et al., 10 Jul 2024).
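For reference, an inference call through the community Hugging Face port might look like the sketch below; the checkpoint name, chat template, and generation settings are assumptions to be verified against the released code and model card.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed community checkpoint name; verify against the Hugging Face hub.
model_id = "llava-hf/llava-interleave-qwen-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

images = [Image.open("slide_1.png"), Image.open("slide_2.png")]  # illustrative inputs
# Assumed ChatML-style template used by the Qwen-based checkpoints.
prompt = (
    "<|im_start|>user <image><image>\n"
    "Summarize the argument made across these two slides.<|im_end|>"
    "<|im_start|>assistant\n"
)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```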
7. Related Research and Outlook
Advances derived from LLaVA-NeXT-Interleave inform further directions:
- Adaptive granularity techniques (e.g., AVG-LLaVA (Lan et al., 20 Sep 2024)) build on the interleaved framework by reducing visual token redundancy and improving efficiency, illustrating downstream extensibility of the approach.
- Plug-and-play multimodal connectors (e.g., DCI (Cuong et al., 13 Jun 2025)) can enhance or complement interleaved baselines—providing semantic fusion or early convergence in structured reasoning settings.
A plausible implication is that the interleaved token paradigm, coupled with diverse instructional data and open benchmarks, provides a robust launch point for universal, scenario-agnostic LMM systems where task- and modality-specific engineering becomes increasingly unnecessary.