LLaVA-NeXT-Interleave: Unified Multimodal Model

Updated 9 September 2025
  • The paper introduces a unified architecture that interleaves visual and text tokens, eliminating the need for scenario-specific tuning.
  • It leverages a modular pipeline—combining a vision encoder, intermediate projector, and instruction-tuned LLM—to achieve leading performance on diverse benchmarks.
  • Comprehensive evaluations using the M4-Instruct dataset and LLaVA-Interleave Bench demonstrate robust cross-domain generalization and emergent zero-shot adaptation.

LLaVA-NeXT-Interleave is an open-source large multimodal model (LMM) designed to natively handle complex visual and multimodal reasoning across multi-image, video (multi-frame), 3D (multi-view), and multi-patch (single high-resolution image) scenarios within a unified, interleaved token framework. It builds upon the LLaVA-NeXT family by introducing architectural and data innovations that enable generalization, zero-shot adaptation, and cross-domain instruction following beyond single-image tasks. The model is supported by a purpose-built multi-domain dataset (M4-Instruct) and a comprehensive evaluation suite (LLaVA-Interleave Bench), and demonstrates leading performance on a wide range of multimodal benchmarks.

1. Modular Architecture and Interleaved Tokenization

LLaVA-NeXT-Interleave generalizes the LLaVA-NeXT single-image foundation by adopting a modular, three-part pipeline:

  • Vision encoder: Converts each image or frame (or high-resolution patch) into fixed-length feature embeddings. SigLIP-400M is commonly used, leveraging “any resolution” designs where high-resolution images are subdivided into smaller patches.
  • Intermediate projector: Typically a two-layer MLP that aligns vision embeddings with the LLM’s input space.
  • Instruction-tuned LLM: Receives a sequence of “interleaved” visual and text tokens, enabling seamless reasoning over multimodal contexts.

The central method involves arranging multimodal input as a linear sequence:

⟨image₁⟩ Text₁ ⟨image₂⟩ Text₂ … ⟨image_N⟩ Text_N
Here, ⟨image_i⟩ signifies an embedded visual unit (patch, frame, or view), and intervening text tokens provide context, questions, or instructions. This interleaved representation eliminates the need for scenario-specific architectures—in contrast to prior approaches which treat video, 3D, or multi-image tasks separately—allowing all modalities to be processed within one unified Transformer-based paradigm (Li et al., 10 Jul 2024).
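The pipeline and the interleaved format can be made concrete with a short sketch. In the code below, the convolutional patchify layer stands in for a pretrained SigLIP-style encoder and the embedding table stands in for the instruction-tuned LLM; only the two-layer MLP projector mirrors the described design, and all dimensions are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class InterleaveSketch(nn.Module):
    """Minimal sketch of the three-part pipeline. The patchify layer stands in for a
    pretrained SigLIP-style encoder, and the embedding table stands in for the LLM;
    only the two-layer MLP projector mirrors the described design."""

    def __init__(self, vision_dim=1152, llm_dim=4096, vocab_size=32000, patch=14):
        super().__init__()
        self.patchify = nn.Conv2d(3, vision_dim, kernel_size=patch, stride=patch)  # encoder stub
        self.projector = nn.Sequential(                                            # two-layer MLP
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, llm_dim)                        # LLM stub

    def build_sequence(self, images, text_chunks):
        """Assemble <image_1> Text_1 <image_2> Text_2 ... <image_N> Text_N as one
        embedding sequence that a unified Transformer would consume."""
        parts = []
        for img, ids in zip(images, text_chunks):
            feats = self.patchify(img.unsqueeze(0)).flatten(2).transpose(1, 2).squeeze(0)
            parts.append(self.projector(feats))   # projected visual tokens for <image_i>
            parts.append(self.text_embed(ids))    # embedded text tokens for Text_i
        return torch.cat(parts, dim=0)


model = InterleaveSketch()
images = [torch.randn(3, 336, 336) for _ in range(2)]             # frames, views, or patches
texts = [torch.randint(0, 32000, (8,)), torch.randint(0, 32000, (6,))]
sequence = model.build_sequence(images, texts)
print(sequence.shape)                                             # (2 * 576 + 14, 4096)
```

In the actual model, each ⟨image_i⟩ contributes a block of projected patch tokens in this way, and the concatenated sequence is consumed by the full Transformer rather than by a bare embedding table.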

2. Interleaved Data Format: Design and Implications

The interleaved data format merges image and text information into a contiguous token stream, supporting:

  • Flexible modality ordering: Images and text can be alternated in their natural order or grouped (e.g., all images in front), facilitating both training and inference strategies.
  • Unified treatment of scenarios: The same pipeline handles single-image, multi-image, video (multi-frame), and multi-view (3D) inputs.
  • In-context multimodal learning: Demonstrative examples (interleaved images and texts) can be presented directly in the input, fostering multimodal in-context learning and compositional reasoning.

By adopting a single, scenario-agnostic template, LLaVA-NeXT-Interleave both simplifies system design (removing scenario-aware tokenization) and enables broad generalization and transfer.
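A small, hypothetical helper illustrates the flexible modality ordering described above. The <image> placeholder string and the two arrangements are a sketch of the idea, not the exact prompt template shipped with the model.

```python
from typing import List

def render_prompt(text_chunks: List[str], n_images: int, interleaved: bool = True) -> str:
    """Render a scenario-agnostic prompt with <image> placeholders.

    interleaved=True : <image> Text_1 <image> Text_2 ...    (natural order)
    interleaved=False: <image> <image> ... Text_1 Text_2 ...(all images in front)
    Placeholder syntax here is illustrative only.
    """
    if interleaved:
        pieces = []
        for i, chunk in enumerate(text_chunks):
            if i < n_images:
                pieces.append("<image>")
            pieces.append(chunk)
        return "\n".join(pieces)
    # Grouped arrangement: all image placeholders first, then the text.
    return "\n".join(["<image>"] * n_images + text_chunks)


# Example: a two-image "spot the difference" style instruction.
print(render_prompt(["First photo.", "Second photo. What changed between them?"], n_images=2))
print(render_prompt(["First photo.", "Second photo. What changed between them?"],
                    n_images=2, interleaved=False))
```

Either arrangement feeds the same unified pipeline; only the position of the visual tokens in the sequence changes.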

3. M4-Instruct: Multi-Domain Instructional Corpus

The capabilities of LLaVA-NeXT-Interleave derive from the scale and heterogeneity of the M4-Instruct dataset, comprising:

  • 1,177,600 samples collected from 41 datasets, spanning 14 canonical multimodal tasks.
  • Task selection includes:
    • Multi-image: Tasks such as “spot the difference,” visual storytelling, edit instruction generation, multi-image puzzles, and image-dialogue.
    • Multi-frame (video): Video detailed captioning and VQA, using sequential frame decomposition.
    • Multi-view (3D): Embodied VQA and 3D scene QA leveraging spatial viewpoint variation.
    • Multi-patch: Single high-resolution images split into spatial patches, preserving fine detail for single-image tasks.
  • Domain variation: Samples range from real-world scenes and cartoons to synthetic images and surveillance video, promoting cross-domain generalization and robustness.

The dataset’s breadth is designed to elicit emerging behaviors by presenting diverse yet unified instruction-following multimodal signals during training.
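All four regimes reduce to "a list of images plus interleaved text": videos contribute sampled frames, 3D scenes contribute views, and a single high-resolution image contributes spatial patches. The sketch below covers only the multi-patch case; the 2x2 grid, the patch size, and the appended low-resolution overview are assumptions in the spirit of "any resolution" processing, not the paper's exact recipe.

```python
from PIL import Image

def split_into_patches(image: Image.Image, grid: int = 2, patch_size: int = 336):
    """Split a high-resolution image into a grid x grid set of patches plus a
    downscaled overview, so it can be treated like a multi-image input.
    Grid size, patch size, and the overview patch are illustrative choices."""
    resized = image.resize((grid * patch_size, grid * patch_size))
    patches = []
    for row in range(grid):
        for col in range(grid):
            box = (col * patch_size, row * patch_size,
                   (col + 1) * patch_size, (row + 1) * patch_size)
            patches.append(resized.crop(box))
    overview = image.resize((patch_size, patch_size))  # global-context patch
    return patches + [overview]

# A 2x2 split plus overview yields 5 "images" that enter the interleaved sequence
# exactly as frames of a video or views of a 3D scene would:
# tiles = split_into_patches(Image.open("photo.jpg"))
```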

4. Benchmarks, Evaluation, and Comparative Performance

LLaVA-NeXT-Interleave is systematically evaluated using the LLaVA-Interleave Bench and auxiliary cross-domain leaderboards:

  • Multi-image: Outperforms open-source competitors on NLVR2, Q-Bench, and BLINK (in-domain), as well as MathVerse-mv and SciVerse-mv (out-of-domain).
  • Video tasks: Achieves leading accuracy on video QA and open-ended generation benchmarks (NExT-QA, STAR), using either pooled vision tokens or video-DPO fine-tuning.
  • 3D/multi-view: Delivers state-of-the-art results on ScanQA, nuScenes VQA, and 3D-LLM benchmarks—operating solely on multi-view images without point clouds.
  • Single-image integrity: Maintains or surpasses baseline performance on classic tasks, demonstrating that expanded modalities do not degrade single-image reasoning.

Experimental findings confirm that a unified architecture and interleaved format can concurrently improve multimodal and classic benchmarks, contrasting with prior modular approaches that require domain-specific tuning (Li et al., 10 Jul 2024).
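One implementation detail referenced above is the pooled-vision-token option for video, which keeps many frames within the LLM's context budget. The sketch below applies simple spatial average pooling to each frame's projected token grid; the 24x24 grid and the 2x2 pooling factor are assumptions rather than the reported configuration.

```python
import torch
import torch.nn.functional as F

def pool_frame_tokens(frame_tokens: torch.Tensor, grid: int = 24, factor: int = 2) -> torch.Tensor:
    """Spatially average-pool one frame's vision tokens.

    frame_tokens: (grid * grid, dim) tokens from the projector for a single frame.
    Returns ((grid // factor) ** 2, dim) pooled tokens.
    Grid size and pooling factor are illustrative.
    """
    dim = frame_tokens.shape[-1]
    x = frame_tokens.reshape(grid, grid, dim).permute(2, 0, 1).unsqueeze(0)  # (1, dim, grid, grid)
    x = F.avg_pool2d(x, kernel_size=factor)                                  # (1, dim, grid/f, grid/f)
    return x.squeeze(0).flatten(1).transpose(0, 1)                           # (tokens, dim)

# 16 frames x 576 tokens = 9216 tokens; 2x2 pooling cuts this to 16 x 144 = 2304.
frames = [torch.randn(576, 4096) for _ in range(16)]
video_tokens = torch.cat([pool_frame_tokens(f) for f in frames], dim=0)
print(video_tokens.shape)  # torch.Size([2304, 4096])
```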

5. Emergent Properties and Transfer Capabilities

Training on a mixed, interleaved objective yields novel emergent behaviors:

  • Task and modality transfer: Capable of transferring single-image skills (e.g., humor analysis) to multi-image narratives, or generating Twitter-style posts from both images and videos.
  • Generalized cross-document reasoning: Able to summarize or compose answers over multiple documents (e.g., slide deck or multi-document VQA) without explicit training for such transfer.
  • Zero-shot/few-shot adaptation: Compositional training enables unanticipated generalization to unseen instruction types and domains.

These effects suggest internal acquisition of latent, cross-modal compositional mechanisms, supported by the diversity and format of the training data.

6. Practical Applications and Implementation

LLaVA-NeXT-Interleave supports a spectrum of real-world deployments:

  • Multimedia content analysis, including visual stories, surveillance review, and creative visual editing.
  • Temporal and spatial reasoning in video and 3D robotics, via detailed captioning and VQA over sequences of frames or views.
  • Document and slide summarization by sequentially interleaving page or slide images and analyzing the combined context.

Deployment is facilitated by an open-source release with detailed instructions and code for both training and inference, including (see the usage sketch after this list):

  • Guidelines for assembling interleaved datasets.
  • Tunable input arrangement (order, patching).
  • Extension templates for new vision encoders or LLMs (Li et al., 10 Jul 2024).
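For inference, a minimal HuggingFace-style sketch is shown below. The checkpoint identifier, the USER/ASSISTANT prompt format, and the <image> placeholder are assumptions based on community conversions, so the official repository's scripts and prompt templates should be treated as authoritative.

```python
# Hypothetical usage sketch: checkpoint name and prompt format are assumptions,
# not taken from the paper; consult the official release for exact instructions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two slide images interleaved with text, mirroring the slide-summarization use case above.
slides = [Image.open("slide_1.png"), Image.open("slide_2.png")]
prompt = "USER: <image>\nSlide 1. <image>\nSlide 2. Summarize the deck in two sentences. ASSISTANT:"

inputs = processor(images=slides, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```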

Advances derived from LLaVA-NeXT-Interleave inform further directions:

  • Adaptive granularity techniques (e.g., AVG-LLaVA (Lan et al., 20 Sep 2024)) build on the interleaved framework by reducing visual token redundancy and improving efficiency, illustrating downstream extensibility of the approach.
  • Plug-and-play multimodal connectors (e.g., DCI (Cuong et al., 13 Jun 2025)) can enhance or complement interleaved baselines—providing semantic fusion or early convergence in structured reasoning settings.

A plausible implication is that the interleaved token paradigm, coupled with diverse instructional data and open benchmarks, provides a robust launch point for universal, scenario-agnostic LMM systems where task- and modality-specific engineering becomes increasingly unnecessary.
