Interleaved Visual Token in Multimodal Models
- Interleaved visual tokens form a multimodal representation strategy that alternates visual and textual tokens within a unified sequence, enabling robust cross-modal reasoning.
- It employs modality-specific tokenization with a deterministic interleaving schedule to maintain precise spatial, temporal, and contextual alignment.
- Empirical results indicate notable gains in efficiency, identity preservation, and controllability across tasks like generation, retrieval, and reasoning.
An interleaved visual token refers to a multimodal representation strategy in which visual and (often) textual token embeddings are alternately woven into a single, unified sequence for input to a transformer, diffusion model, or LLM. Unlike the conventional approach, which processes modalities in isolation or via late fusion, interleaved tokenization brings together visual and textual information at the sequence level, aligning tokens temporally, spatially, or instructionally to promote richer cross-modal attention and joint reasoning. This paradigm has become central in multimodal generation, perception, and reasoning frameworks, as it enhances grounding, preserves identity, and supports controllable, in-context computation.
1. Principles of Interleaved Visual Token Construction
The construction of interleaved visual tokens is contingent upon modality-specific tokenization and embedding, an explicit interleaving schedule, and joint alignment within a shared embedding space.
Tokenization and Embedding
- Each modality (text, static images, video frames) is separately tokenized and embedded into a fixed-dimensional vector space using modality-specific encoders, followed by a lightweight projection (usually a small MLP) to match the backbone's dimension: text typically through the language model's embedding layer, images through a patch-level visual encoder, and video through per-frame visual encoding with temporal ordering preserved (Chen et al., 5 Jan 2026).
- Low-level spatial latents (e.g., VAE outputs) may be included for improved texture and fine detail, as these are critical for diffusion-based generation or dense prediction tasks.
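As a minimal illustration of modality-specific embedding followed by a shared projection, the sketch below uses random matrices in place of real encoders; all dimensions and names (`D_TXT`, `D_IMG`, `D_MODEL`) are hypothetical, not taken from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

D_TXT, D_IMG, D_MODEL = 32, 48, 64  # hypothetical encoder/backbone widths

# Toy stand-ins for a text embedding table and a ViT-style patch encoder;
# each modality gets a lightweight linear projection into the shared
# backbone dimension D_MODEL so tokens can live in one sequence.
W_txt_proj = rng.standard_normal((D_TXT, D_MODEL)) * 0.02
W_img_proj = rng.standard_normal((D_IMG, D_MODEL)) * 0.02

def embed_text(token_ids, vocab=1000):
    table = rng.standard_normal((vocab, D_TXT)) * 0.02  # toy embedding table
    return table[np.asarray(token_ids)] @ W_txt_proj    # (n_tokens, D_MODEL)

def embed_image(patch_feats):
    # patch_feats: (n_patches, D_IMG) features from a patch encoder
    return np.asarray(patch_feats) @ W_img_proj         # (n_patches, D_MODEL)

txt = embed_text([3, 7, 11])
img = embed_image(rng.standard_normal((4, D_IMG)))
print(txt.shape, img.shape)  # both land in the shared (., 64) space
```

Once projected, the two token streams are dimensionally interchangeable and can be concatenated into one interleaved sequence.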
Interleaving Operation
- A deterministic schedule orders the tokens, governing not only the sequence but also the insertion of modality boundary markers (e.g., “vision start”/“vision end”), spatial groupings, or query tokens.
- In image–video–text settings (e.g., VINO), all image blocks (fully bracketed) precede video blocks, which precede textual embeddings; this supports spatial contiguity, temporal continuity, and controlled balancing between visual and linguistic context (Chen et al., 5 Jan 2026).
- In video-language modeling or video question answering, tokens are grouped per-frame, typically as image tokens, optional subtitles, then appended text or instructional tokens (Ataallah et al., 2024, Wang et al., 5 Oct 2025).
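The scheduling logic above can be sketched as a small pure-Python function; the marker strings and the images-then-videos-then-text ordering follow the VINO-style description, while everything else (names, block shapes) is illustrative:

```python
# Deterministic interleaving schedule: all image blocks (each fully
# bracketed by boundary markers), then video blocks, then text tokens.
# Marker strings are illustrative stand-ins for special tokens.
def interleave(image_blocks, video_blocks, text_tokens):
    seq = []
    for blk in image_blocks:            # each image fully bracketed
        seq += ["<vis_start>", *blk, "<vis_end>"]
    for blk in video_blocks:            # frames kept temporally contiguous
        seq += ["<vid_start>", *blk, "<vid_end>"]
    seq += text_tokens                  # linguistic context last
    return seq

seq = interleave(
    image_blocks=[["i0", "i1"], ["i2"]],
    video_blocks=[["f0", "f1", "f2"]],
    text_tokens=["edit", "the", "scene"],
)
print(seq)
```

Because the schedule is deterministic, the model can rely on boundary markers and block order to tell which tokens belong to which reference image or video segment.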
2. Integration into Transformer and Diffusion Architectures
Cross-Attention and Unified Embedding Spaces
- The interleaved sequence is processed by the backbone model (e.g., Multimodal Diffusion Transformer, MMDiT, or LLM), with both self-attention and cross-attention operating over all tokens.
- Each latent query for the target generation (image/video) attends to the interleaved context via standard cross-attention, $\mathrm{Attn}(Q, K, V) = \operatorname{softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$, where $Q$ is derived from the latent queries and $K$, $V$ from the interleaved sequence.
- The fixed schedule ensures that modality-specific query/key/value matrices operate over organized token groups, leveraging positional embeddings (e.g., RoPE) to align timeline or spatial reference across modalities (Chen et al., 5 Jan 2026).
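As a concrete sketch, latent queries attending over an interleaved context reduce to scaled dot-product cross-attention; the numpy version below is a minimal single-head illustration, not any specific model's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hypothetical head dimension

def cross_attend(queries, context):
    # queries: (m, d) latent query tokens for the generation target
    # context: (n, d) interleaved multimodal context tokens
    scores = queries @ context.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ context                        # (m, d)

q = rng.standard_normal((5, d))     # 5 latent queries
ctx = rng.standard_normal((12, d))  # 12 interleaved context tokens
out = cross_attend(q, ctx)
print(out.shape)  # each query output is a convex mix of context tokens
```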
Masking and Causal Attention
- For context-compression or optical language modeling tasks (e.g., VIST2), masking prevents tokens from attending to unauthorized chunks, enabling attentive “memory” of interleaved visual tokens while preserving causal structure (Jiao et al., 15 Jan 2026).
- Layer-wise compression methods can further reduce token length in early LLM blocks and upsample in later layers to recover fine detail, as evidenced in LVTC modules (Lu et al., 27 Mar 2025).
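A chunk-level mask of the kind described above can be built with a single broadcasted comparison; this sketch allows full attention within a chunk and only backward attention across chunks, which is a simplification of (not a quote from) the cited masking schemes:

```python
import numpy as np

# Toy chunk-causal mask: token i may attend to token j only if j's chunk
# is not later than i's chunk. An alternating [vis][text][vis][text]
# layout is assumed for illustration.
chunk_id = np.array([0, 0, 1, 1, 1, 2, 2, 3])  # 8 tokens in 4 chunks

mask = chunk_id[:, None] >= chunk_id[None, :]  # True where attention allowed
print(mask.astype(int))
```

The last token (chunk 3) can attend to everything, while tokens in chunk 0 cannot see later chunks, preserving the causal structure across interleaved visual "memory" and text.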
3. Taxonomy of Interleaved Visual Token Applications
| Application Domain | Interleaving Pattern | Primary Model |
|---|---|---|
| Unified image–video generation | [images] [videos] [text] [learnable query tokens] | MMDiT (VINO) (Chen et al., 5 Jan 2026) |
| Video QA | [frame image] [subtitle] [frame image] [subtitle]...[Q] | LLM (MiniGPT4-Video) (Ataallah et al., 2024) |
| Chain-of-thought reasoning | [text tokens] [visual patch tokens] [text] [visual tokens] | Bagel/LLM (ThinkMorph) (Gu et al., 30 Oct 2025) |
| Mathematical step grounding | [visual tokens for region] [step text] [visual tokens] [step] | Qwen2-VL-7B (MINT-CoT) (Chen et al., 5 Jun 2025) |
| Global OCR compression | [visual tokens for chunk] [text tokens chunk] [...repeated] | Transformer (VIST2) (Jiao et al., 15 Jan 2026) |
| Retrieval on interleaved docs | [text chunk] [image tokens] [text] [image]... | DeepSeek-VL/MLLM, MME (Zhang et al., 18 Feb 2025) |
Interleaving unifies disparate tasks—generation, editing, retrieval, reasoning, and OCR—within models that benefit from persistent, tightly aligned multimodal streams.
4. Empirical Benefits and Model Properties
A synthesis of results across domains highlights the unique advantages of the interleaving approach:
- Multi-reference Grounding and Identity Preservation: By explicitly bracketing each visual reference, models can prevent identity leakage and attribute entanglement, ensuring that information from specific input images or video sections remains grouped. VINO demonstrates strong preservation of multi-identity edits and avoids structural artifacts due to rigorous boundary marker usage (Chen et al., 5 Jan 2026).
- Instruction Following and Controllability: Cross-modal, interleaved inputs support long-form instruction following. Empirically, ablations confirm that removing interleaved structure (or special tokens) leads to loss of coherence in static/dynamic content and poorer subject preservation (A.3 in (Chen et al., 5 Jan 2026)).
- Efficiency and Token Budget Management: In settings such as VIST2, compressing and interleaving visual tokens with text chunks yields 3× faster first-token generation, 77% less memory use, and 74% lower FLOPS at a 4× compression ratio without significant performance loss (Jiao et al., 15 Jan 2026).
| Model/Task | Interleaved Structure | Core Gains |
|---|---|---|
| MiniGPT4-Video | Alternating per-frame visual/text tokens | +4.2–20.8% on video QA benchmarks |
| VINO | Modality-blocked sequence | Multi-identity preservation & edit fidelity |
| VIST2 | [vis] [text] [vis] [text]... | 3× faster first token; 77% less memory; 74% fewer FLOPS |
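The token-budget effect of the reported 4× compression ratio is simple arithmetic; the helper below is illustrative only:

```python
# Back-of-the-envelope token budget at a 4x compression ratio: roughly
# n/4 visual tokens stand in for n text tokens in the interleaved context.
def compressed_len(n_text_tokens, ratio=4):
    return -(-n_text_tokens // ratio)  # ceiling division

print(compressed_len(4096))  # 1024 visual tokens replace 4096 text tokens
```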
5. Advanced Interleaving: Region and Token-Level Adaptation
Beyond simple alternating patterns, recent approaches adapt visual token selection dynamically:
- Fine-grained Dynamic Region Selection: MINT-CoT and PaDT allocate visual tokens selectively to arbitrarily shaped, semantically meaningful regions, rather than only box or grid-based patches. For each reasoning step, an “Interleave Token” is predicted that queries for all patches relevant to the current chain-of-thought fragment, as determined by learned similarity to step context (Chen et al., 5 Jun 2025).
- Dynamic Vocabulary Expansion: In PaDT, each patch in the current image is represented in a dynamic multimodal codebook, appended to the output space at each inference step, enabling precise localization and differentiation between similar objects (Su et al., 2 Oct 2025).
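A hypothetical sketch of step-conditioned patch selection: score every patch by cosine similarity to the current reasoning-step embedding and keep those above a threshold, so the selected region can be arbitrarily shaped. The threshold, dimensions, and scoring rule here are assumptions for illustration, not the cited methods' exact mechanics:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # hypothetical feature dimension

def select_patches(step_ctx, patch_feats, threshold=0.5):
    # Cosine similarity of each patch feature to the step-context vector;
    # indices above the threshold form the (arbitrarily shaped) region.
    s = step_ctx / np.linalg.norm(step_ctx)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = p @ s
    return np.nonzero(sims > threshold)[0], sims

patches = rng.standard_normal((10, d))
step = patches[3] + 0.1 * rng.standard_normal(d)  # step aligned with patch 3
idx, sims = select_patches(step, patches)
print(idx)
```

The selected indices would then be injected as visual tokens interleaved with the next reasoning step's text.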
6. Training Strategies and Loss Functions
Unified models employing interleaved visual tokens use multitarget training objectives:
- Cross-entropy on Text and Visual Tokens: Losses often simultaneously optimize likelihood of both next text and visual tokens. In pure autoregressive settings, a single NTP (next-token prediction) loss is computed over interleaved text and image tokens (Song et al., 18 Mar 2025, Qin et al., 24 Nov 2025).
- Auxiliary Losses: For region-level tasks, binary cross-entropy supervises token-to-region associations (MINT-CoT), and MSE or perceptual losses are enforced on reconstructions (PaDT, DualToken).
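The combination of a next-token cross-entropy over the interleaved sequence with an auxiliary BCE term can be sketched as follows; the loss weight and all tensor shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_entropy(logits, targets):
    # NTP loss over an interleaved sequence: the same softmax
    # cross-entropy applies whether position t holds a text token
    # or a discrete visual token id.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def bce(probs, labels):
    # Auxiliary token-to-region supervision, labels in {0, 1}.
    eps = 1e-9
    return -(labels * np.log(probs + eps)
             + (1 - labels) * np.log(1 - probs + eps)).mean()

logits = rng.standard_normal((6, 20))        # 6 interleaved positions, 20-way
targets = rng.integers(0, 20, size=6)
region_probs = rng.uniform(0.01, 0.99, size=16)
region_labels = rng.integers(0, 2, size=16)

# Hypothetical 0.5 weight on the auxiliary term.
loss = cross_entropy(logits, targets) + 0.5 * bce(region_probs, region_labels)
print(float(loss))
```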
7. Theoretical and Practical Implications
The interleaved visual token paradigm eliminates the boundary between visual and linguistic context at the token level, fostering deeper cross-modal representation learning and supporting modalities beyond image and text, such as video and structured reasoning. This approach:
- Provides early fusion for “in-context” computation, allowing bidirectional flows of information between modalities throughout the model stack, in contrast to late-fusion or pooled representations.
- Ensures scalable, task-agnostic design; for example, VINO and MM-Interleaved achieve robust performance on both creation and editing without task-specialized modules (Chen et al., 5 Jan 2026, Tian et al., 2024).
- Enables efficient retrieval and context compression, as in TIIR and VIST2, by reducing context sequence length without sacrificing alignment or task accuracy (Zhang et al., 18 Feb 2025, Jiao et al., 15 Jan 2026).
Empirical evidence across reasoning, retrieval, OCR, and generation benchmarks repeatedly demonstrates the critical role of interleaving for grounded, precise, and controllable multimodal intelligence.