Unified & Interleaved Token Models

Updated 21 April 2026

Unified and Interleaved Token Models are architectures that integrate various data types into one continuous token stream using shared backbones and harmonized tokenization.
They employ strategies like strict unification, modality-specific experts, and hybrid interleaving to effectively blend text, vision, audio, and other modalities.
These models deliver efficiency gains and state-of-the-art performance in tasks such as multimodal generation, understanding, and recommendation while addressing challenges like modality interference.

Unified and Interleaved Token Models refer to a class of architectures, tokenization strategies, and training objectives that enable foundation models to process, generate, and understand sequences that interleave multiple modalities (such as text, vision, audio, time series, and actions) within a single continuous token stream. Instead of relying on isolated modality-specific pipelines or fixed causal orderings, these models use shared or harmonized backbones with carefully designed tokenization and interleaving schemes, so that all modalities can be modeled, aligned, and generated jointly.

1. Architectural Foundations of Unified and Interleaved Token Models

Unified and interleaved token modeling generally organizes multimodal data, whether discrete (text, categorical items), continuous (images, signals), or hybrid (latents), into a single flat or structured sequence that is processed by a shared or partially shared sequence model. There are three canonical strategies:

Strict unification with a shared decoder: All input modalities are quantized or embedded into tokens drawn from a universal vocabulary, and a single autoregressive or non-autoregressive backbone (often a Transformer) predicts the next token over this space. Visual, audio, and text tokens are handled and predicted by the same head or a simple modality-distinguishing head. Notable examples include NextFlow (Zhang et al., 5 Jan 2026), Mogao (Liao et al., 8 May 2025), and SODA (Manakul et al., 18 Feb 2026).
Unified architectures with modality-specific experts: The backbone maintains a unified sequence of interleaved tokens, while internal layers use specialized parameterizations (e.g., per-modality QKV projections, FFNs, or normalization) to process particular token types and still allow information exchange via cross-modal attention. This approach is exemplified by MSE-ITT for text and time series (Koval et al., 23 Sep 2025), TokenFormer for recommendation (Zhou et al., 15 Apr 2026), and PaDT for unified multi-modal vision (Su et al., 2 Oct 2025).
Hybrid schemes with interleaved synchronization: Some models achieve unification by synchronizing modalities (e.g., by alternating generations of actions and world-states, or speech and gesture tokens) within a single stream, while utilizing appropriate tokenization and alignment heads. Gelina (Guichoux et al., 13 Oct 2025) and Uni-World VLA (Liu et al., 28 Mar 2026) are representative.
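
The second strategy can be illustrated with a toy layer. The following is a minimal sketch, not the method of any cited model: attention is shared across the whole interleaved sequence (without learned QKV projections, for brevity), and a per-modality linear map stands in for a modality-specific FFN expert. The names `expert_layer`, `shared_causal_attention`, and the modality labels are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_causal_attention(states: List[Vector]) -> List[Vector]:
    """Each position attends to itself and all earlier positions,
    whatever their modality; this is where cross-modal exchange happens."""
    dim = len(states[0])
    mixed = []
    for i, query in enumerate(states):
        context = states[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, context)) for d in range(dim)]
        )
    return mixed


def expert_layer(
    states: List[Vector], modalities: List[str], experts: Dict[str, Matrix]
) -> List[Vector]:
    """Shared attention first, then a per-token map chosen by modality (assumed design)."""
    mixed = shared_causal_attention(states)
    return [matvec(experts[m], h) for m, h in zip(modalities, mixed)]
```

In this sketch the routing is a plain dictionary lookup on the token's modality label, so only the per-token transformation is specialized while every token still sees the full interleaved context.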

These models rely on shared or dynamically expandable embedding tables and positional encoding schemes that support complex interleaving (e.g., time, spatial, and scale axes for vision; semantic and acoustic axes for audio).
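
One simple way to realize a shared embedding table is to give each modality a contiguous range of ids in a single vocabulary. The sketch below assumes this offset layout purely for illustration; `UnifiedVocab`, `interleave`, and the vocabulary sizes are hypothetical and are not taken from any of the cited systems.

```python
from typing import Dict, List, Sequence, Tuple


class UnifiedVocab:
    """Maps (modality, local id) pairs into one flat id space using offsets."""

    def __init__(self, sizes: Dict[str, int]) -> None:
        self.sizes = dict(sizes)
        self.offsets: Dict[str, int] = {}
        total = 0
        for name, size in sizes.items():  # insertion order fixes the layout
            self.offsets[name] = total
            total += size
        self.total_size = total

    def encode(self, modality: str, local_id: int) -> int:
        if not 0 <= local_id < self.sizes[modality]:
            raise ValueError(f"{local_id} is out of range for {modality}")
        return self.offsets[modality] + local_id

    def decode(self, token_id: int) -> Tuple[str, int]:
        for name, start in self.offsets.items():
            if start <= token_id < start + self.sizes[name]:
                return name, token_id - start
        raise ValueError(f"{token_id} is outside the unified vocabulary")


def interleave(
    vocab: UnifiedVocab, segments: Sequence[Tuple[str, Sequence[int]]]
) -> List[int]:
    """Flatten modality-tagged segments into one stream of unified ids."""
    stream: List[int] = []
    for modality, local_ids in segments:
        stream.extend(vocab.encode(modality, i) for i in local_ids)
    return stream
```

A single prediction head over `total_size` ids can then score text and codebook tokens together, and `decode` recovers which modality a predicted id belongs to.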

2. Tokenization and Interleaving Schemes

A defining feature of these models is the design of modality-independent or modality-harmonized discrete tokenizations. Common approaches include:

Vector Quantization (VQ/VQ-VAE) and Codebooks: Used extensively for images, audio, and actions, as in NextFlow’s dual-codebook image tokens (Zhang et al., 5 Jan 2026), SODA’s Mimi codebooks (Manakul et al., 18 Feb 2026), Llama-Mimi’s semantic/acoustic quantizers (Sugiura et al., 18 Sep 2025), and UniWeTok’s $2^{128}$ binary codebook (Zhuang et al., 15 Feb 2026).
Latent Representation Alignment: Some models, e.g., OneFlow (Nguyen et al., 3 Oct 2025), combine discrete token insertions (for text) with continuous latent flows (for vision), and synchronize their sampling schedules via explicit time/interleaving algorithms.
Dynamic or Expandable Embedding Tables: Patch-as-Decodable-Token (PaDT) (Su et al., 2 Oct 2025) extends the LLM embedding table on-the-fly for each image by adding Visual Reference Tokens, supporting dense visual output tasks.
Domain and Modality Experts: UniTok (Hou et al., 17 Nov 2025) routes unified embeddings through domain- or modality-specific codebook experts on top of a shared backbone, then quantizes each for unified item recommendation.
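
The codebook idea behind the first approach can be sketched in a few lines. This is a generic nearest-neighbour vector quantizer plus a residual variant that emits one index per quantizer level; it is an illustration under simplified assumptions (fixed toy codebooks, squared Euclidean distance), not the tokenizer of any cited model. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to the vector."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Turn continuous vectors into discrete token ids, one per vector."""
    return [nearest_code(v, codebook) for v in vectors]


def residual_quantize(vector: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize with several codebooks in turn; each level encodes what the
    previous levels left unexplained, yielding one index per level."""
    residual = list(vector)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices
```

With the residual variant, adding levels refines the reconstruction but also multiplies the number of discrete tokens produced per input vector.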

Sequences are interleaved according to task requirements, with examples including utterance-level (text–audio alternation) (Manakul et al., 18 Feb 2026), time-aligned sequence alternation (speech–gesture sync) (Guichoux et al., 13 Oct 2025), and frame-action chaining for world modeling (Liu et al., 28 Mar 2026). Position encoding schemes are augmented to provide not only sequence position but also spatial, temporal, and (in vision/audio) scale/depth axes (e.g., NextFlow’s 3D RoPE (Zhang et al., 5 Jan 2026), Mogao’s IL-RoPE (Liao et al., 8 May 2025)).
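
To make the idea of multi-axis positions concrete, here is a minimal sketch that assigns each token a (time, row, column) triple. It assumes one particular convention, in which text tokens advance the time axis one step each and all patches of an image share a single time step while varying in row and column. This convention and the name `multi_axis_positions` are illustrative assumptions, not the exact scheme of NextFlow or Mogao.

```python
from typing import List, Sequence, Tuple, Union

Segment = Tuple[str, Union[int, Tuple[int, int]]]
Position = Tuple[int, int, int]


def multi_axis_positions(segments: Sequence[Segment]) -> List[Position]:
    """Assign (time, row, col) ids to an interleaved text/image sequence.

    segments: ("text", n_tokens) or ("image", (height, width)) in stream order.
    """
    positions: List[Position] = []
    t = 0
    for kind, spec in segments:
        if kind == "text":
            for _ in range(spec):
                positions.append((t, 0, 0))  # text only moves along time
                t += 1
        elif kind == "image":
            height, width = spec
            for row in range(height):
                for col in range(width):
                    positions.append((t, row, col))  # patches share one time step
            t += 1
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return positions
```

Each axis could then feed its own slice of a rotary or learned positional encoding, so that spatial neighbours stay close even though they are far apart in the flattened stream.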

3. Objectives, Training Paradigms, and Losses

Unified and interleaved models are trained with a mix of objectives that must be balanced across modalities and tasks:

Next-Token Prediction (NTP) Cross-Entropy: Standard for interleaved text and discrete tokens (images, audio, time series), e.g., SODA and NextFlow. In unified models, the same loss covers both natural language and codebook indices for non-text modalities.
Hierarchical or Modality-Synchronized Losses: OneFlow uses Edit Flow (continuous-time Markov chain insertion/deletion for text) and Flow Matching loss for image latents, synchronizing both via an interleaved time schedule (Nguyen et al., 3 Oct 2025). Gelina combines AR CE for speech tokens and flow-matching plus geodesic loss for gesture tokens (Guichoux et al., 13 Oct 2025).
Contrastive and Alignment Losses: Cross-modal alignment, mutual-information calibration (UniTok (Hou et al., 17 Nov 2025)), and pre/post distillation for enhancing semantics of quantized tokens (UniWeTok (Zhuang et al., 15 Feb 2026)) are used to ensure cross-modal information consistency.
Compression-Aware Losses: When employing token reduction mechanisms (e.g., UniCompress (Wang et al., 11 Mar 2026)), auxiliary reconstruction and codebook consistency losses are introduced for compressed representations.
Multitask or Curriculum Training: Many models employ progressive curricula that move from single-modality training to multimodal training and then to interleaved multitask training (e.g., Mogao (Liao et al., 8 May 2025), VINO (Chen et al., 5 Jan 2026)). Generation and alignment are further refined with reinforcement learning via prefix-tuning (NextFlow (Zhang et al., 5 Jan 2026)) and with classifier-free guidance (Mogao, VINO).
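
The next-token objective over a unified vocabulary can be sketched as follows. The code computes a standard cross-entropy per token and then averages it per modality before applying optional weights. The per-modality weighting is an assumption added here to illustrate loss balancing; it is not a recipe taken from the cited papers, and `unified_ntp_loss` is a hypothetical name.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def token_nll(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def unified_ntp_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    modalities: Sequence[str],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Cross-entropy over one shared vocabulary, reported per modality.

    Returns (weighted total, mean loss per modality). Modalities without an
    explicit weight default to 1.0.
    """
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for row, target, modality in zip(logits, targets, modalities):
        sums[modality] = sums.get(modality, 0.0) + token_nll(row, target)
        counts[modality] = counts.get(modality, 0) + 1
    per_modality = {m: sums[m] / counts[m] for m in sums}
    weights = weights or {}
    total = sum(weights.get(m, 1.0) * loss for m, loss in per_modality.items())
    return total, per_modality
```

Averaging per modality before weighting keeps a long run of image or audio tokens from dominating the text loss simply because it contributes more positions.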

4. Empirical Results, Efficiency, and Scaling

Unified and interleaved token models consistently report:

Compression and Efficiency Gains: Sequence length and compute reduction are significant, e.g., UTR reduces RL trajectory sequence lengths by 3× and attention cost by 9× compared to DT (Tian et al., 24 Oct 2025); OneFlow halves FLOPs compared to AR+Flow Matching baselines (Nguyen et al., 3 Oct 2025); UniCompress cuts visual token count 4× and improves inference latency by up to 42% (Wang et al., 11 Mar 2026).
State-of-the-Art Multimodal Performance: NextFlow achieves state-of-the-art multimodal image generation and unified text-image understanding (Zhang et al., 5 Jan 2026); Mogao and VINO match or outperform task-specific T2I and VQA models while enabling in-context interleaved and editing tasks (Liao et al., 8 May 2025, Chen et al., 5 Jan 2026).
Scalability and Data Laws: SODA's IsoFLOP scaling study shows that the compute-optimal amount of data grows ~1.6× faster than the compute-optimal model size for audio, in contrast to the roughly 1:1 relationship reported for text LLMs (Manakul et al., 18 Feb 2026).
Modality-Specific Insights: Llama-Mimi demonstrates that increasing the number of acoustic quantizers increases audio fidelity but can degrade long-term linguistic coherence (Sugiura et al., 18 Sep 2025). Models such as UniTok show substantial cross-domain generalizability, maintaining accuracy across previously unseen recommendation domains without retraining (Hou et al., 17 Nov 2025).

These results support the notion that unified interleaved architectures can reach or surpass specialized systems while yielding efficiency and flexibility.
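
The relationship between the 3× sequence-length reduction and the 9× attention-cost reduction reported for UTR follows from the quadratic cost of full self-attention. The sketch below works through that arithmetic, assuming the baseline spends three tokens per timestep (as Decision Transformer does with return-to-go, state, and action) and the unified layout spends one. The function names are hypothetical.

```python
from typing import Dict


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def compare_layouts(timesteps: int, tokens_per_step: int = 3) -> Dict[str, float]:
    """Compare a multi-token-per-step layout against one token per step."""
    separate = timesteps * tokens_per_step
    unified = timesteps
    return {
        "length_ratio": separate / unified,
        "attention_ratio": attention_pairs(separate) / attention_pairs(unified),
    }
```

For any number of timesteps the length ratio equals `tokens_per_step` and the attention ratio equals its square, which is why a 3× shorter sequence yields a 9× cheaper attention map.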

5. Applications and Task Coverage

The unified and interleaved token framework has enabled progress in a spectrum of domains:

Multimodal Generation: Joint and interleaved text–image (Mogao, NextFlow, OneFlow), text–audio–speech (SODA, Llama-Mimi), speech–gesture (Gelina), and vision–language–action for driving/planning (Uni-World VLA).
Multimodal Understanding and Alignment: VQA, image captioning, referring-expression tasks, open-vocabulary detection, and segmentation are unified under shared backbones and compatible training routines (PaDT (Su et al., 2 Oct 2025), VINO (Chen et al., 5 Jan 2026), Mogao (Liao et al., 8 May 2025)).
Cross-Domain Foundation Models: Recommendation and representation learning are advanced by single-tokenizer frameworks covering multiple fields, items, or behavior streams (UniTok (Hou et al., 17 Nov 2025), TokenFormer (Zhou et al., 15 Apr 2026)).
Resource-Constrained Deployment: Token compression and blending (UniCompress (Wang et al., 11 Mar 2026)) and lightweight expert routing promise practical application for on-device, real-time, or embedded multimodal systems.
Zero-Shot and Editing Capabilities: Emergent capabilities such as zero-shot image editing, reference-driven video generation, in-context multimodal prompting, and task transfer arise directly from the unified tokenization and sequence modeling framework (Liao et al., 8 May 2025, Chen et al., 5 Jan 2026).

6. Challenges, Trade-offs, and Open Problems

Unified and interleaved token models yield significant advantages, but critical technical and methodological challenges remain:

Modality Interference and Collapse: Naive unification can lead to phenomena such as Sequential Collapse Propagation, where low-rank static embeddings degrade sequence representation expressivity (TokenFormer (Zhou et al., 15 Apr 2026)). Deep-fusion and gating mechanisms are essential to avoid such failure modes.
Tokenization Trade-offs: Increasing quantizer granularity enhances fine detail or fidelity (e.g., in audio and vision) but can harm sequence coherence, as shown in Llama-Mimi (Sugiura et al., 18 Sep 2025) and SODA (Manakul et al., 18 Feb 2026). Dynamic, hierarchical, or adaptively learned tokenizations are plausible solutions but remain under investigation.
Cross-Modal Alignment: Ensuring that information propagates between modalities at the right layers and at the right granularity is non-trivial, demanding advances in attention masking, dynamic embedding tables, and alignment objectives (MSE-ITT (Koval et al., 23 Sep 2025), PaDT (Su et al., 2 Oct 2025)).
Sequence Length and Compute: Even in a unified model, visual and audio modalities can yield prohibitively long token streams, making compression (UniCompress (Wang et al., 11 Mar 2026)), KV-caching (NextFlow (Zhang et al., 5 Jan 2026)), or hybrid CNN-attention designs (UDC (Tian et al., 24 Oct 2025)) necessary for practical systems.
Task-Specific and Scaling Hyperparameters: Scale-reweighting, curriculum schedules, and expert regularization must be tuned to prevent decoding instability or mode collapse, especially at scale or as new domains/modalities are introduced.
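
The tokenization and sequence-length trade-offs above can be made tangible with a rough token-budget calculation. The sketch below assumes patch-based image tokenization and a frame-based audio codec whose quantizer levels are flattened into the stream. The specific numbers used with it (patch size, frame rate, quantizer count, compression factor) are illustrative inputs, not figures from the cited papers.

```python
import math


def image_tokens(height: int, width: int, patch: int, compression: int = 1) -> int:
    """Tokens for one image: one per patch, optionally reduced by a compression factor."""
    grid = math.ceil(height / patch) * math.ceil(width / patch)
    return math.ceil(grid / compression)


def audio_tokens(seconds: float, frame_rate_hz: float, num_quantizers: int) -> int:
    """Tokens for an audio clip when every quantizer level is flattened into the stream."""
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * num_quantizers
```

For example, `image_tokens(256, 256, 16)` gives 256 tokens and a 4× compression brings it to 64, while `audio_tokens(10, 12.5, 8)` gives 1000 tokens, a count that scales linearly with the number of quantizers kept.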

A plausible implication is that future unified architectures will require dynamically adaptive tokenization and hierarchical attention mechanisms, as well as principled hybridization of autoregressive, diffusion, and flow-matching objectives tailored per modality and task.

Collectively, unified and interleaved token models represent a paradigm shift in multimodal foundation modeling, enabling rich joint understanding and generation across heterogeneous signals, with scalable architectures and emergent in-context capabilities. Recent work spans language–vision (Zhang et al., 5 Jan 2026, Liao et al., 8 May 2025), audio (Manakul et al., 18 Feb 2026, Sugiura et al., 18 Sep 2025), sequential decision-making (Tian et al., 24 Oct 2025), world modeling (Liu et al., 28 Mar 2026), and recommendation (Hou et al., 17 Nov 2025, Zhou et al., 15 Apr 2026). The technical maturity demonstrated across these domains suggests that this is now an established and rapidly evolving area foundational to future AI systems.