Staged & Cascaded Architectures in Multimodal Models

Updated 21 April 2026

Staged and Cascaded Architectures are structured designs that tokenize and interleave multi-modal data into a unified stream, enabling shared parameterization and cross-modal transfer.
They employ stage-wise token extraction and cascaded interleaving strategies to preserve temporal and structural details across domains such as text, vision, and audio.
These architectures enhance computational efficiency and scalability through token fusion, hierarchical generation, and adaptive training objectives tailored to complex data modalities.

Unified and Interleaved Token Models provide a principled approach to modeling, understanding, and generating data that spans multiple modalities, such as text, images, audio, time series, and actions. All relevant information is represented as a single, unified stream of tokens that a backbone model processes with minimal architectural divergence across modalities. These tokens may be discrete (subword units, quantized visual or audio codes, etc.) or continuous (latent representations), and the sequence may interleave tokens from distinct modalities according to temporal or structural logic. This paradigm enables shared parameterization, efficient scaling, unified training objectives, and strong cross-modal transfer, while introducing new challenges in tokenization strategy, context mixing, scheduling, and information balancing.

1. Foundational Principles and Token Stream Design

Unified and Interleaved Token Models are founded on the principle that all modalities, each with distinct statistical and structural properties, can be mapped into a shared representational space as a serialized, interleaved token stream. The design varies by modality but generally involves:

Tokenization:
- Text: Subword/BPE units (standard LLM practice).
- Vision: Patch-wise or VQ-code discretization (e.g., VAE/VQGAN codes, binary tokenizers with massive codebooks such as $2^{128}$ in UniWeTok (Zhuang et al., 15 Feb 2026)).
- Audio: Semantic and acoustic units via residual vector quantization (Llama-Mimi (Sugiura et al., 18 Sep 2025), SODA (Manakul et al., 18 Feb 2026)).
- Time Series: Quantile binning or continuous embedding (MSE-ITT (Koval et al., 23 Sep 2025)).
- Actions/States: Concatenation of context information with returns and actions into unified state vectors (UTR (Tian et al., 24 Oct 2025)).
- Other domains (recommenders, gestures, etc.): Mixture-of-experts quantization, cross-feature concatenation, or residual quantization (UniTok (Hou et al., 17 Nov 2025), Gelina (Guichoux et al., 13 Oct 2025)).
Interleaving Strategies:
- Strict alternation by time (as in sequential decision models (Tian et al., 24 Oct 2025), Uni-World VLA (Liu et al., 28 Mar 2026)).
- Utterance-level or structural alternation (PaDT interleaves Visual Reference Tokens and text (Su et al., 2 Oct 2025), SODA interleaves text and audio (Manakul et al., 18 Feb 2026)).
- Dynamic injection at generation (OneFlow inserts image latents when a special token is emitted by the text head (Nguyen et al., 3 Oct 2025)).
Unified Backbone:
- Transformers (bidirectional or decoder-only) process the complete token stream, with minimal architectural modality branching.
- Optional modality-specific adapters (e.g., QKV/FFN splits, U-Net heads for vision, modality-specific output projections).
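The residual vector quantization mentioned for audio tokenization can be illustrated with a toy example. The sketch below is a minimal illustration under simplifying assumptions, not the tokenizer of any cited model: it quantizes a single scalar rather than a feature vector, and the codebooks in `CODEBOOKS` are hypothetical values chosen by hand. Each stage encodes whatever residual the previous stage left behind, so adding stages refines the reconstruction.

```python
from typing import List, Sequence


def nearest_index(value: float, codebook: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value: float, codebooks: Sequence[Sequence[float]]) -> List[int]:
    """Quantize value stage by stage; each stage codes the remaining residual."""
    codes: List[int] = []
    residual = value
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        residual -= codebook[idx]  # what is left for the next, finer stage
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[float]]) -> float:
    """Reconstruct by summing the selected entry of each stage used."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, codes))


# Hypothetical toy codebooks: one coarse stage followed by two finer ones.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.0625, 0.0, 0.0625],
]

codes = rvq_encode(0.8, CODEBOOKS)
print(codes)                              # [2, 0, 2]
print(rvq_decode(codes, CODEBOOKS))       # 0.8125
print(rvq_decode(codes[:1], CODEBOOKS))   # 1.0 (coarse stage only)
```

Using only the first code gives a coarse reconstruction, and each additional code reduces the error, which is why the number of quantizers trades sequence length against fidelity.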

The tight interleaving and unification allow a single model instance to leverage all context, enable in-context learning and cross-modal transfer, and facilitate both understanding and generation tasks.
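One common way to serialize several modalities into a single stream is to give each modality its own range of a shared vocabulary and to frame non-text segments with boundary markers. The following is a minimal sketch of that idea; the vocabulary sizes, the `BOI`/`EOI` marker names, and the `to_unified` helper are all hypothetical and do not reproduce the layout of any model cited here.

```python
from typing import List, Sequence, Tuple

TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 512   # hypothetical visual codebook size

# Assumed layout of the unified vocabulary:
# [0, 1000) text ids, [1000, 1512) image codes, then two boundary markers.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
BOI = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE  # begin-of-image marker (1512)
EOI = BOI + 1                             # end-of-image marker (1513)

Segment = Tuple[str, Sequence[int]]


def to_unified(segments: Sequence[Segment]) -> List[int]:
    """Flatten (modality, ids) segments into one token stream, in order."""
    stream: List[int] = []
    for modality, ids in segments:
        if modality == "text":
            stream.extend(ids)
        elif modality == "image":
            stream.append(BOI)
            stream.extend(IMAGE_OFFSET + i for i in ids)  # shift into image range
            stream.append(EOI)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream


def modality_of(token: int) -> str:
    """Recover the modality of a unified token from its id range."""
    if token < IMAGE_OFFSET:
        return "text"
    if token < BOI:
        return "image"
    return "marker"


stream = to_unified([("text", [5, 7]), ("image", [0, 511]), ("text", [9])])
print(stream)                            # [5, 7, 1512, 1000, 1511, 1513, 9]
print([modality_of(t) for t in stream])
```

Because every token carries its modality implicitly in its id range, a single embedding table and output head can serve the whole stream, while modality-specific adapters remain optional.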

2. Core Architectural Instantiations Across Domains

This paradigm has been realized across a spectrum of domains, each presenting domain-specific unification strategies while maintaining the shared principles:

Model/Domain	Interleaving Scheme	Token Modalities	Backbone/Heads
OneFlow (Nguyen et al., 3 Oct 2025)	Text & image blocks, insertion-based text	Discrete (text), continuous image latents	Bidirectional Transformer, Edit Flow+FM
PaDT (Su et al., 2 Oct 2025)	Patch-based VRTs with text	Text, patch tokens	LLM with VRT dynamic expansion + decoder
UTR/UDT (Tian et al., 24 Oct 2025)	Fused return-state-action tokens	RL trajectory	Transformer or CNN
SODA (Manakul et al., 18 Feb 2026)	Alternating utterance-level text, semantic+acoustic	Text, semantic/acoustic	Qwen3 Transformer
NextFlow (Zhang et al., 5 Jan 2026)	Flat concatenation, multi-scale image scales	Text, multi-scale visual	Decoder-only Transformer
VINO (Chen et al., 5 Jan 2026)	VLM+VAE latents, interleaved conditioning	Text, image, video latents	Diffusion Transformer
TokenFormer (Zhou et al., 15 Apr 2026)	Multi-field, sequence, target candidates	Mixed categorical/sequential	Transformer (BFTS+NLIR)
UniTok (Hou et al., 17 Nov 2025)	Codebook-based, MoE, multiexpert	Cross-domain items	Transformer, MoE routing
Llama-Mimi (Sugiura et al., 18 Sep 2025)	Semantic and acoustic tokens, framed by markers	Text, interleaved audio	Llama3 Transformer
Gelina (Guichoux et al., 13 Oct 2025)	Fixed-rate alternation (speech:gesture)	Speech, gestures	Causal transformer

Approaches may employ dynamic embedding tables (PaDT), 3D rotary or interleaving positional encodings (Mogao (Liao et al., 8 May 2025), NextFlow), BFTS attention schemes (TokenFormer), or token compression for efficiency (UniCompress (Wang et al., 11 Mar 2026)).

3. Training Objectives, Scheduling, and Loss Mechanisms

Unified interleaved token models require loss formulations and sampling algorithms that accommodate multimodal, possibly asynchronous or hierarchical data. Key approaches include:

Sequential Next-Token Prediction (NTP): Core in autoregressive models (NextFlow (Zhang et al., 5 Jan 2026), SODA (Manakul et al., 18 Feb 2026), Llama-Mimi (Sugiura et al., 18 Sep 2025), etc.), yielding a single cross-entropy loss over a unified vocabulary.
Insertion- and Deletion-Based Objectives: OneFlow models text as a continuous-time insertion CTMC, using Δ-Bernoulli, Poisson, and bag-of-tokens losses over predicted insertions and deletions, and flow-matching for images (Nguyen et al., 3 Oct 2025).
Hierarchical / Multi-Scale Generation: NextFlow transitions from next-token prediction for text to next-scale prediction for images: each scale's codebook indices are sampled autoregressively, with per-scale reweighting by token count (Zhang et al., 5 Jan 2026).
Parallel or Interleaved Generation: OneFlow’s algorithm interleaves edit steps and flow steps for parallel text/image synthesis, yielding concurrency and efficiency (Nguyen et al., 3 Oct 2025).
Selective/Weighted Training: MSE-ITT employs SALMON alignment and Salient Token Weighting to focus the loss on cross-modal dependencies (Koval et al., 23 Sep 2025); PaDT’s robust per-token cross-entropy masks out negative VRTs (Su et al., 2 Oct 2025).
Plug-In Compression Losses: UniCompress introduces additional reconstruction and codebook consistency losses to support token-efficient modeling within otherwise standard NTP frameworks (Wang et al., 11 Mar 2026).
Joint Autoregressive/Likelihood and Diffusion Training: Mogao and VINO combine cross-entropy for text with diffusion/flow-matching objectives for vision, supporting joint AR and denoising (Liao et al., 8 May 2025, Chen et al., 5 Jan 2026).

These losses are often blended, with scheduling or weighting (text:vision, coarse:fine) and, in advanced settings, adaptive reweighting or domain-specific calibration (UniTok (Hou et al., 17 Nov 2025), TokenFormer (Zhou et al., 15 Apr 2026)).
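As an illustration of blending with per-modality weights, the sketch below computes a weighted next-token cross-entropy over a unified stream. It is a simplified, assumption-laden example: the function names, the weight dictionary, and the weight-normalized averaging are illustrative choices and are not taken from any of the cited models.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def blended_ntp_loss(
    logits_seq: Sequence[Sequence[float]],
    targets: Sequence[int],
    modalities: Sequence[str],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of per-token cross-entropy, weighted by target modality."""
    total = 0.0
    norm = 0.0
    for logits, target, modality in zip(logits_seq, targets, modalities):
        w = weights[modality]
        total += w * cross_entropy(logits, target)
        norm += w
    if norm == 0.0:
        raise ValueError("all tokens have zero weight")
    return total / norm


# Two positions over a hypothetical 4-token unified vocabulary.
logits_seq = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
targets = [1, 0]
modalities = ["text", "image"]

print(blended_ntp_loss(logits_seq, targets, modalities, {"text": 1.0, "image": 1.0}))
print(blended_ntp_loss(logits_seq, targets, modalities, {"text": 1.0, "image": 0.0}))
```

Setting a modality's weight to zero removes its tokens from the objective, while intermediate values shift the balance between, for example, text and vision tokens.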

4. Efficiency and Scaling: Sequence Length, Compute, and Compression

Unified interleaved models offer large potential efficiency gains, but sequence length can become a computational bottleneck, especially with fine tokenizations or long-horizon tasks. Approaches to efficiency include:

Token Fusion: UTR merges return, state, and action into a single vector, reducing RL trajectory length from $3T$ to $T$. Because self-attention cost is quadratic in sequence length, this yields a 9x lower self-attention cost, along with improved generalization bounds and empirical FLOPs and runtime savings exceeding 67% (Tian et al., 24 Oct 2025).
Hierarchical Generation: NextFlow’s next-scale strategy enables $1024\times1024$ image synthesis in 5s using KV-caching and only generating incremental new tokens per scale (Zhang et al., 5 Jan 2026).
Token Compression: UniCompress demonstrates that pooled visual tokens (stride-2) with learnable global meta-tokens can cut image token count by up to 4x, with only minor impact on VQA, CLIPScore, and FID across understanding and generation (Wang et al., 11 Mar 2026). The plug-in design avoids full LLM retraining.
Interleaved Curriculum: Data mixing (pure text, T2I, editing, interleaved video/text) and staged curriculum are widely used to maximize throughput and effective scaling (Mogao, NextFlow, SODA).
Scaling Laws: SODA's IsoFLOP analysis finds that optimal data size grows 1.6x faster than optimal model size; smaller models trained beyond the compute-optimal point gain runtime efficiency with only a modest quality penalty (Manakul et al., 18 Feb 2026).
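The arithmetic behind token fusion and stride-2 pooling can be checked with a small sketch. The code below is illustrative only: the helper names are hypothetical, fusion is modeled as plain concatenation of per-step values, pooling is assumed to be 2x2 averaging over a grid of scalar tokens, and the learnable meta-tokens described above are omitted.

```python
from typing import List, Sequence


def interleave_trajectory(returns, states, actions) -> List[List[float]]:
    """One token each for return, state, and action per step: length 3T."""
    tokens: List[List[float]] = []
    for r, s, a in zip(returns, states, actions):
        tokens.extend([[r], list(s), list(a)])
    return tokens


def fuse_trajectory(returns, states, actions) -> List[List[float]]:
    """One concatenated token per step: length T."""
    return [[r] + list(s) + list(a) for r, s, a in zip(returns, states, actions)]


def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention."""
    return seq_len * seq_len


def pool_2x2(grid: Sequence[Sequence[float]]) -> List[List[float]]:
    """Average-pool a token grid with stride 2 in both dimensions."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid dimensions must be even")
    return [
        [(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


T = 5
returns = [1.0] * T
states = [[0.0, 0.1]] * T
actions = [[1.0]] * T

long_seq = interleave_trajectory(returns, states, actions)
short_seq = fuse_trajectory(returns, states, actions)
print(len(long_seq), len(short_seq))                                     # 15 5
print(attention_pairs(len(long_seq)) // attention_pairs(len(short_seq)))  # 9

grid = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(pool_2x2(grid))                                                    # [[1.0, 2.0], [3.0, 4.0]]
```

Tripling the sequence length multiplies the number of attention pairs by nine, and halving both grid dimensions leaves a quarter of the tokens, which matches the 9x and 4x factors quoted above.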

5. Cross-Modal Transfer and Task Coverage

The unified and interleaved framework enables efficient knowledge transfer and task coverage beyond simple conditional generation:

Bidirectional and Multi-task Support: NextFlow, VINO, and Mogao enable instruction following, editing, and in-context multi-turn exchange, as well as video/sequence generation (Zhang et al., 5 Jan 2026, Chen et al., 5 Jan 2026, Liao et al., 8 May 2025).
Token-Level Cross-Modal References: PaDT’s VRTs, derived from input image patch features and dynamically injected, are used as direct dense-prediction anchors for detection, segmentation, or grounding, outperforming coordinate-generation schemes and static codebooks (Su et al., 2 Oct 2025).
Alignment Mechanisms: MSE-ITT's SALMON alignment and Salient Token Weighting (STW) explicitly align text and time-series tokens through selective token-level reweighting (Koval et al., 23 Sep 2025); OneFlow aligns text and images via hierarchical scheduling (Nguyen et al., 3 Oct 2025).
Autonomous Decision-Making: Uni-World VLA alternates frame prediction and trajectory planning in a tightly interleaved stream, ensuring plans remain causally consistent with current world models, crucial for closed-loop settings (Liu et al., 28 Mar 2026).
Unified Recommendations: In domains such as e-commerce, recommender models such as UniTok and TokenFormer flatten disparate categorical, sequential, and candidate features into a single stream, enabling cross-domain and zero-shot transfer without retraining (Hou et al., 17 Nov 2025, Zhou et al., 15 Apr 2026).

Cross-modal mapping and synergy in these models are achieved via careful embedding design, per-modality adapters or gating, and explicit architectural strategies to prevent subspace collapse or semantic dominance.

6. Limitations, Trade-offs, and Open Research Questions

Despite their architectural appeal and empirical gains, unified and interleaved token models confront several structural trade-offs and open problems:

Context Length and Information Dilution: Fine-grained tokenizations (multiscale visual, audio) can expand sequences to thousands of tokens, challenging both memory and sequence modeling capacity. Compression (e.g., UniCompress (Wang et al., 11 Mar 2026)), hierarchical transformers, or pooling/meta-tokens partially alleviate this, yet may lose spatial/temporal fidelity.
Semantic–Acoustic Trade-offs: In models like Llama-Mimi, increasing the number of acoustic quantizers boosts audio fidelity (e.g., SpeakerSim: 0.346 to 0.474), but degrades language coherence and content retention, a trade-off that is not fully resolved even in large models (Sugiura et al., 18 Sep 2025).
Collapse and Discriminability: TokenFormer demonstrates Sequential Collapse Propagation, where naive mixing of low-rank static fields and sequence tokens degrades sequence expressiveness—a phenomenon mitigated by architectural attention design and non-linear gating (Zhou et al., 15 Apr 2026).
Alignment Complexity: The balance between modality-specific processing and true parameter sharing is complex. Selective expert routing (MSE-ITT (Koval et al., 23 Sep 2025), UniTok (Hou et al., 17 Nov 2025)), dynamic embedding tables (PaDT (Su et al., 2 Oct 2025)), and learnable query tokens (VINO (Chen et al., 5 Jan 2026)) provide mechanisms, but require careful pretraining and calibration to avoid domain or modality imbalance.
Diffusion / AR Integration: Models blending diffusion and autoregressive factorization (Mogao, VINO, OneFlow) face architectural and objective blending challenges, often requiring curriculum learning, multi-head training, or custom scheduling.
Latency and Parallelism: Autoregressive sampling is inherently sequential, leading to higher inference latency (notably in speech/gesture synthesis, with Gelina's real-time factor (RTF) ≈ 1.47 (Guichoux et al., 13 Oct 2025)). Non-AR or insertion-based flows (OneFlow (Nguyen et al., 3 Oct 2025)) offer some gains, but their general applicability remains an open research question.
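To make the latency figure concrete, the sketch below computes a real-time factor under the common definition of processing time divided by the duration of the generated output; whether the cited work measures RTF in exactly this way is an assumption. The function name and the timings are hypothetical.

```python
def real_time_factor(processing_seconds: float, output_seconds: float) -> float:
    """Ratio of wall-clock generation time to the duration of the output."""
    if output_seconds <= 0:
        raise ValueError("output duration must be positive")
    return processing_seconds / output_seconds


# Hypothetical timings: 14.7 s of compute to produce 10 s of speech/gesture.
rtf = real_time_factor(14.7, 10.0)
print(round(rtf, 2))   # 1.47
print(rtf > 1.0)       # True: generation is slower than real time
```

Under this definition, values above 1 mean the system cannot keep up with streaming playback, while values below 1 leave headroom for real-time use.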

Future advances may require more adaptive tokenization (e.g., hierarchical, domain- or context-sensitive), information-theoretic balancing across modalities, and further work on hardware-efficient transformer models and sparsity.

7. Impact and Empirical Evaluation

Unified and Interleaved Token Models have demonstrated competitive or superior performance across a variety of benchmarks:

Generation Metrics: OneFlow achieves FID ≈ 12.1 (vs. AR+FM: 12.2), DPG ≈ 79.1, CLIPScore ≈ 26.6 on text-image tasks (Nguyen et al., 3 Oct 2025); PaDT delivers SOTA open-vocabulary detection (mAP 34.0 vs. 17.5–19.2 in prior work) and segmentation (cIoU 73.4 vs. 69.2) (Su et al., 2 Oct 2025).
Cross-Modal Understanding: OneFlow increases VQA accuracy (57.8 vs. 55.0), and mixed-modal pretraining yields further gains (Nguyen et al., 3 Oct 2025). SODA achieves ASR WER 5.0% zero-shot and S2ST BLEU >20 on several languages (Manakul et al., 18 Feb 2026).
Generality and Transfer: UniTok attains +51.89% NDCG@10 on Tools, +84.5% Recall@10 on Cellphones without per-domain retraining, and further outperforms all retrained baselines in zero-shot transfer (Hou et al., 17 Nov 2025).
Scaling Trends: Scaling laws indicate that clever data-model balancing (SODA (Manakul et al., 18 Feb 2026)), sequence length reduction (UTR (Tian et al., 24 Oct 2025)), and token compression (UniCompress (Wang et al., 11 Mar 2026)) enable real-world deployment in data- or budget-constrained settings.
Expressiveness: The ability to perform concurrent, iterative, or hierarchical mixed-modal generation (OneFlow), tightly coupled prediction/planning (Uni-World VLA), and direct dense prediction from LMs (PaDT) represents unification of task paradigms previously considered distinct.

Unified and Interleaved Token Models, through their algorithmic and architectural innovations, have established a new regime for scalable, generalizable, and efficient multimodal machine learning, with direct empirical impact on vision, language, speech, audio, recommendation, and sequential-decision domains.