Autoregressive Multimodal LLMs
- Autoregressive Multimodal LLMs are unified generative models that extend next-token prediction to various modalities like text, images, and audio.
- They employ a single decoder-only Transformer enhanced by modality-specific adapters, projection layers, and mixture-of-experts for coherent output.
- Training leverages unified losses and staged curricula that enable robust multimodal reasoning, spatial grounding, and dynamic mode switching.
Autoregressive Multimodal LLMs (MLLMs) are unified generative models that extend the autoregressive next-token prediction paradigm of LLMs to multiple data modalities, such as text, images, audio, video, motion, and coordinate sequences. These models ingest, process, and generate multimodal sequences in a left-to-right causal fashion, using shared or extended vocabularies and unified embedding spaces. Key innovations range from unified tokenization schemes (spatial, discrete, or continuous) and modality-specific architectural adaptations to dynamic reasoning-mode switching, automated cross-modal grounding, and scalable training regimes that preserve both general language and modality-specific capabilities.
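The left-to-right causal generation described above corresponds to the standard autoregressive factorization of a sequence likelihood. As a sketch, with $x_t$, $T$, and $\theta$ as illustrative symbols that do not appear in the cited works' notation here:

$$
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\left(x_t \mid x_{<t}\right)
$$

Each element $x_t$ may be a text token, a discrete modality code, or a continuous embedding, depending on the tokenization scheme; only the form of the per-step conditional changes.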
1. Architectural Foundations and Unified Tokenization
Autoregressive MLLMs rely on a single backbone Transformer, typically decoder-only, whose core next-token prediction objective is extended to accommodate new modalities via unified or specialized tokenization approaches. Depending on the model, input modalities are discretized (e.g., VQ-VAE visual tokens, "visual words"), projected into the language space via adapters or X2L interfaces, or represented as continuous embeddings (e.g., causal VAE latents for motion). Token streams are concatenated for joint attention, often with specialized boundary, modality, or task tokens (Wu et al., 2024, Ren et al., 11 Dec 2025).
Major tokenization approaches include:
| Approach | Description | Key Models |
|---|---|---|
| Spatial Discretization | VQ-VAE/VQGAN codes, patch tokens | Liquid, AR-Omni, JAM |
| Visual Words | Map image patches to text-vocab distributions | VW-LMM |
| Diffusion Timestep Tokens | Recursive, order-sensitive visual language | DDT-LLaMA |
| Point Tokens/Embeddings | Continuous/embedded waypoints for trajectories | AutoTraces |
| Audio/Speech Quantization | Single-codebook acoustic tokens | AR-Omni, Llama-AVSR |
| Causal Continuous Latents | Streaming continuous embeddings for motion | LLaMo |
| Explicit Spatial Tokens | Grid and offset tokens for 2D reasoning | GETok |
This design allows a unified decoder to process arbitrary interleavings of text and non-text modalities, supporting "any-to-any" generation by virtue of a joint vocabulary and/or interleaved sequence—without requiring modality-specific decoders (Cheng et al., 25 Jan 2026, Wu et al., 2024).
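To make the joint-vocabulary idea concrete, the following is a minimal sketch of how text ids and image codebook indices might be merged into one causal stream. The vocabulary sizes, the id layout, and the `BOI`/`EOI` boundary tokens are assumptions chosen for illustration; they are not taken from any of the cited models.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical vocabulary layout: text ids first, then image codes, then specials.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 512
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # assumed begin-of-image marker
EOI = BOI + 1                                # assumed end-of-image marker


@dataclass
class Segment:
    modality: str   # "text" or "image"
    ids: List[int]  # ids local to that modality's own tokenizer or codebook


def to_joint_ids(segment: Segment) -> List[int]:
    """Map modality-local ids into the shared vocabulary."""
    if segment.modality == "text":
        if any(not 0 <= i < TEXT_VOCAB_SIZE for i in segment.ids):
            raise ValueError("text id out of range")
        return list(segment.ids)
    if segment.modality == "image":
        if any(not 0 <= i < IMAGE_CODEBOOK_SIZE for i in segment.ids):
            raise ValueError("image code out of range")
        # Offset image codes past the text range and wrap them in boundary tokens.
        return [BOI] + [TEXT_VOCAB_SIZE + i for i in segment.ids] + [EOI]
    raise ValueError(f"unknown modality: {segment.modality}")


def interleave(segments: List[Segment]) -> List[int]:
    """Concatenate segments, in order, into one causal token stream."""
    stream: List[int] = []
    for segment in segments:
        stream.extend(to_joint_ids(segment))
    return stream


def modality_of(token_id: int) -> str:
    """Recover which part of the joint vocabulary a token id belongs to."""
    if 0 <= token_id < TEXT_VOCAB_SIZE:
        return "text"
    if TEXT_VOCAB_SIZE <= token_id < BOI:
        return "image"
    if token_id in (BOI, EOI):
        return "special"
    raise ValueError(f"token id outside joint vocabulary: {token_id}")
```

Because every id falls in a disjoint range, a single output head over the joint vocabulary can emit either modality at any position, which is one way a decoder could avoid separate modality-specific heads.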
2. Modality Integration and Representation Strategies
Modality integration is achieved through a combination of architectural and representational enhancements:
- Adapters and Projection Layers: Frozen modality-specific encoders (e.g., CLIP, Whisper, AV-HuBERT, VQ-VAE) map raw data into fixed-length vectors, which are projected into the LLM token space via lightweight adapters or multi-layer perceptrons (Li et al., 2024, Cappellazzo et al., 2024).
- Modality-Specific Decoder Branches and MoEs: Modality-dependent "mixture-of-experts" blocks (MoT, Visual Attention Experts) enable adaptation of attention/QKV projections for distinct token types, while preserving the frozen parameters for core language modeling (Li et al., 12 Feb 2026, She et al., 2024).
- Specialized Token-Type Vocabularies: Dedicated task, grounding, spatial, or reasoning tokens are appended to signal modality boundaries, support downstream routing (UnifiedMLLM), or enable fine-grained spatial control (GETok) (Li et al., 2024, Ren et al., 11 Dec 2025).
- Hybrid Discrete-Continuous Reasoning: Certain autoregressive MLLMs jointly generate both discrete tokens (text, special markers) and continuous embeddings ("visual thoughts," motion latents) under a unified likelihood (Tong et al., 5 Feb 2026, Li et al., 12 Feb 2026).
This hybridization unlocks the model's ability to ground, reason, and generate in modality-appropriate representations, while maintaining autoregressive tractability.
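The adapter-and-projection strategy listed above can be sketched in a few lines. This is an illustrative toy written with plain Python lists rather than a tensor library: `LinearAdapter` and `build_input_embeddings` are hypothetical names, the random initialization is arbitrary, and placing the projected features before the text embeddings is one common layout rather than the method of any specific cited model.

```python
import random
from typing import List

Vector = List[float]


class LinearAdapter:
    """Hypothetical trainable projection from encoder width to LLM width."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        # Small random weights; in practice these would be learned.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, features: Vector) -> Vector:
        if len(features) != self.in_dim:
            raise ValueError("feature width does not match adapter input width")
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def build_input_embeddings(
    text_embeddings: List[Vector],
    encoder_features: List[Vector],
    adapter: LinearAdapter,
) -> List[Vector]:
    """Project frozen-encoder features and place them before the text embeddings."""
    projected = [adapter(f) for f in encoder_features]
    return projected + text_embeddings
```

Only the adapter's parameters would need gradients in this setup; the encoder that produced `encoder_features` and the LLM that consumes the result can both stay frozen.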
3. Training Paradigms and Supervision
Training autoregressive MLLMs requires both multi-modal data and supervisory objectives that bridge language and non-language domains:
- Unified Losses: Most frameworks sum cross-entropy losses over all tokens (text, modality-specific) in the sequence. Continuous modalities typically introduce regression (L2, MSE, or flow-matching) losses on continuous tokens (Tong et al., 5 Feb 2026, Li et al., 12 Feb 2026).
- Staged Curriculum: Progressive pipelines are common—starting with modality alignment (captioning/text-only pretraining), followed by instruction tuning, and, in some cases, expert fine-tuning or reinforcement learning for grounding/localization (Li et al., 2024, Ren et al., 11 Dec 2025, Peng et al., 2024).
- Adapters/Parameter-Efficient Updates: LoRA, MoE, or head-specific updates enable models to learn modality integration and generation while largely freezing core LLM or encoder parameters, guarding against catastrophic forgetting (Cappellazzo et al., 2024, Sun et al., 9 Mar 2025).
- Automated Chain-of-Thought and Reasoning Mode Curation: For tasks needing compositional or spatio-temporal reasoning, automated generation of chain-of-thought traces (via auxiliary VLMs) or dynamic reasoning-mode supervision (text-only, vision-only, interleaved) guides the model's internal rollout strategy (Wang et al., 9 Mar 2026, Tong et al., 5 Feb 2026).
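As a rough illustration of the unified-loss item above, the sketch below combines cross-entropy on discrete positions with mean squared error on continuous positions. The function names, the per-type averaging, and the `regression_weight` parameter are assumptions made for this example; the cited works may weight or normalize the terms differently, or use flow-matching instead of MSE.

```python
import math
from typing import List, Sequence, Tuple, Union

Target = Union[int, Sequence[float]]
Step = Tuple[Sequence[float], Target]  # (model output, target) for one position


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    if len(pred) != len(target):
        raise ValueError("prediction and target widths differ")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def unified_loss(steps: List[Step], regression_weight: float = 1.0) -> float:
    """Mean CE over discrete positions plus weighted mean MSE over continuous ones."""
    ce_terms: List[float] = []
    mse_terms: List[float] = []
    for prediction, target in steps:
        if isinstance(target, int):
            # Discrete position: `prediction` holds logits over the vocabulary.
            ce_terms.append(cross_entropy(prediction, target))
        else:
            # Continuous position: `prediction` is a regressed embedding.
            mse_terms.append(mean_squared_error(prediction, target))
    ce = sum(ce_terms) / len(ce_terms) if ce_terms else 0.0
    reg = sum(mse_terms) / len(mse_terms) if mse_terms else 0.0
    return ce + regression_weight * reg
```

A sequence with no continuous positions reduces to ordinary language-model cross-entropy, so text-only batches can share the same training loop.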
4. Inference Procedures and Interaction Dynamics
Inference in autoregressive MLLMs hinges on task- and modality-aware decoding, including:
- Stability vs. Creativity via Decoding State: Finite-state decoding automata choose between deterministic (greedy) and generative (sampling) modes, which is essential for moving between stability-critical speech tasks (ASR/TTS), open-ended generation (T2I), and interactive dialog (Cheng et al., 25 Jan 2026).
- Dynamic Mode Switching: Hybrid models (e.g., SwimBird) learn to switch among pure text, pure vision, and interleaved modes, deciding positionally when to emit discrete tokens or continuous embeddings based on the input and prompt (Tong et al., 5 Feb 2026).
- Task Routing and Expert Selection: Models such as UnifiedMLLM emit explicit task and grounding tokens that are parsed by a routing function, which then dispatches context and arguments to downstream expert modules (classifiers, segmenters, local image editors) (Li et al., 2024).
- Spatial and Temporal Chaining: For tasks such as trajectory forecasting and motion generation, sequence tokenization (with point or latent tokens) ensures that each output step is conditioned only on the causal past (including visual scene, prior predictions, and goal metadata), supporting flexible horizon and long-range temporal coherence (Wang et al., 9 Mar 2026, Li et al., 12 Feb 2026).
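The decoding-state idea in the first item above can be sketched as a two-state machine that switches between greedy and sampled token selection when it sees a task marker. The marker strings, the `DecodingState` class, and the default temperature are hypothetical; this is not the automaton used by AR-Omni or any other cited system.

```python
import math
import random
from typing import Sequence

# Hypothetical task markers; real models define their own special tokens.
GREEDY_TASKS = {"<asr>", "<tts>"}
SAMPLING_TASKS = {"<t2i>", "<chat>"}


class DecodingState:
    """Two-state decoder controller: 'greedy' or 'sample'."""

    def __init__(self, temperature: float = 1.0, seed: int = 0) -> None:
        self.mode = "sample"
        self.temperature = temperature
        self._rng = random.Random(seed)

    def observe(self, token: str) -> None:
        """Transition on task markers; all other tokens leave the state unchanged."""
        if token in GREEDY_TASKS:
            self.mode = "greedy"
        elif token in SAMPLING_TASKS:
            self.mode = "sample"

    def pick(self, logits: Sequence[float]) -> int:
        """Select the next token index according to the current mode."""
        if self.mode == "greedy":
            return max(range(len(logits)), key=lambda i: logits[i])
        # Temperature-scaled softmax sampling, shifted by the max for stability.
        scaled = [v / self.temperature for v in logits]
        peak = max(scaled)
        weights = [math.exp(v - peak) for v in scaled]
        return self._rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

In a generation loop, `observe` would be called on each prompt or emitted token and `pick` on each step's logits, so the same decoder can serve stability-critical and open-ended requests without changing model weights.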
5. Empirical Performance and Capabilities
Autoregressive MLLMs have demonstrated strong performance across a spectrum of multimodal tasks, including:
- Multimodal Understanding: Vision-language understanding (VQA, captioning, OCR, grounding) at or above the level of established baselines. For instance, Liquid scores 68.0 on VQAv2 and 56.1 on GQA in the zero-shot setting (Wu et al., 2024).
- Multimodal Generation: High-fidelity text-to-image (e.g., Liquid FID=5.47, DDT-LLaMA GenEval=0.66), speech synthesis (AR-Omni real-time factor=0.88), streaming motion generation (>30 FPS in LLaMo), and flexible-length trajectory generation (AutoTraces IEAcc=99.92%) (Wu et al., 2024, Pan et al., 20 Apr 2025, Cheng et al., 25 Jan 2026, Wang et al., 9 Mar 2026).
- Spatial Reasoning and Grounding: GETok significantly outperforms patch/box representations on referring tasks (≈ 88.2%), while grid/offset tokens yield precise, iterative localization (Ren et al., 11 Dec 2025).
- Task Generalization and Scalability: UnifiedMLLM attains state-of-the-art cIoU on referring segmentation (RefCOCO: 76.3%), jointly trained models (Joint Autoregressive Mixture, Liquid) demonstrate multi-task alignment, and modular expert routing supports task extensibility (Li et al., 2024, Aiello et al., 2023).
- Robustness and Efficiency: Models employing efficient adapters, frozen encoder/backbone strategies, or unified embedding spaces show reduced catastrophic forgetting and lower compute/memory cost (e.g., ARMOR adds only ~0.7B new parameters and uses an order of magnitude less compute than training a unified model from scratch) (Sun et al., 9 Mar 2025).
6. Limitations and Prospective Directions
Current limitations of autoregressive MLLMs include:
- Inference Cost and Sequence Lengths: Incorporation of large VLM/LLM backbones and long modality streams can cause latency bottlenecks, especially in real-time or streaming contexts (Wang et al., 9 Mar 2026, Cheng et al., 25 Jan 2026).
- Quantization and Fidelity Gaps: Discrete codebook-based approaches may suffer from reconstruction artifacts or insufficient granularity (noted for images, speech, and motion), though diffusion-timestep tokens and continuous latent models mitigate some of these issues (Pan et al., 20 Apr 2025, Li et al., 12 Feb 2026).
- Physics and Dynamic Constraint Oversight: Models generating physical trajectories or motion often do not enforce dynamic or kinodynamic admissibility unless explicitly constrained or regularized (Wang et al., 9 Mar 2026).
- Expert Dependence: Some frameworks require routing to external or expert models for particular tasks, potentially limiting true “unification” (Li et al., 2024).
- Scaling and Cross-Modality Interference: Earlier work observed that joint training degraded text-task performance at small scale, though scaling laws in Liquid show this effect vanishes at larger model sizes (Wu et al., 2024).
Future directions highlighted include the integration of physics-informed modules, generalized multi-agent contexts, plug-and-play extension to new modalities (video, audio, 3D), and improvements in tokenization or decoder sharing to bridge fidelity gaps relative to diffusion models (Wang et al., 9 Mar 2026, Wu et al., 2024, Cheng et al., 25 Jan 2026). The paradigm of recursive, order-sensitive "visual languages" and learnable spatial token vocabularies opens new avenues for grounded reasoning and RL-compatible multimodal agents (Pan et al., 20 Apr 2025, Ren et al., 11 Dec 2025).
Selected References
- AutoTraces: Autoregressive Trajectory Forecasting via Multimodal LLMs (Wang et al., 9 Mar 2026)
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With LLM (Li et al., 2024)
- SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs (Tong et al., 5 Feb 2026)
- LLaMo: Scaling Pretrained LLMs for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens (Li et al., 12 Feb 2026)
- Liquid: LLMs are Scalable and Unified Multi-modal Generators (Wu et al., 2024)
- AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation (Cheng et al., 25 Jan 2026)
- ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability (Sun et al., 9 Mar 2025)
- Jointly Training Large Autoregressive Multimodal Models (Aiello et al., 2023)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens (Pan et al., 20 Apr 2025)
- Grounding Everything in Tokens for Multimodal LLMs (Ren et al., 11 Dec 2025)
- Multi-modal Auto-regressive Modeling via Visual Words (Peng et al., 2024)
- MammothModa: Multi-Modal LLM (She et al., 2024)
- X-LLM: Bootstrapping Advanced LLMs by Treating Multi-Modalities as Foreign Languages (Chen et al., 2023)
- LLMs are Strong Audio-Visual Speech Recognition Learners (Cappellazzo et al., 2024)