Autoregressive Multimodal LLMs

Updated 2 June 2026
  • Autoregressive Multimodal LLMs are unified generative models that extend next-token prediction to various modalities like text, images, and audio.
  • They typically employ a single decoder-only Transformer, augmented with modality-specific adapters, projection layers, or mixture-of-experts components to keep outputs coherent across modalities.
  • Training leverages unified losses and staged curricula that enable robust multimodal reasoning, spatial grounding, and dynamic mode switching.

Autoregressive Multimodal LLMs (MLLMs) are unified generative models that extend the autoregressive next-token prediction paradigm of large language models (LLMs) to multiple data modalities, such as text, images, audio, video, motion, and coordinate sequences. These models ingest, process, and generate complex multimodal sequences in a left-to-right causal fashion, using shared or extended vocabularies and unified embedding spaces. Key innovations include unified tokenization schemes (spatial, discrete, or continuous), modality-specific architectural adaptations, dynamic reasoning-mode switching, automated cross-modal grounding, and scalable training regimes that preserve both general language and modality-specific capabilities.
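The left-to-right causal formulation can be written compactly. As a sketch, assume a multimodal sequence x_1, ..., x_T in which each x_t is either a text token or a discretized modality token drawn from a shared vocabulary. The notation is illustrative and not taken from any of the cited papers.

```latex
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

This covers the discrete-token case only. Models that emit continuous embeddings need a different per-step loss, which this sketch omits.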

1. Architectural Foundations and Unified Tokenization

Autoregressive MLLMs rely on a single backbone Transformer, typically decoder-only, whose core next-token prediction objective is extended to accommodate new modalities via unified or specialized tokenization approaches. Depending on the model, input modalities are discretized (e.g., VQ‐VAE visual tokens, "visual words"), projected into the language space via adapters or X2L interfaces, or represented as continuous embeddings (e.g., causal VAE latents for motion). Token streams are concatenated for joint attention, often with specialized boundary, modality, or task tokens (Wu et al., 2024, Ren et al., 11 Dec 2025).

Major tokenization approaches include:

Approach                  | Description                                     | Key Models
--------------------------|-------------------------------------------------|----------------------
Spatial Discretization    | VQ‐VAE/VQGAN codes, patch tokens                | Liquid, AR-Omni, JAM
Visual Words              | Map image patches to text-vocab distributions   | VW-LMM
Diffusion Timestep Tokens | Recursive, order-sensitive visual language      | DDT-LLaMA
Point Tokens/Embeddings   | Continuous/embedded waypoints for trajectories  | AutoTraces
Audio/Speech Quantization | Single-codebook acoustic tokens                 | AR-Omni, Llama-AVSR
Causal Continuous Latents | Streaming continuous embeddings for motion      | LLaMo
Explicit Spatial Tokens   | Grid and offset tokens for 2D reasoning         | GETok

This design allows a unified decoder to process arbitrary interleavings of text and non-text modalities. A joint vocabulary and/or an interleaved sequence supports "any-to-any" generation without requiring modality-specific decoders (Cheng et al., 25 Jan 2026, Wu et al., 2024).
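One way to picture a joint vocabulary with interleaved sequences is to offset image codebook indices past the text vocabulary and wrap each image span in boundary tokens. The sketch below assumes this layout. The vocabulary sizes, the `BOI`/`EOI` token ids, and all function names are hypothetical and do not reflect any specific model's configuration.

```python
from typing import List, Tuple

TEXT_VOCAB_SIZE = 32_000     # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical VQ codebook size

# Boundary tokens placed after both id ranges (hypothetical layout).
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image
EOI = BOI + 1                                # end-of-image
JOINT_VOCAB_SIZE = EOI + 1

Segment = Tuple[str, List[int]]  # (modality name, modality-local ids)


def image_code_to_joint_id(code: int) -> int:
    """Shift a codebook index into the image range of the joint vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_sequence(segments: List[Segment]) -> List[int]:
    """Concatenate text and image segments into one causal token stream."""
    sequence: List[int] = []
    for modality, ids in segments:
        if modality == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text id out of range")
            sequence.extend(ids)
        elif modality == "image":
            sequence.append(BOI)
            sequence.extend(image_code_to_joint_id(c) for c in ids)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return sequence


def modality_of(token_id: int) -> str:
    """Recover which id range a joint-vocabulary token belongs to."""
    if 0 <= token_id < TEXT_VOCAB_SIZE:
        return "text"
    if token_id < BOI:
        return "image"
    if token_id < JOINT_VOCAB_SIZE:
        return "special"
    raise ValueError(f"token id out of range: {token_id}")
```

Because every token lives in one id space, a single output softmax over `JOINT_VOCAB_SIZE` entries can in principle emit text or image tokens at any position, which is the property the unified decoder relies on.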

2. Modality Integration and Representation Strategies

Modality integration is achieved through a combination of architectural and representational enhancements:

This hybridization unlocks the model's ability to ground, reason, and generate in modality-appropriate representations, while maintaining autoregressive tractability.

3. Training Paradigms and Supervision

Training autoregressive MLLMs requires both multi-modal data and supervisory objectives that bridge language and non-language domains:

4. Inference Procedures and Interaction Dynamics

Inference in autoregressive MLLMs hinges on task- and modality-aware decoding, including:

  • Stability vs. Creativity via Decoding State: Finite-state decoding automata choose between deterministic (greedy) and generative (sampling) modes, which is essential for covering tasks that range from speech recognition and synthesis (ASR/TTS) to open-ended generation (T2I) and interactive dialog (Cheng et al., 25 Jan 2026).
  • Dynamic Mode Switching: Hybrid models (e.g., SwimBird) learn to switch among pure text, pure vision, and interleaved modes, deciding positionally when to emit discrete tokens or continuous embeddings based on the input and prompt (Tong et al., 5 Feb 2026).
  • Task Routing and Expert Selection: Models such as UnifiedMLLM emit explicit task and grounding tokens that are parsed by a routing function, which then dispatches context and arguments to downstream expert modules (classifiers, segmenters, local image editors) (Li et al., 2024).
  • Spatial and Temporal Chaining: For tasks such as trajectory forecasting and motion generation, sequence tokenization (with point or latent tokens) ensures that each output step is conditioned only on the causal past (including visual scene, prior predictions, and goal metadata), supporting flexible horizon and long-range temporal coherence (Wang et al., 9 Mar 2026, Li et al., 12 Feb 2026).
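The state-dependent decoding described in the first bullet can be sketched as a two-state automaton that decodes greedily by default and switches to sampling inside a generative span. This is a minimal illustration under stated assumptions: the control token ids (`BOI_ID`, `EOI_ID`), the transition table, and the caller-supplied `next_logits` function standing in for the model are all hypothetical, and the automata used in the cited work may differ.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

GREEDY, SAMPLE = "greedy", "sample"

BOI_ID, EOI_ID = 3, 4  # hypothetical control tokens
TRANSITIONS: Dict[int, str] = {
    BOI_ID: SAMPLE,  # entering a generative span: switch to sampling
    EOI_ID: GREEDY,  # leaving the span: return to greedy decoding
}


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits: Sequence[float], state: str,
               rng: random.Random, temperature: float = 1.0) -> int:
    """Argmax in the greedy state, categorical sampling otherwise."""
    if state == GREEDY:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def decode(next_logits: Callable[[List[int]], Sequence[float]],
           prompt: List[int], max_new_tokens: int, eos_id: int,
           seed: int = 0) -> Tuple[List[int], List[str]]:
    """Generate tokens, recording which state produced each one."""
    rng = random.Random(seed)
    tokens = list(prompt)
    state = GREEDY
    states_used: List[str] = []
    for _ in range(max_new_tokens):
        token = pick_token(next_logits(tokens), state, rng)
        tokens.append(token)
        states_used.append(state)
        if token == eos_id:
            break
        # A control token changes the state for the following steps.
        state = TRANSITIONS.get(token, state)
    return tokens, states_used
```

Each step conditions only on `tokens` generated so far, so the sketch also respects the causal chaining described in the last bullet.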

5. Empirical Performance and Capabilities

Autoregressive MLLMs have demonstrated strong performance across a spectrum of multimodal tasks, including:

  • Multimodal Understanding: Vision-language understanding (VQA, captioning, OCR, grounding) at or above the level of established baselines. For instance, Liquid achieves zero-shot scores of 68.0 on VQAv2 and 56.1 on GQA (Wu et al., 2024).
  • Multimodal Generation: High-fidelity text-to-image (e.g., Liquid FID=5.47, DDT-LLaMA GenEval=0.66), speech synthesis (AR-Omni real-time factor=0.88), streaming motion generation (>30 FPS in LLaMo), and flexible-length trajectory generation (AutoTraces IEAcc=99.92%) (Wu et al., 2024, Pan et al., 20 Apr 2025, Cheng et al., 25 Jan 2026, Wang et al., 9 Mar 2026).
  • Spatial Reasoning and Grounding: GETok significantly outperforms patch/box representations on referring tasks (reported score ≈ 88.2%), while grid/offset tokens yield precise, iterative localization (Ren et al., 11 Dec 2025).
  • Task Generalization and Scalability: UnifiedMLLM attains state-of-the-art cIoU on referring segmentation (RefCOCO = 76.3%), Joint Autoregressive Mixture and Liquid demonstrate multi-task alignment, and modular expert routing supports task extensibility (Li et al., 2024, Aiello et al., 2023).
  • Robustness and Efficiency: Models employing efficient adapters, frozen encoder/backbone strategies, or unified embedding spaces show reduced catastrophic forgetting and lower compute/memory cost (e.g., ARMOR adds only ~0.7B new parameters and needs roughly an order of magnitude less compute than training a unified model from scratch) (Sun et al., 9 Mar 2025).
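Segmentation figures such as the cIoU result above are easier to interpret with the metric spelled out. The sketch below assumes cIoU denotes cumulative IoU (total intersection over total union across all samples), as is common in referring segmentation, and contrasts it with a per-sample mean. The cited paper may handle details such as empty masks differently, and the function names here are illustrative.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as nested 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def cumulative_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Sum intersections and unions over all samples, then divide once."""
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average per-sample IoU; an empty union scores 0.0 (an assumption)."""
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Cumulative IoU weights large objects more heavily than the per-sample mean does, so the two numbers are not directly comparable across papers.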

6. Limitations and Prospective Directions

Current limitations of autoregressive MLLMs include:

  • Inference Cost and Sequence Lengths: Incorporation of large VLM/LLM backbones and long modality streams can cause latency bottlenecks, especially in real-time or streaming contexts (Wang et al., 9 Mar 2026, Cheng et al., 25 Jan 2026).
  • Quantization and Fidelity Gaps: Discrete codebook-based approaches may suffer from reconstruction artifacts or insufficient granularity (noted for images, speech, and motion), though diffusion-timestep tokens and continuous latent models mitigate some of these issues (Pan et al., 20 Apr 2025, Li et al., 12 Feb 2026).
  • Physics and Dynamic Constraint Oversight: Models generating physical trajectories or motion often do not enforce dynamic or kinodynamic admissibility unless explicitly constrained or regularized (Wang et al., 9 Mar 2026).
  • Expert Dependence: Some frameworks require routing to external or expert models for particular tasks, potentially limiting true “unification” (Li et al., 2024).
  • Scaling and Cross-Modality Interference: Earlier work observed that joint training degraded text-task performance at small scale, though scaling experiments with Liquid show that this effect vanishes at larger model sizes (Wu et al., 2024).
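To make the expert-dependence limitation concrete, the sketch below shows what routing to external modules can look like: the language model emits task and region markers, and a small router parses them and calls a registered expert. The `<task>` and `<region>` markup, the registry, and every function name are hypothetical. They illustrate the pattern only and are not the token format of UnifiedMLLM or any other cited system.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[float, float, float, float]
Expert = Callable[[str, Optional[Box]], str]

TASK_RE = re.compile(r"<task>(\w+)</task>")        # hypothetical task marker
REGION_RE = re.compile(r"<region>([^<]*)</region>")  # hypothetical box marker


def parse_output(text: str) -> Tuple[Optional[str], Optional[Box], str]:
    """Split generated text into (task name, optional box, remaining text)."""
    task_match = TASK_RE.search(text)
    region_match = REGION_RE.search(text)
    box: Optional[Box] = None
    if region_match:
        parts = [float(p) for p in region_match.group(1).split(",")]
        if len(parts) != 4:
            raise ValueError("region must contain four comma-separated numbers")
        box = (parts[0], parts[1], parts[2], parts[3])
    plain = REGION_RE.sub("", TASK_RE.sub("", text)).strip()
    task = task_match.group(1) if task_match else None
    return task, box, plain


def route(text: str, experts: Dict[str, Expert]) -> str:
    """Dispatch to the expert named by the task marker, if there is one."""
    task, box, plain = parse_output(text)
    if task is None:
        return plain  # ordinary text answer, no expert involved
    if task not in experts:
        raise KeyError(f"no expert registered for task: {task}")
    return experts[task](plain, box)
```

In this pattern the final output quality depends on the registered experts rather than on the language model alone, which is why such designs are only partially unified.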

Future directions highlighted include the integration of physics-informed modules, generalized multi-agent contexts, plug-and-play extension to new modalities (video, audio, 3D), and improvements in tokenization or decoder sharing to bridge fidelity gaps relative to diffusion models (Wang et al., 9 Mar 2026, Wu et al., 2024, Cheng et al., 25 Jan 2026). The paradigm of recursive, order-sensitive "visual languages" and learnable spatial token vocabularies opens new avenues for grounded reasoning and RL-compatible multimodal agents (Pan et al., 20 Apr 2025, Ren et al., 11 Dec 2025).


Selected References
