Autoregressive Unified Models
- Autoregressive unified models are frameworks that process diverse modalities—text, images, video, and audio—as token sequences via a single decoder-only transformer.
- They leverage modality-specific tokenizers and joint autoregressive factorization to predict the next token, yielding competitive performance across multiple domains.
- Future directions focus on optimizing tokenization strategies, balancing modality distributions, and integrating reinforcement objectives for improved cross-domain robustness.
Autoregressive unified models are a paradigm in machine learning that employ a single autoregressive backbone to handle heterogeneous modalities and tasks—such as image generation, image understanding, video synthesis, speech processing, scientific data generation, text-to-SQL parsing, or financial risk estimation—by representing all target outputs and conditions as sequences of tokens. These tokens can be discrete or continuous and can encode language, images, video, audio, tabular, or scientific data. The models are unified in that they use a single sequence-processing architecture and training regime, factorizing the joint data distribution autoregressively and optimizing primarily the next-token prediction likelihood. Recent advances have enabled competitive or state-of-the-art results across text, vision, audio, and structured data tasks using a single autoregressive transformer framework (Fan et al., 17 Mar 2025; Wang et al., 5 Aug 2025; Tang et al., 27 Mar 2025; Lu et al., 2023).
1. Architectural Principles and Model Design
Autoregressive unified models integrate text and non-text modalities into a single processing sequence, typically using a decoder-only transformer (LLM, such as Gemma, LLaMA, Qwen, Vicuna) as the backbone. Text tokens are handled with conventional tokenizers (e.g., SentencePiece, BPE), while non-text modalities are converted into sequences of tokens through modality-specific tokenizers:
- Images: Tokenization via vector-quantized autoencoders (VQ-VAE, VQGAN), continuous VAEs, or self-supervised encoders with K-means codebooks (e.g., DiGIT) (Fan et al., 17 Mar 2025; Zhu et al., 16 Oct 2024).
- Video: Sequential autoregressive factorization over spatial and temporal quantized blocks or frames, using hierarchical pyramids or frame-grouped tokenization (Liu et al., 6 Nov 2025; Yuan et al., 11 Jul 2025).
- Audio/Speech: Use of neural audio codecs (e.g., BiCodec) for discretizing waveforms; self-supervised encoders such as WavLM condition the transformer (Yan et al., 23 Oct 2025).
- Scientific/Structured Data: Mixed sequences comprise both discrete (symbolic) and continuous (numeric) tokens, handled with dual output heads (next-token prediction for symbols; a conditional diffusion head for numbers) (Zhang et al., 9 Mar 2025).
Tokens from all modalities are concatenated into a single sequence with special delimiter tokens and are processed autoregressively. Model internals frequently employ shared embedding spaces (with or without linear projections for visual features), rotary positional encodings (1D for text, 2D or 3D for vision/video), and blockwise or causal/masked attention to control receptive field dependencies (Wang et al., 5 Aug 2025; Yuan et al., 11 Jul 2025; Lu et al., 2023).
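The following minimal sketch illustrates this interleaving step, assuming hypothetical special-token IDs and pre-tokenized modality streams; none of the identifiers are drawn from a specific cited model.

```python
# Minimal sketch: concatenating per-modality token streams into one
# autoregressive sequence with delimiter tokens. All IDs are illustrative.
from typing import List, Optional

BOS, BOI, EOI, BOA, EOA = 0, 1, 2, 3, 4  # hypothetical special-token IDs

def build_unified_sequence(text_ids: List[int],
                           image_ids: List[int],
                           audio_ids: Optional[List[int]] = None) -> List[int]:
    """Wrap each modality in delimiters and concatenate into a single
    left-to-right sequence for a decoder-only transformer."""
    seq = [BOS] + text_ids
    seq += [BOI] + image_ids + [EOI]      # e.g. VQ or K-means image tokens
    if audio_ids is not None:
        seq += [BOA] + audio_ids + [EOA]  # e.g. neural-codec audio tokens
    return seq

# Toy usage: a 3-token caption followed by a 4-token image.
print(build_unified_sequence([101, 37, 52], [9001, 9002, 9003, 9004]))
```

In practice the delimiter vocabulary and per-modality tokenizers are model-specific; the sketch only shows the sequence layout that the decoder-only backbone consumes.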
2. Joint Autoregressive Factorization
The fundamental probabilistic formulation is a joint factorization of all output tokens, regardless of modality:

$$p(x_1, \dots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_{<t}),$$

where each $x_t$ may be a text, visual, audio, or structure/action token. For multimodal models, specific orderings or schemes dictate prefix, interleaving, or block-wise grouping of modalities. For example, in text-image generation, image tokens are generated conditioned on the text prefix (Fan et al., 17 Mar 2025):

$$p(\mathbf{y}, \mathbf{v}) = \prod_{i} p(y_i \mid y_{<i}) \prod_{j} p(v_j \mid v_{<j}, \mathbf{y}),$$

with $\mathbf{y}$ the text tokens and $\mathbf{v}$ the image tokens. For unified video generators, this expands over time (clips/frames) and space (multiscale blocks), e.g., in InfinityStar:

$$p(x \mid c) = \prod_{i} \prod_{k} p(x_{i,k} \mid x_{i,<k},\, x_{<i},\, c),$$

where $x_{i,k}$ represents the spatial block tokens at scale $k$ in clip $i$, and $c$ is an optional conditioning signal (e.g., text prompt) (Liu et al., 6 Nov 2025).
In unified scientific models (UniGenX), the sequence comprises both discrete and continuous tokens, each handled by the appropriate model head and objective (Zhang et al., 9 Mar 2025).
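As a concrete illustration of this objective, the following PyTorch sketch computes the next-token negative log-likelihood over a unified token sequence; the model forward pass is omitted and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the factorization
    p(x_1..x_N) = prod_t p(x_t | x_<t), applied uniformly to all modalities.

    logits: (batch, seq_len, vocab) outputs of a decoder-only transformer
    tokens: (batch, seq_len) unified token sequence (text, image, audio, ...)
    """
    # Position t-1 predicts token t: shift logits left and targets right.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)

# Toy usage with random tensors standing in for a real forward pass.
logits = torch.randn(2, 16, 1000)
tokens = torch.randint(0, 1000, (2, 16))
loss = next_token_nll(logits, tokens)
```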
3. Modality Tokenization and Encoding Strategies
Autoregressive unified models require bespoke tokenization strategies for each modality to ensure compatibility with discrete sequence models and vocabulary alignment:
- Discrete VQ or K-means Tokenization: Images are divided into patches, then each patch feature is quantized, either via a learned vector-quantized codebook or K-means clustering over self-supervised features (e.g., DINOv2). This yields index sequences suitable for transformers; a minimal sketch appears at the end of this section (Zhu et al., 16 Oct 2024; Tang et al., 27 Mar 2025).
- Continuous Latent Encoding: Some frameworks (e.g., UniFluid) use a VAE-style encoder to map images into continuous latent tokens, generating these values with a per-token diffusion MLP head, thus sidestepping codebook instability and cumulative quantization error (Fan et al., 17 Mar 2025).
- Semantic/Contrastive Tokens: Visual encoders trained for alignment (CLIP, SigLIP) project images into semantic embeddings, which are then mapped (possibly via adapters or MLPs) to the LLM token space (Wang et al., 5 Aug 2025).
- Video/Spacetime Factoring: To efficiently encode video, residual pyramids or frame-wise grouping is applied, often with special attention masks enforcing intra-frame bidirectionality and inter-frame causality, supported by 3D or blockwise rotary positional embeddings (Yuan et al., 11 Jul 2025; Liu et al., 6 Nov 2025).
- Progressive Vocabulary Learning: To prevent training collapse due to the introduction of a large number of novel visual tokens, techniques such as progressive vocabulary activation are used, incrementally adding new visual IDs throughout training for stable convergence (Tang et al., 27 Mar 2025).
Tokenization and vocabulary strategy directly affect the sequence length, modality balance, and cross-modal transfer.
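A minimal sketch of the K-means tokenization route referenced above, assuming patch features have already been extracted by a frozen self-supervised encoder; feature dimensionality and codebook size are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Offline: fit a codebook over a pool of patch features from a frozen
# self-supervised encoder (e.g. DINOv2); feature extraction is not shown,
# so random vectors stand in for real features here.
rng = np.random.default_rng(0)
patch_features = rng.normal(size=(10_000, 768)).astype(np.float32)
codebook = KMeans(n_clusters=1024, n_init=1, random_state=0).fit(patch_features)

def tokenize_image(patch_feats: np.ndarray) -> np.ndarray:
    """Assign each patch feature to its nearest centroid, producing a
    discrete token index sequence for the autoregressive transformer."""
    return codebook.predict(patch_feats)

# e.g. a 16x16 grid of patches -> 256 discrete visual token indices
image_tokens = tokenize_image(rng.normal(size=(256, 768)).astype(np.float32))
```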
4. Training Objectives, Loss Balancing, and Optimization
Autoregressive unified models use standard maximum likelihood for next-token prediction, usually the cross-entropy loss summed across all tokens in the sequence. For mixed continuous/discrete outputs or hybrid modalities, additional objectives are incorporated:
- Weighted Multitask Losses: Loss components for each modality or task (e.g., text cross-entropy, visual diffusion loss) are combined, with a weight $\lambda$ controlling the trade-off between, for example, image understanding and image generation (Fan et al., 17 Mar 2025):

$$\mathcal{L} = \lambda\, \mathcal{L}_{\text{und}} + (1 - \lambda)\, \mathcal{L}_{\text{gen}}.$$

Tuning $\lambda$ yields explicit control over performance trade-offs (see the sketch after this list).
- Diffusion/Denoising Objectives: For models outputting continuous tokens or high-fidelity images, a conditional diffusion loss is applied per token or per block (Fan et al., 17 Mar 2025; Wang et al., 5 Aug 2025; Zhang et al., 9 Mar 2025). In D-AR, the image generation process itself is linearized into a pure AR chain, allowing diffusion to be modeled purely as next-token prediction over a specially-aligned token sequence (Gao et al., 29 May 2025).
- Guidance, Distillation, and Reward Augmentation: For conditional generation, classifier-free guidance is employed by interpolating between unconditional and conditional logits. Knowledge distillation from frozen vision models (DINOv2, CLIP) aligns AR representations with semantic targets for better instruction adherence (Mu et al., 8 Jan 2025). Reward models may further reweight editing or generation targets in multimodal settings (Wang et al., 5 Aug 2025).
- Curriculum and Order Randomization: Random order generation of image tokens during early training, annealed to raster order, avoids collapse artifacts and improves coverage of local context (Fan et al., 17 Mar 2025).
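The sketch below illustrates two of the ingredients above under simplifying assumptions: a scalar weight λ trading off understanding and generation losses, and classifier-free guidance expressed as a logit interpolation. Neither function is taken from a specific cited implementation.

```python
import torch

def multitask_loss(text_ce: torch.Tensor,
                   image_gen_loss: torch.Tensor,
                   lam: float = 0.5) -> torch.Tensor:
    """Scalar-weighted combination of an understanding loss (text
    cross-entropy) and a generation loss (e.g. a per-token diffusion loss);
    lam trades image understanding against image generation."""
    return lam * text_ce + (1.0 - lam) * image_gen_loss

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance at decoding time: extrapolate from the
    unconditional prediction toward the conditional one."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy usage with scalar stand-ins for the two loss terms.
loss = multitask_loss(torch.tensor(2.1), torch.tensor(0.8), lam=0.7)
```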
5. Applications and Empirical Results
Autoregressive unified models have demonstrated high performance across a spectrum of domains and benchmarks:
- Text-to-Image, Image Generation, and Editing: Models such as UniFluid, UGen, UniPic, EditAR, and VARGPT achieve FID, GenEval, and editing scores that rival or surpass diffusion-based and task-specific baselines, often requiring only a single transformer backbone for text and vision (Fan et al., 17 Mar 2025; Wang et al., 5 Aug 2025; Mu et al., 8 Jan 2025; Tang et al., 27 Mar 2025; Zhuang et al., 21 Jan 2025).
- Visual Understanding: Captioning, VQA, and reasoning metrics (e.g., CapAvg, QAAvg, MMMU, MMBench) are competitive with specialized LLMs or vision transformers. Unified AR training sometimes incurs a small penalty relative to I2T-specific models but preserves transferability and downstream flexibility (Fan et al., 17 Mar 2025; Tang et al., 27 Mar 2025; Zhuang et al., 21 Jan 2025).
- Video Generation: Autoregressive video generators (InfinityStar, Lumos-1) achieve state-of-the-art results at 720p and higher resolutions, matching or surpassing diffusion baselines while offering greatly improved sampling speeds over iterative diffusion pipelines, by using multiscale token blocks and sparse or 3D positional attention (Liu et al., 6 Nov 2025; Yuan et al., 11 Jul 2025).
- Speech, Audio, and Structured Data: In tasks such as joint ASR+attribute estimation or speech restoration/separation, AR unified models outperform or match discriminative and generative baselines, integrating multiple speech tasks in a single inference pass (Yan et al., 23 Oct 2025; Masumura et al., 2021). In scientific data, UniGenX provides unified generation of sequences (formulas, SMILES) and structures (coordinates, energies) via AR+diffusion fusion, outmatching prior SOTA on molecule/material benchmarks (Zhang et al., 9 Mar 2025).
- Multimodal Integration: Unified-IO 2 extends AR modeling to audio, vision, language, robotics, and sparse annotation with robust scaling and stability enhancements, reporting high average scores across >35 benchmarks (Lu et al., 2023).
- Text-to-SQL and Financial Modeling: Extension to structured data, such as UniSAr for text-to-SQL parsing and unified Bayesian AR risk models, demonstrates generalization to classical sequence and regression domains (Dou et al., 2022; Bottone et al., 2019).
A sample performance table:
| Model | Text Acc. | Img Und. | Img Gen. | Video | Apps |
|---|---|---|---|---|---|
| UGen | 46.1 | 66.0 | 52.1 | — | Text, VQA, T2I |
| UniFluid | — | 96% I2T | FID 7.2 | — | T2I, VQA, Edit |
| UniPic | — | — | 0.86 GenEval | — | T2I, Edit |
| InfinityStar | — | — | — | 83.74 (VBench) | T2I, T2V, I2V |
| Unified-IO 2 | 75.2 Cat | 71.1 VQA | FID 13.4 | — | VQA, Audio, Rob. |
Metrics are reported as in the respective papers (Acc., FID↓, GenEval↑, VBench), illustrating competitive results for unified AR models (Fan et al., 17 Mar 2025; Wang et al., 5 Aug 2025; Lu et al., 2023).
6. Current Challenges and Future Directions
Despite empirical progress, several challenges persist:
- Sequence Length and Quadratic Scaling: High-resolution images and videos produce thousands of tokens, taxing transformer memory and computation. Proposed solutions include next-patch/block prediction, sparse or hierarchical attention, and blockwise AR grouping (see the mask sketch after this list) (Zhang et al., 5 May 2025; Liu et al., 6 Nov 2025).
- Modality Imbalance: Discrete visual tokens may greatly outnumber text tokens, biasing training and model capacity toward vision tasks. Dynamic token weighting and progressive curriculum strategies mitigate this effect (Tang et al., 27 Mar 2025).
- Tokenization Trade-offs: Discrete codebooks must balance generative stability (low quantization error, suppressed token flip rate) against semantic information content; continuous tokens improve generation but complicate downstream tasks (Zhu et al., 16 Oct 2024; Fan et al., 17 Mar 2025).
- Hybrid AR+Diffusion Approaches: Hybrid models, blending AR LLMs with diffusion/denoising heads, offer improved numerical precision and fidelity at the cost of additional architectural complexity and hyperparameter tuning (Fan et al., 17 Mar 2025; Zhang et al., 9 Mar 2025).
- Integration of New Modalities: Audio, dense annotations, robotics, and actions require carefully designed tokenization and prompt engineering to fully exploit AR model benefits (Lu et al., 2023).
- Unified Benchmarking and End-to-End Cycles: Evaluations often isolate understanding or generation, but comprehensive "reasoning→generation" or "dialogue+edit" cycles remain underexplored for AR unified models (Zhang et al., 5 May 2025).
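As an illustration of the blockwise grouping mentioned above, the sketch below builds an attention mask that is bidirectional within a block (e.g., a frame or spatial scale) and causal across blocks; the block size and layout are assumptions rather than a specific model's configuration.

```python
import torch

def block_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): bidirectional within a
    block, causal across blocks (later blocks cannot be attended to)."""
    n = num_blocks * block_size
    block_idx = torch.arange(n) // block_size       # block id of each position
    # query i may attend to key j iff j's block is not later than i's block
    return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

mask = block_causal_mask(num_blocks=3, block_size=4)  # (12, 12) boolean mask
```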
7. Significance and Outlook
Autoregressive unified models constitute a conceptually elegant and empirically robust class of architectures, capable of spanning natural language, vision, audio, scientific, and structured data tasks within a single flexible transformer framework. Their principal advantages are simplicity of the inference paradigm (left-to-right sequence prediction), seamless cross-modal integration, and consolidated parameter/compute footprints, with domain performance approaching or exceeding state-of-the-art specialized or diffusion-based competitors as models and data scale (Fan et al., 17 Mar 2025; Wang et al., 5 Aug 2025; Lu et al., 2023).
Research continues to improve tokenization strategies, cross-modal alignment, sequence efficiency, and domain-generalization abilities. Future prospects include scaling up to additional modalities (e.g., haptics, 3D), further architectural simplification (e.g., decoder-only stacks for all tasks), and deeper integration of reinforcement/reward-based objectives. The consistent trend is toward a universal AR modeling backbone, trained end-to-end, that naturally unifies generation and understanding across all known data types (Zhang et al., 5 May 2025).