
Unified Decoder-Only Models

Updated 3 December 2025
  • Unified decoder-only models are autoregressive transformer architectures that integrate multiple modalities using a single causally-masked stack, replacing traditional encoder-decoder designs.
  • They employ modality-specific tokenizers and adapters to align text, vision, speech, and other inputs into a unified representation for joint training and efficiency.
  • Recent benchmarks indicate these models match or surpass state-of-the-art performance in vision-language, speech processing, and time-series tasks, highlighting their broad applicability.

Unified decoder-only models implement a single autoregressive transformer-based architecture across disparate modalities and tasks, replacing the historical encoder-decoder separation with a shared, causally-masked transformer stack. This universalization permits end-to-end alignment and joint modeling of text, images, speech, video, trajectories, and scientific time series, leveraging tokenization, embedding, and flexible objective design. Recent work demonstrates that decoder-only architectures now match or surpass traditional multi-module stacks for vision-language, speech processing, audio enhancement, multimodal understanding, scientific simulation, and sequence forecasting.

1. Core Architectural Principles

Unified decoder-only models use an autoregressive transformer with causal attention masking to ensure strictly left-to-right information flow, typically parameterized as stacked layers of multi-head self-attention and feedforward submodules. A shared input embedding matrix covers all token types (text, vision, speech, and other modalities), possibly augmented with adapter layers or mixture-of-experts (MoE) modules for modality specialization.
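
To make the shared-stack idea concrete, the following is a minimal PyTorch sketch of a causally-masked decoder with a single embedding table over a merged multimodal vocabulary; the layer sizes, names, and the omission of positional encodings are simplifications, not details of any cited model.

```python
import torch
import torch.nn as nn

class UnifiedDecoder(nn.Module):
    """Single causally-masked stack over a merged multimodal vocabulary (sketch)."""

    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # One embedding table covers text tokens plus discrete tokens from other
        # modalities (e.g. codec audio tokens or quantized image tokens).
        # Positional encodings are omitted for brevity.
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        # A decoder-only model is simply an encoder stack run under a causal mask.
        self.layers = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):                     # token_ids: (B, T)
        T = token_ids.size(1)
        causal = torch.triu(                          # upper-triangular -inf mask
            torch.full((T, T), float("-inf"), device=token_ids.device), diagonal=1)
        h = self.layers(self.embed(token_ids), mask=causal)
        return self.lm_head(h)                        # (B, T, vocab) next-token logits
```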

For vision-text models (MUDAIF, OneCAT), modality-specific adapters (e.g., Vision-Token Adapter in MUDAIF) or lightweight convolutional patch embedders (OneCAT) convert raw images to pseudo-text tokens, which are appended or interleaved with text tokens. In speech and audio (UniSE, VoxtLM), discrete neural audio codec tokens or self-supervised speech embeddings are incorporated into a shared vocabulary and sequence. For cross-modal or temporal series (DONUT, Visatronic), sequences of structured tokens from each modality are concatenated and attended with task-specific positional encodings.
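
As an illustration of the adapter pattern, the sketch below projects image patches directly into the decoder's embedding space with a strided convolution and concatenates the result with text embeddings; it follows the general spirit of OneCAT's convolutional patch embedder and MUDAIF's Vision-Token Adapter rather than reproducing either implementation.

```python
import torch
import torch.nn as nn

class PatchToTokenAdapter(nn.Module):
    """Projects raw images into the decoder's embedding space (sketch)."""

    def __init__(self, d_model=512, patch_size=16, in_channels=3):
        super().__init__()
        # A strided convolution splits the image into non-overlapping patches and
        # maps each patch to one "pseudo-text" embedding.
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.proj(images)                    # (B, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)      # (B, num_patches, d_model)

# The resulting vision embeddings are concatenated (or interleaved) with text
# embeddings so the shared decoder attends over one merged stream, e.g.:
#   fused = torch.cat([vision_embeds, text_embeds], dim=1)
```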

Some architectures introduce further specialization within the decoder. OneCAT, for example, uses a hard-gated modality-specific MoE within each decoder layer, while maintaining parameter sharing through shared attention (Li et al., 3 Sep 2025). The MUDAIF model employs an adaptive co-attention mechanism for bi-directional fusion between tokens in different modalities (Tanaka et al., 14 Dec 2024).
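
A hard-gated, modality-routed feed-forward block of this kind might look like the sketch below (an assumed structure, not OneCAT's actual code): attention is shared upstream, and each token's FFN expert is selected deterministically by its modality id.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Hard-gated feed-forward experts keyed by modality id (sketch).

    Self-attention elsewhere in the layer stays shared across modalities;
    only the FFN is specialized (e.g. expert 0 for text, expert 1 for vision).
    """

    def __init__(self, d_model=512, d_ff=2048, num_modalities=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_modalities)
        ])

    def forward(self, hidden, modality_ids):     # hidden: (B, T, D), modality_ids: (B, T)
        out = torch.zeros_like(hidden)
        for m, expert in enumerate(self.experts):
            routed = modality_ids == m           # each token goes to exactly one expert
            if routed.any():
                out[routed] = expert(hidden[routed])
        return out
```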

2. Training Objectives and Unified Losses

Unified decoder-only models are trained primarily via standard next-token autoregressive cross-entropy over the full concatenated token sequence:

\mathcal{L}_{\mathrm{AR}} = -\sum_{i=1}^{T} \log p(x_i \mid x_{<i})

where x_i ranges over all tokens (text, vision, speech, etc.). For tasks requiring multitask or conditional outputs, task tokens or instruction prefixes disambiguate the intended generation (UniSE uses a unique task token for each of speech restoration, target speaker extraction, and speech separation; VoxtLM employs <generate-speech> and <generate-text> markers) (Yan et al., 23 Oct 2025, Maiti et al., 2023).
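
A minimal sketch of this shared objective, assuming the task prefix, conditioning tokens, and targets have already been packed into one id sequence over the shared vocabulary:

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(model, token_ids, ignore_index=-100):
    """Next-token cross-entropy over one concatenated multimodal sequence (sketch).

    `token_ids` is assumed to already contain the task/instruction prefix, the
    conditioning tokens (e.g. noisy-speech codec tokens or vision tokens), and
    the target tokens, all drawn from the shared vocabulary. In practice the
    prefix positions would usually be masked out of the labels via `ignore_index`.
    """
    logits = model(token_ids)                               # (B, T, vocab)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))      # position i predicts token i+1
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target, ignore_index=ignore_index)
```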

When tokenization introduces label noise or multimodality (as with discrete speech tokens), objectives are sometimes regularized with KL divergence against smoothed labels (SLD) or label smoothing (Chen et al., 2023). For trajectory prediction and time-series (DONUT, PDE decoders), domain-specific negative log-likelihoods are used—e.g., Laplace or von Mises mixtures for trajectory modes (Knoche et al., 7 Jun 2025), and MSE or normalized RMSE for scientific simulation (García-de-Herreros et al., 6 Oct 2025).
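
As an illustration of the smoothed-label idea, a generic objective can be written as a KL divergence against a softened one-hot target; the exact SLD formulation in the cited work may differ.

```python
import torch
import torch.nn.functional as F

def smoothed_label_loss(logits, targets, smoothing=0.1):
    """KL divergence from a softened one-hot target to the model distribution (sketch).

    logits: (N, vocab), targets: (N,) integer token ids. This is generic label
    smoothing; the cited SLD objective may construct the reference differently.
    """
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Put (1 - smoothing) mass on the reference token, spread the rest uniformly.
    smooth = torch.full_like(log_probs, smoothing / (vocab - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
    return F.kl_div(log_probs, smooth, reduction="batchmean")
```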

No explicit contrastive or image–text matching losses are required for cross-modal fusion in MUDAIF or OneCAT; alignment is achieved through direct joint autoregressive modeling (Tanaka et al., 14 Dec 2024, Li et al., 3 Sep 2025).

3. Multimodal Integration and Cross-Modal Attention

Unified decoder-only models achieve cross-modal fusion by co-embedding all tokens and leveraging the full transformer stack for information propagation. Integration patterns include:

  • Vision-Language: MUDAIF’s adaptive co-attention operates in both directions between vision and text, with fusion outputs added into the residual stream at each decoder layer (Tanaka et al., 14 Dec 2024). OneCAT routes tokens through expert FFNs while keeping shared attention across the merged token stream (Li et al., 3 Sep 2025).
  • Speech-Text: VoxtLM, Visatronic, and other models tokenize both modalities and concatenate their streams, often with careful alignment or interleaving strategies to ensure temporal and semantic coherence (Gupta et al., 26 Nov 2024, Maiti et al., 2023).
  • Video–Text–Speech: Visatronic (VTTS) token streams can be ordered in different ways to probe context propagation (e.g., all video then text, text then video, or streaming alignment), showing that joint modeling of temporally aligned tokens yields superior phoneme-level timing and recognition (Gupta et al., 26 Nov 2024).
  • Scientific/Trajectory Prediction: All input (history, context, scene, etc.) and autoregressively generated future tokens are represented uniformly: DONUT attends simultaneously to previous trajectory segments and environmental (road, lane, other agents) features (Knoche et al., 7 Jun 2025). For time-series, parallel flipping and sequence doubling introduce effective bidirectionality without modifying model code (García-de-Herreros et al., 6 Oct 2025); a sketch of both tricks follows this list.
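
The bidirectionality tricks referenced in the last item can be expressed as pure input-side transformations. The functions below are a guess at what "sequence doubling" and "parallel flipping" might look like; the cited paper's exact recipes may differ.

```python
import torch

def sequence_double(tokens):
    """Append a time-reversed copy of the sequence (guess at 'sequence doubling').

    Under a causal mask, every position in the second half can attend to all of
    the original timesteps, approximating bidirectional context without any
    change to the decoder. tokens: (B, T, ...) -> (B, 2T, ...).
    """
    return torch.cat([tokens, torch.flip(tokens, dims=[1])], dim=1)

def parallel_flip(tokens):
    """Run the original and flipped sequences as parallel batch items
    (guess at 'parallel flipping'); the two directional predictions can be
    merged downstream. tokens: (B, T, ...) -> (2B, T, ...).
    """
    return torch.cat([tokens, torch.flip(tokens, dims=[1])], dim=0)
```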

4. Scalability, Efficiency, and Parameter Sharing

Unified decoder-only architectures offer consistent efficiency and scaling advantages by eliminating modality-specific encoders, projectors, or auxiliary modules. For instance:

  • MUDAIF trains ≈20% faster and delivers ≈1.5× higher inference throughput at high resolutions compared to dual-encoder vision-language architectures (Tanaka et al., 14 Dec 2024).
  • OneCAT reduces time-to-first-token by 50–60% and achieves 5–10× faster text-to-image generation versus diffusion or multi-module stacks, especially at high resolutions (Li et al., 3 Sep 2025).
  • Parameter-efficient variants (UniSE at 63M params vs. >1B for discriminative baselines) perform comparably across speech restoration, separation, and extraction (Yan et al., 23 Oct 2025).
  • Model variants in the scientific domain close most of the encoder–decoder performance gap via parallel flipping or sequence doubling, with no new parameters and modest computational overhead (García-de-Herreros et al., 6 Oct 2025).

A single transformer stack suffices for all modalities and tasks; only the embedding/adaptation layer or a lightweight MoE differentiates token branches. This simplifies parameter alignment, reduces FLOPs, and enables unified scaling across domains.

5. Evaluation, Benchmarks, and Task Generalization

Unified decoder-only models now set or match the state of the art across a broad spectrum of benchmarks:

| Task/Benchmark | Model | Metric / Value(s) |
|---|---|---|
| Vision–Language (VQA) | MUDAIF | VQA-v2 Acc.: 80.3 (vs. SOTA ≤78.7) |
| Image Captioning | MUDAIF | BLEU: 0.78 (vs. SOTA ≤0.74) |
| Multimodal Gen+Edit | OneCAT | GenEval: 0.90; ImgEdit-Bench: 3.43 |
| Speech Restoration | UniSE | DNSMOS OVRL: 3.40 (vs. SOTA ≤3.33) |
| Trajectory Forecast | DONUT | b-minFDE_6: 1.79 m (SOTA) |
| PDE Time-series | Decoder-only + sequence doubling | nRMSE (Advection): 0.024 (vs. 0.022 for RoBERTa, 0.078 for GPT-2 baseline) |

Task unification consistently yields performance improvements relative to single-task models: e.g., VoxtLM improves speech synthesis CER from 28.9 to 5.6 and MOS from 2.68 to 3.90 in multi-task (speech+text) training (Maiti et al., 2023), while unified models in vision-language and speech enhancement maintain or exceed SOTA regardless of which modality or task is active at inference.

Ablation studies demonstrate that critical model components (e.g., vision-token adapters, co-attention, MoE routing, token mixing strategies) are essential for optimal performance and cross-modal generalization (Tanaka et al., 14 Dec 2024, Li et al., 3 Sep 2025, Gupta et al., 26 Nov 2024).

6. Methodological Innovations and Domain Adaptation

Key innovations in unified decoder-only modeling include:

  • Vision-Token Adapter, Adaptive Co-Attention (MUDAIF): Directly fuses image and text without pre-processing encoders, achieving both alignment and computational efficiency (Tanaka et al., 14 Dec 2024).
  • PatchEmbedding and Modality-specific MoE (OneCAT): Bypasses tokenization overhead and dynamically routes computation for each token, enabling efficient support for high-resolution and dynamic input sizes (Li et al., 3 Sep 2025).
  • Parallel Flipping / Sequence Doubling (PDE adaptation): Simulates bidirectional context in causally-masked transformers, closing much of the gap with encoder-based models for time-dependent scientific problems (García-de-Herreros et al., 6 Oct 2025).
  • Unified AR Multitask Learning (VoxtLM, UniSE): Simple token- or instruction-based conditioning in the context stream, reusing the same model weights for varied tasks and target modalities (Maiti et al., 2023, Yan et al., 23 Oct 2025).
  • Content-aware Embedding (Causal2Vec): Appends dynamically constructed context tokens to enable strong embedding performance without changing the core LLM attention mask (Lin et al., 31 Jul 2025).

These strategies are domain-agnostic and offer direct generalization opportunities—for instance, DONUT’s overprediction technique can be incorporated into any autoregressive domain predictor to encourage long-range prospection (Knoche et al., 7 Jun 2025), and task-token-based unification extends to new tasks by simple vocabulary augmentation (UniSE) (Yan et al., 23 Oct 2025).
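
As a concrete, hypothetical example of task-token vocabulary augmentation, new task tokens can be registered and the shared embedding resized without touching the decoder stack; the token names and backbone checkpoint below are placeholders, not those used by UniSE or VoxtLM.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical task tokens in the spirit of UniSE / VoxtLM; names are illustrative.
TASK_TOKENS = ["<restore-speech>", "<extract-speaker>", "<separate-speech>"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the new task tokens and grow the shared embedding / output head.
tokenizer.add_special_tokens({"additional_special_tokens": TASK_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# A new task is then invoked by prefixing its token to the concatenated stream,
# e.g. "<restore-speech>" followed by the noisy-speech codec tokens.
```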

7. Limitations and Outlook

Current unified decoder-only models face several limitations. In speech separation, scalability to mixtures with more than two speakers is not yet demonstrated (UniSE) (Yan et al., 23 Oct 2025). For scientific time-series, encoder-only architectures still show marginally better absolute errors unless bidirectionality is artificially restored (García-de-Herreros et al., 6 Oct 2025). Step-by-step autoregressive generation can limit throughput compared to parallel or discriminative models in some domains. Performance is also contingent on the quality of upstream tokenizers (e.g., neural audio codecs for speech), and some architectures remain heavily reliant on careful token mixing or alignment.

A plausible implication is that future work will focus on integrating more complex cross-modal fusion mechanisms, refining multimodal tokenization strategies, and identifying low-overhead methods to simulate bidirectional contextualization. Continued scaling, modular specialization (e.g., adaptive routing in MoE), and broad pretraining of unified decoder-only architectures are likely to drive further gains in efficiency, generality, and robustness across modalities.


References: MUDAIF (Tanaka et al., 14 Dec 2024), UniSE (Yan et al., 23 Oct 2025), DONUT (Knoche et al., 7 Jun 2025), Causal2Vec (Lin et al., 31 Jul 2025), Visatronic (Gupta et al., 26 Nov 2024), VoxtLM (Maiti et al., 2023), OneCAT (Li et al., 3 Sep 2025), PDE decoders (García-de-Herreros et al., 6 Oct 2025), SLD (Chen et al., 2023).
