
OmniLLM: Unified Multimodal Language Model

Updated 10 February 2026
  • OmniLLM is a unified, transformer-based model that processes multi-modal token streams across text, vision, audio, and speech.
  • It employs a single shared decoder and joint tokenization with explicit modality markers, eliminating the need for separate expert modules.
  • Empirical evaluations demonstrate competitive performance on image captioning, ASR, and text-to-image tasks, though a fidelity gap to diffusion-based image generation remains.

OmniLLM models are a unifying class of large-scale, transformer-based multi-modal language architectures designed to process and generate arbitrary sequences involving text, vision, audio, and speech by operating directly on interleaved, modality-marked token streams. These systems offer a direct route to "any-to-any" generation—predicting outputs in one or more modalities conditioned on any combination of inputs—within a single, unified decoder without dependence on auxiliary expert modules, separate diffusion engines, or cascaded fusion pipelines (Cheng et al., 25 Jan 2026). The paradigm is motivated by real-world perception, which is inherently multi-sensory and sequential, and by the need to enable agents capable of both interacting across modalities and proactively reasoning over dynamic, streaming inputs (Wang et al., 29 Mar 2025, Jiang et al., 2024).

1. Definition and Fundamental Principles

The OmniLLM paradigm generalizes the classic LLM architecture to operate over multimodal token sequences. Formally, an OmniLLM extends a pre-trained LLM backbone to accept arbitrary combinations of modalities $\mathcal{X}_i$, each processed by an encoder $f_i$ and aligned (via $g_i$) into a shared embedding space suitable for a transformer backbone $F$ (Jiang et al., 2024). This setup accommodates discrete or continuous representations:

$$\mathcal{M}:\; \{\mathcal{X}_i\}_{i=1}^M \xrightarrow{\{f_i\}} \{\mathbf{F}_i\} \xrightarrow{\{g_i\}} \{\mathbf{E}_i\} \xrightarrow{F} \mathbf{H} \xrightarrow{\{\pi_i\}} \{\widehat{\mathcal{X}}_i\}$$
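As an illustration of this pipeline, the following sketch implements the mapping $\mathcal{M}$ with stand-in components: per-modality encoders ($f_i$), linear aligners into a shared embedding space ($g_i$), a shared transformer backbone ($F$), and per-modality output heads ($\pi_i$). All module choices and dimensions are assumptions for exposition, not the architecture of any published system.

```python
# Minimal sketch of the generic OmniLLM mapping M (illustrative components only).
import torch
import torch.nn as nn

class OmniLLMSketch(nn.Module):
    def __init__(self, d_model=512, vocab_sizes=None):
        super().__init__()
        vocab_sizes = vocab_sizes or {"text": 32000, "image": 8192, "speech": 4096}
        # f_i: per-modality encoders producing features F_i (embeddings as stand-ins)
        self.encoders = nn.ModuleDict({m: nn.Embedding(v, d_model) for m, v in vocab_sizes.items()})
        # g_i: aligners projecting features into the shared embedding space E_i
        self.aligners = nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in vocab_sizes})
        # F: shared transformer backbone producing hidden states H (causal masking omitted for brevity)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # pi_i: per-modality heads mapping H back to token predictions
        self.heads = nn.ModuleDict({m: nn.Linear(d_model, v) for m, v in vocab_sizes.items()})

    def forward(self, tokens, modality):
        e = self.aligners[modality](self.encoders[modality](tokens))  # E_i
        h = self.backbone(e)                                          # H
        return self.heads[modality](h)                                # predictions over V_i
```

In the unified-vocabulary variant described in Section 2, the separate heads collapse into a single softmax over the joint vocabulary $\mathcal{V}$; they are kept separate here only to mirror the formal decomposition above.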

Unlike earlier MLLMs designed for narrow cross-modal tasks, OmniLLMs model all modalities as foreign languages, enabling arbitrary interleaving and joint understanding/generation, without privileging any single input/output format (Jiang et al., 2024). The model operates over a unified token vocabulary,

$$\mathcal{V} = \mathcal{V}_{\mathrm{text}} \cup \mathcal{V}_{\mathrm{image}} \cup \mathcal{V}_{\mathrm{speech}}$$

with special tokens marking modality boundaries, enabling causal, autoregressive modeling of the joint token stream (Cheng et al., 25 Jan 2026).

2. Model Architecture and Unified Tokenization

2.1 Tokenization and Modality Demarcation

OmniLLMs rely on a joint vocabulary produced by aggregating discrete codes from specialized tokenizers:

  • Text: Typically SentencePiece or BPE units for all languages.
  • Images: Discrete VQ codes (e.g., scene-aware VQ tokenizers) flattened into 1D sequences.
  • Speech/Audio: Single-codebook acoustic tokenizers (e.g., WavTokenizer) optimized for low latency and minimal token rates.

Within any input/output sequence, each modality span is delimited by explicit boundary tokens (e.g., <boi>/<eoi> for images; <boa>/<eoa> for audio/speech), which signal modality transitions and enable fine-grained stream interleaving (Cheng et al., 25 Jan 2026).
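As a concrete illustration, the sketch below builds a joint vocabulary by offsetting each tokenizer's code space and wraps image and speech spans in boundary tokens. The vocabulary sizes, offsets, and marker ids are assumptions chosen for readability, not the published configuration.

```python
# Sketch: joint vocabulary via per-modality id offsets plus explicit boundary tokens.
# All sizes and marker ids below are illustrative assumptions.
TEXT_VOCAB, IMAGE_VOCAB, SPEECH_VOCAB = 32000, 8192, 4096

SPECIALS = {"<boi>": 0, "<eoi>": 1, "<boa>": 2, "<eoa>": 3}
TEXT_OFFSET = len(SPECIALS)
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
SPEECH_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
UNIFIED_VOCAB_SIZE = SPEECH_OFFSET + SPEECH_VOCAB  # |V| = specials + |V_text| + |V_image| + |V_speech|

def to_unified_stream(text_ids, image_ids, speech_ids):
    """Interleave modality-marked spans into one causal token stream."""
    stream = [TEXT_OFFSET + t for t in text_ids]  # plain text span
    stream += [SPECIALS["<boi>"]] + [IMAGE_OFFSET + v for v in image_ids] + [SPECIALS["<eoi>"]]
    stream += [SPECIALS["<boa>"]] + [SPEECH_OFFSET + a for a in speech_ids] + [SPECIALS["<eoa>"]]
    return stream

# Example: a short caption, four VQ image codes, and three acoustic codes in one stream.
print(to_unified_stream([12, 7], [101, 55, 9, 3], [2048, 17, 4]))
```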

2.2 Transformer Backbone and Autoregressive Decoding

Recent instantiations (e.g., AR-Omni) are built around a single transformer decoder, typically with 7B+ parameters, that accepts and emits any tokenized sequence, with parameters shared across all modalities. Embedding layers and position encodings are likewise shared; normalization strategies such as residual post-norm ("swin-norm") are often used to promote stability across heterogeneous input streams (Cheng et al., 25 Jan 2026).

Causal modeling yields the standard autoregressive factorization

$$p_\theta(x) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t})$$

over the interleaved, modality-delineated tokens $x_t$ (Cheng et al., 25 Jan 2026).
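A minimal sketch of this factorization, computing the teacher-forced log-likelihood of one interleaved stream (PyTorch; shapes and the toy example are illustrative):

```python
# Sketch: chain-rule log-likelihood of an interleaved, modality-delineated token stream.
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, tokens):
    """Return log p_theta(x) = sum_t log p_theta(x_t | x_{<t}).

    logits: (T, |V|), where logits[t] is the next-token distribution after reading x_{<t}
    tokens: (T,) unified-vocabulary ids x_1, ..., x_T
    """
    log_probs = F.log_softmax(logits, dim=-1)             # (T, |V|)
    per_token = log_probs.gather(1, tokens.unsqueeze(1))  # log p(x_t | x_{<t}) for each t
    return per_token.sum()

# Toy example over a small unified vocabulary.
T, V = 6, 50
logits = torch.randn(T, V)
tokens = torch.randint(0, V, (T,))
print(sequence_log_prob(logits, tokens).item())
```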

3. Training Objectives and Optimization

Three principal objectives are typically jointly optimized in state-of-the-art implementations:

  1. Next-token Cross-Entropy: Standard autoregressive loss over the entire output stream,

$$\mathcal{L}_{\mathrm{CE}} = - \sum_{t=1}^T \log p_\theta(x_t \mid x_{<t})$$

  2. Task-aware Loss Reweighting: To address severe imbalances in token counts per modality (e.g., hundreds of audio tokens per utterance vs. fewer than 32 per caption), per-modality weights $w_m = f(N_m)$ are introduced:

$$\mathcal{L}_{\text{reweighted}} = \sum_{m \in \{\text{text},\,\text{image},\,\text{speech}\}} w_m \mathcal{L}_m$$

with $w_m$ scaled according to output length to prevent dominance by long-sequence modalities (Cheng et al., 25 Jan 2026).

  3. Auxiliary Perceptual Losses: In image generation, a lightweight perceptual alignment loss is optimized to ensure output tokens are structurally and semantically consistent with ground-truth image codes (token-level alignment in a projected space) (Cheng et al., 25 Jan 2026).
  4. Compound Objective: Total loss is expressed as

$$\mathcal{L} = \mathcal{L}_{\text{reweighted}} + \lambda_{\mathrm{PA}} \mathcal{L}_{\mathrm{PA}}$$

with $\lambda_{\mathrm{PA}}$ small to regularize rather than dominate (Cheng et al., 25 Jan 2026).
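The sketch below combines the reweighted per-modality cross-entropy terms with the perceptual alignment term. The inverse-length choice of $w_m$ and the cosine-based perceptual loss are illustrative placeholders for the unspecified functions $f(N_m)$ and $\mathcal{L}_{\mathrm{PA}}$, not the exact forms used in AR-Omni.

```python
# Sketch: compound objective L = L_reweighted + lambda_PA * L_PA (illustrative weighting).
import torch
import torch.nn.functional as F

def compound_loss(logits_by_mod, targets_by_mod, pred_img_emb=None, gt_img_emb=None, lambda_pa=0.05):
    total = torch.tensor(0.0)
    for m, logits in logits_by_mod.items():                      # m in {"text", "image", "speech"}
        targets = targets_by_mod[m]
        l_m = F.cross_entropy(logits, targets, reduction="sum")  # summed next-token CE for modality m
        w_m = 1.0 / max(targets.numel(), 1)                      # w_m = f(N_m): down-weight long streams
        total = total + w_m * l_m
    if pred_img_emb is not None:                                 # auxiliary perceptual alignment (images only)
        l_pa = 1.0 - F.cosine_similarity(pred_img_emb, gt_img_emb, dim=-1).mean()
        total = total + lambda_pa * l_pa
    return total
```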

Empirically, these objectives ensure competitive quality and stable convergence across modalities in unified training.

4. Decoding Policy and Modality-Dependent Generation

A notable challenge for OmniLLMs is maintaining suitable output characteristics for both deterministic tasks (e.g., ASR, TTS) and open-ended tasks (e.g., dialogue, image generation). This is addressed with a finite-state-machine decoding framework in which the decoding strategy is switched dynamically based on modality markers:

  • ASR/TTS: Greedy decoding for maximal stability and determinism.
  • Open-Ended (text/image): Sampling-based decoding (top-$k$, nucleus) to enable creative variation.

Transitions are triggered by emission of special tokens marking the end of user input, model responses, or media boundaries. This mechanism maintains high fidelity in deterministic tasks and creative diversity in generative ones (Cheng et al., 25 Jan 2026).
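A minimal sketch of such a marker-driven policy switch is shown below; the state names, marker strings, and selection logic are assumptions for illustration rather than the AR-Omni interface.

```python
# Sketch: finite-state decoding policy switched by modality/boundary markers.
import torch

GREEDY_STATES = {"asr", "tts"}  # deterministic tasks use greedy decoding
# Illustrative transitions triggered by emitted special tokens.
STATE_OF_MARKER = {"<boa>": "tts", "<eoa>": "chat", "<boi>": "image", "<eoi>": "chat"}

def select_next_token(logits, state, top_k=50):
    """Greedy in deterministic states, top-k sampling in open-ended ones."""
    if state in GREEDY_STATES:
        return int(torch.argmax(logits))
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    return int(topk_idx[torch.multinomial(probs, 1)])

def update_state(state, emitted_token_str):
    """Transition the state machine when a boundary token is emitted."""
    return STATE_OF_MARKER.get(emitted_token_str, state)
```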

5. Empirical Evaluation and Performance Characteristics

Benchmarks

The AR-Omni model achieves strong results across standard multi-modal benchmarks, including:

  • Image Captioning: MS-COCO Karpathy split, CIDEr = 56.53.
  • Text-to-Image Generation: CLIP-score = 0.24 on 30K prompts.
  • Automatic Speech Recognition: LibriSpeech test-clean, WER = 9.4% at 40 tokens/sec.
  • Zero-shot TTS: VCTK, WER (Whisper transcript) = 6.5%, first-token latency (FTL) = 146 ms, real-time factor (RTF) = 0.88 (faster-than-real-time).

By generating speech at 40 tokens/sec and beginning emission as soon as the first token is available, the model achieves sub-200 ms latency and RTF < 1 without additional codebooks or expert modules (Cheng et al., 25 Jan 2026).
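As a rough consistency check (under the assumption that the 40 tokens/sec figure is the acoustic token rate per second of synthesized audio), the reported real-time factor implies a wall-clock generation rate of roughly

$$\mathrm{RTF} = \frac{T_{\mathrm{gen}}}{T_{\mathrm{audio}}} = \frac{N_{\mathrm{tok}}/r_{\mathrm{wall}}}{N_{\mathrm{tok}}/40} = \frac{40}{r_{\mathrm{wall}}} \;\Rightarrow\; r_{\mathrm{wall}} = \frac{40}{0.88} \approx 45\ \text{tokens/sec},$$

so each second of audio is produced in about 0.88 s of compute, and streaming emission after the first token keeps the measured FTL (146 ms) well below the utterance duration.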

Observations and Limitations

  • Unified pipeline advantages: A pure OmniLLM with a single vocabulary and decoder eliminates the need for external diffusion decoders or speech synthesizers, simplifying both training and deployment.
  • Image quality gap: Autoregressive approaches for image generation, while unified, still lag behind diffusion-based methods in visual fidelity, especially on complex creative prompts.
  • Future directions: Research directions include integrating hybrid AR-diffusion decoders, generalizing tokenization and alignment to video, 3D, and other modalities, and developing adaptive loss balancing as new modalities are introduced (Cheng et al., 25 Jan 2026).

6. Broader Context, Evaluation Protocols, and Research Directions

Definition in Benchmark Context and Streaming

The OmniMMI benchmark defines an OmniLLM as a model incorporating encoders for video, audio, and speech that processes multi-modal streams incrementally (streaming) and proactively initiates or interrupts responses, maintaining an “agentic” first-person perspective and a KV-cache for memory, without access to future context (Wang et al., 29 Mar 2025). The ability to perform streaming, proactive reasoning (e.g., real-time event detection, turn-taking, and action planning) is thus a defining requirement.
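A minimal sketch of this streaming, proactive setting is given below; model.step, response_probability, and decode_response are hypothetical helpers, since OmniMMI specifies the required behaviour rather than an interface.

```python
# Sketch: incremental, proactive inference over a multi-modal stream with a KV-cache.
# model.step, response_probability, and decode_response are hypothetical helpers.
def streaming_agent(model, token_stream, threshold=0.5):
    cache = None                                   # KV-cache holding all past context (memory)
    for token in token_stream:                     # incremental input; no access to future tokens
        logits, cache = model.step(token, cache)   # one decoding step, cache extended
        # Proactive behaviour: at every step, decide whether to initiate (or interrupt with)
        # a response, e.g. when a begin-response marker becomes sufficiently probable.
        if response_probability(logits) > threshold:
            yield decode_response(model, cache)    # stream the answer, then resume listening
```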

Core Challenges and Open Problems

  • Modality imbalance and catastrophic forgetting: Balancing learning across modalities remains nontrivial; naive mixing can degrade uni-modal task performance or cause forgetting during joint training (Jiang et al., 2024).
  • Token efficiency and computational cost: Long multi-modal sequences, especially from audio and video, can saturate the model’s context window and inflate FLOPs, motivating research on parameter-efficient token compression and adaptive balancing (Cheng et al., 25 Jan 2026, Jiang et al., 2024).
  • Temporal and grounding limitations: Pure causal models may struggle with long-range dependencies and grounding across multi-modal, temporally extended contexts (Jiang et al., 2024).
  • Quality gap to modality-specific experts: There remains a gap compared to highly optimized, modality-specific models on certain pure uni-modal or high-resolution tasks (Jiang et al., 2024).

Canonical Research Programs

OmniLLMs, typified by the AR-Omni model, are foundational for ongoing investigations into:

  • Agentic multi-modal systems: Streaming, goal-driven perception and action.
  • Scalable instruction tuning: Cross-modal instruction-following, leveraging synthetic and human-annotated datasets.
  • Unified, real-time assistants: End-to-end, low-latency applications in real-time speech, vision, and action.

The AR-Omni model demonstrates that an autoregressive transformer, with a unified token vocabulary and carefully designed optimization, can serve as a scalable backbone for such omni-modal agents, though further hybridization and data-driven advances are required for full parity with expert pipelines (Cheng et al., 25 Jan 2026).
