Omni-LLMs: Unified Multimodal Transformers
- Omni-LLMs are unified transformer-based architectures that process and fuse diverse explicit and implicit modalities—text, images, audio, video, and conceptual entities—for comprehensive reasoning.
- They employ modality-specific encoders and joint latent spaces to integrate structured and unstructured data, facilitating seamless cross-modal attention and compositional processing.
- They feature progressive multi-stage training paradigms, modular fusion mechanisms, and scalable system designs that drive advances in multimodal machine intelligence and real-time inference.
Omni-modal large language models (Omni-LLMs) are unified, transformer-based architectures that extend the boundaries of multimodal language modeling to jointly process, understand, and generate across all explicit modalities—text, image, audio, video—and, in advanced formulations, arbitrary conceptual entities. These systems are characterized by their modality-agnostic tokenization pipelines and joint latent spaces, enabling seamless fusion and reasoning over heterogeneous structured and unstructured data. The development of Omni-LLMs marks a paradigm shift from conventional multimodal LLMs, positioning them as foundational models for multi-sensory machine intelligence, interactive agents, and cross-modal cognitive tasks.
1. Formal Definition and Distinctive Capabilities
An Omni-LLM is a transformer-based model whose input space encompasses both explicit modalities (e.g., text, image, audio, video) and implicit, conceptual modalities—such as geospatial coordinates, temporal intervals, numerical entities, and organizational or financial entities—each mapped to a finite sequence of unified embedding tokens via modality-specific encoders (Unlu et al., 2023). This generalizes beyond traditional “modality token” approaches, treating every information source or extractible entity as a first-class modality in the model’s computational pipeline.
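To make this tokenization idea concrete, the minimal PyTorch sketch below (with invented dimensions, adapter names, and token counts, and with pooled random features standing in for real ViT/Whisper encoders) shows how an image, an audio clip, and a conceptual "date" entity could each be projected to a few tokens in a shared embedding space and concatenated with ordinary text embeddings before entering a shared backbone. It illustrates the general recipe rather than any specific model's pipeline.

```python
import torch
import torch.nn as nn

D_MODEL = 512  # shared embedding width (illustrative)

class ModalityAdapter(nn.Module):
    """Maps one modality's pooled features to a short token sequence in the joint space."""
    def __init__(self, in_dim: int, n_tokens: int, d_model: int = D_MODEL):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(in_dim, n_tokens * d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, in_dim) pooled output of a modality-specific encoder
        return self.proj(feats).view(feats.size(0), self.n_tokens, -1)

# Hypothetical adapters: image patches, audio clip, and an implicit "date" entity.
image_adapter = ModalityAdapter(in_dim=768, n_tokens=16)   # e.g. pooled ViT features
audio_adapter = ModalityAdapter(in_dim=1024, n_tokens=8)   # e.g. pooled Whisper features
date_adapter  = ModalityAdapter(in_dim=6, n_tokens=1)      # e.g. (year, month, day, ...) features

text_embed = nn.Embedding(32_000, D_MODEL)                 # ordinary text token embeddings

batch = 2
text_ids   = torch.randint(0, 32_000, (batch, 20))
image_feat = torch.randn(batch, 768)
audio_feat = torch.randn(batch, 1024)
date_feat  = torch.randn(batch, 6)

# Every information source becomes tokens in one joint sequence for the shared backbone.
joint_seq = torch.cat(
    [image_adapter(image_feat), audio_adapter(audio_feat),
     date_adapter(date_feat), text_embed(text_ids)],
    dim=1,
)
print(joint_seq.shape)  # (batch, 16 + 8 + 1 + 20, D_MODEL)
```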
Salient capabilities distinguishing Omni-LLMs include:
- Tokenization of any entity type, explicit or implicit, into a joint embedding space.
- Unified attention and fusion across modalities, including complex interleaving (text interleaved with images, speech, entities, etc.).
- Structured, stepwise compositional reasoning that integrates global context over all modalities, with explicit mechanisms to prevent “shortcut” behavior—answering solely from textual priors (Yang et al., 26 Jun 2025).
- Native any-to-any operation: understanding arbitrary combinations of input modalities and producing text, audio, or image output streams (Guo et al., 26 Feb 2025, Li et al., 16 Nov 2025).
2. Architectural Principles and Modal Fusion
Omni-LLMs utilize a modular, modality-specific encoder for each modality in the supported set. These encoders map modality-specific data (e.g., pixels, waveforms, structured attributes) into embedding sequences in a shared latent space.
The canonical fusion mechanism is concatenation or prepending of all modality tokens to the main language sequence, with the joint context processed by a shared transformer backbone. Internally, variants include:
- Explicit fusion heads for pooled modality embeddings, using weighted sum, concatenation plus linear projection, or cross-modal attention (Unlu et al., 2023); see the sketch after this list.
- Interleaved positionally-aware tokens to preserve temporal and spatial structure (e.g., 3D RoPE for alignment in Uni-MoE-2.0-Omni) (Li et al., 16 Nov 2025).
- Dual-track “brain-mouth” designs to decouple reasoning and generation streams (MGM-Omni), enabling low-latency, real-time output while maintaining global context (Wang et al., 29 Sep 2025).
- Mixture-of-experts (MoE) with dynamic capacity allocation, null experts for skipping computation, and shared experts for modality-agnostic backbone support (Li et al., 16 Nov 2025).
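To illustrate the fusion-head variants in the first bullet above, the toy PyTorch sketch below implements a learned weighted sum and a single-query cross-modal attention pooling over pooled per-modality embeddings. Shapes and module names are assumptions for illustration, not the fusion code of any cited model.

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Fuses pooled per-modality embeddings with learned softmax weights."""
    def __init__(self, n_modalities: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, n_modalities, d_model)
        w = torch.softmax(self.logits, dim=0)          # (n_modalities,)
        return (w[None, :, None] * pooled).sum(dim=1)  # (batch, d_model)

class CrossAttentionFusion(nn.Module):
    """A single learned query attends over the modality embeddings (cross-modal attention)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(pooled.size(0), -1, -1)
        fused, _ = self.attn(q, pooled, pooled)        # (batch, 1, d_model)
        return fused.squeeze(1)

pooled = torch.randn(4, 3, 512)   # 4 examples, 3 modalities, d_model=512 (illustrative)
print(WeightedSumFusion(3)(pooled).shape, CrossAttentionFusion(512)(pooled).shape)
```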
Notable models applying these principles include OpenOmni (Luo et al., 8 Jan 2025), Capybara-OMNI (Ji et al., 10 Apr 2025), Baichuan-Omni (Li et al., 11 Oct 2024), MGM-Omni (Wang et al., 29 Sep 2025), Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025), and InteractiveOmni (Tong et al., 15 Oct 2025). Each adopts modular vision and audio frontends (e.g., CLIP-ViT, InternViT, QFormer + Whisper), linear projectors to the shared latent space, and per-modality adapters for efficient fusion.
3. Training Paradigms and Optimization Objectives
Omni-LLMs are trained via multi-stage, progressive alignment:
- Stage 1: Separate bi-modal alignments (text–image, text–audio) with the LLM frozen and only the adapters trained to match text representations, often using standard cross-entropy; this leverages abundant bi-modal corpora and pivots all modalities through text (Luo et al., 8 Jan 2025). A freeze/unfreeze sketch follows this list.
- Stage 2: Joint multimodal fine-tuning, unfreezing the LLM and applying cross-entropy losses to arbitrary modality combinations; step- or dynamically-adaptive balancing handles imbalanced data sizes and convergence rates (Guo et al., 26 Feb 2025).
- Stage 3: Instruction tuning and reinforcement learning, with specialized objectives for reasoning. For instance, HumanOmniV2 applies a reward decomposition over format, context summary, logical integration, and answer accuracy, using LLM-judged rewards and GRPO/DPO variants for behavioral alignment (Yang et al., 26 Jun 2025, Li et al., 16 Nov 2025).
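The minimal PyTorch sketch below (toy modules, illustrative stage numbering) shows the freeze/unfreeze schedule underlying Stages 1 and 2: adapters are always trainable, while the LLM backbone is frozen during bi-modal alignment and unfrozen for joint fine-tuning. It is not taken from any cited codebase.

```python
import torch.nn as nn

def configure_stage(llm: nn.Module, adapters: nn.ModuleDict, stage: int) -> list:
    """Return the parameters to optimize for a given (illustrative) training stage.

    Stage 1: LLM frozen, only modality adapters are trained (bi-modal alignment).
    Stage 2+: LLM unfrozen for joint multimodal fine-tuning / instruction tuning.
    """
    llm_trainable = stage >= 2
    for p in llm.parameters():
        p.requires_grad = llm_trainable
    for p in adapters.parameters():
        p.requires_grad = True
    return [p for p in list(llm.parameters()) + list(adapters.parameters()) if p.requires_grad]

# Toy stand-ins for a real LLM backbone and per-modality adapters.
llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, nhead=4, batch_first=True), 2)
adapters = nn.ModuleDict({"vision": nn.Linear(768, 256), "audio": nn.Linear(1024, 256)})

for stage in (1, 2):
    params = configure_stage(llm, adapters, stage)
    print(f"stage {stage}: {sum(p.numel() for p in params):,} trainable parameters")
```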
Typical objectives include:
- Masked Language Modeling (MLM) for anchoring text comprehension within a multimodal context (Unlu et al., 2023).
- Cross-modal contrastive alignment for paired modality representations (a minimal sketch follows this list).
- Entity reconstruction losses for implicit modalities, enforcing algebraic/topological preservation.
- Direct Preference Optimization (DPO) or Group Sequence Policy Optimization (GSPO) for preference-based reinforcement learning over complex output spaces.
- CTC-style losses for efficient non-autoregressive speech token generation (Luo et al., 8 Jan 2025, Wang et al., 29 Sep 2025).
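As one concrete example of the objectives above, the sketch below implements a symmetric InfoNCE-style cross-modal contrastive loss over paired pooled embeddings; this is a common formulation, and the temperature and batch size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb: torch.Tensor,
                               other_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired (text, other-modality) embeddings together.

    text_emb, other_emb: (batch, d) pooled embeddings; row i of each forms a positive pair,
    while all other rows in the batch serve as negatives.
    """
    t = F.normalize(text_emb, dim=-1)
    o = F.normalize(other_emb, dim=-1)
    logits = t @ o.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```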
“Modality-robust” preservation—reserving a fraction of batches for pure-text updates—prevents catastrophic forgetting of language skills when the backbone is fully unfrozen (Guo et al., 26 Feb 2025).
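A minimal sketch of this idea follows, with an assumed 10% text-only fraction: interleave pure-text training steps among multimodal ones so that language-only updates continue after the backbone is unfrozen.

```python
import random

def build_batch_schedule(n_steps: int, text_fraction: float = 0.1, seed: int = 0) -> list:
    """Reserve a fraction of training steps for pure-text batches (illustrative values).

    Interleaving text-only updates with multimodal ones is one simple way to guard
    against catastrophic forgetting of language skills once the backbone is unfrozen.
    """
    rng = random.Random(seed)
    return ["text-only" if rng.random() < text_fraction else "multimodal"
            for _ in range(n_steps)]

schedule = build_batch_schedule(1000)
print(schedule[:10], schedule.count("text-only") / len(schedule))
```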
4. Modalities, Extension to Implicit Entities, and Representation Strategies
The set of admissible modalities in Omni-LLMs is open-ended, covering:
- Explicit: text, image (patches or grids), video (sampled frames), speech/audio (raw waveform, Mel-spectrogram, audio tokens).
- Implicit/conceptual: geospatial objects (polygons, graphs), temporal intervals (dates, periods), numbers (scalars, distributions), structured organizations (tabular/graphical metadata), and, recursively, any semantic entity detected and encoded via entity-specific adapters (Unlu et al., 2023).
Integration of conceptual entities proceeds through:
- NER-style detection and routing to dedicated per-entity-type encoders (geospatial, temporal, numeric, etc.); see the sketch after this list.
- Recursive tokenization, where complex entities (e.g., dates nested inside geospatial addresses) trigger cascaded encoding passes.
- Embedding designs that algebraically preserve geometric, temporal, and numeric relationships, using graph neural networks (for OSM/transport data), domain-specific features, and auxiliary reconstruction losses (Unlu et al., 2023).
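The sketch below gives a toy version of this routing pipeline, with regex-based "NER" for dates and numbers and small linear layers standing in for real entity-specific encoders; the patterns, feature layouts, and dimensions are illustrative assumptions rather than the method of Unlu et al.

```python
import re
import torch
import torch.nn as nn

# Hypothetical per-entity-type encoders; a real system would use trained NER models and
# dedicated geospatial/temporal/numeric encoders (graph nets, feature extractors, ...).
ENTITY_ENCODERS = nn.ModuleDict({
    "date":   nn.Linear(3, 256),   # (year, month, day) features
    "number": nn.Linear(1, 256),
})

PATTERNS = {
    "date":   re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),
    "number": re.compile(r"\b\d+(?:\.\d+)?\b"),
}

def route_entities(text: str) -> list:
    """Detect entities, route each to its encoder, and return (span, type, token) triples."""
    tokens = []
    remaining = text
    # Dates first; mask them out so their digits are not re-matched as plain numbers.
    # A nested entity (a date inside a larger address span) would likewise be
    # re-encoded by the 'date' encoder on a cascaded pass.
    for m in PATTERNS["date"].finditer(remaining):
        feats = torch.tensor([[float(g) for g in m.groups()]])
        tokens.append((m.group(), "date", ENTITY_ENCODERS["date"](feats)))
    remaining = PATTERNS["date"].sub(" ", remaining)
    for m in PATTERNS["number"].finditer(remaining):
        feats = torch.tensor([[float(m.group())]])
        tokens.append((m.group(), "number", ENTITY_ENCODERS["number"](feats)))
    return tokens

for span, etype, tok in route_entities("Opened on 2021-06-15 with 42 branches."):
    print(span, etype, tuple(tok.shape))
```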
This “omni-modality” approach subsumes prior multimodal and knowledge-grounded systems, permitting fluid interchange between data-driven and symbolic reasoning.
5. Benchmark Results and Comparative Performance
Recent models demonstrate competitive or state-of-the-art performance on a wide range of benchmarks:
- M2-omni (72B) achieves 75.1% on OpenCompass vision-text, 69.6 on MVBench (video), 2.07–5.29% WER on LibriSpeech/Aishell, and 49.2 CIDEr on AudioCaps (Guo et al., 26 Feb 2025).
- Baichuan-Omni (7B) approaches leading results on MMLU, CMMLU, AGIEval, SEED-IMG, MMMU, RealWorldQA, and VQA; surpasses others on Fleurs/WenetSpeech ASR and multi-modal AIR-Bench (Li et al., 11 Oct 2024).
- Capybara-OMNI (7B) attains 68.2 average across image/multimodal QA, 65.7–70.0 (video), 2.1–5.2 CER/WER (audio), matching or surpassing Qwen2-VL and VITA (Ji et al., 10 Apr 2025).
- OpenOmni surpasses VITA on OmniBench by 3.95 absolute points with ∼1/5 the data (Luo et al., 8 Jan 2025).
- HumanOmniV2 sets new standards on context-rich multimodal reasoning and intention/emotion QA, with explicit ablations showing each reward component’s incremental value (Yang et al., 26 Jun 2025).
- InteractiveOmni (4B, 8B) matches/exceeds 7B+ open baselines on OmniBench, multi-turn memory (MMMB: up to 58.2 avg) and TTS (test-zh WER 1.37) (Tong et al., 15 Oct 2025).
- Uni-MoE-2.0-Omni establishes SOTA or near-SOTA across 85 benchmarks, exceeding Qwen2.5-Omni by +7pp in omni-modal and video understanding, and reducing ASR WER by 47% on long-form input (Li et al., 16 Nov 2025).
For emotion recognition and cognitive state classification, zero-shot Omni-LLMs rival or surpass fine-tuned audio models, especially when explicit chain-of-thought or acoustic prompting is used—audio-only weighted-F1 on IEMOCAP: 51.8% for GPT-4o-Audio, outperforming Gemini and Phi-4-Multimodal (Murzaku et al., 27 Mar 2025).
6. Systems, Scalability, and Engineering Innovations
Scaling and efficiently training Omni-LLMs requires sophisticated system designs to manage architectural heterogeneity and communication overhead:
- Model-centric distributed recipes (VeOmni) decouple component logic from parallelization strategy, allowing dynamic composition of data/sequence/expert (3D) parallelism across modalities (see the sketch after this list). Near-linear scaling to 128 GPUs and context lengths of 160,000 tokens at >80% efficiency has been demonstrated (Ma et al., 4 Aug 2025).
- Chunk-based parallel decoding (MGM-Omni) addresses token-rate mismatches for real-time, long-form speech generation without sacrificing prosodic stability (Wang et al., 29 Sep 2025).
- Progressive, curriculum-based addition of new modalities, continuous curriculum learning, and modular adapter/block APIs enable extensible architectures without rewriting distributed code.
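To make the decoupling idea from the first bullet concrete, the toy sketch below separates a modular omni-model definition from a "parallel plan" applied afterwards. The plan here only tags submodules, a placeholder for where a real recipe would wrap them in FSDP/tensor/expert-parallel containers; all names are assumptions and do not reflect VeOmni's actual API.

```python
import torch.nn as nn

class ToyOmniModel(nn.Module):
    """Model definition knows nothing about how it will be parallelized."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "vision": nn.Linear(768, d_model),
            "audio":  nn.Linear(1024, d_model),
        })
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )

def apply_parallel_plan(model: nn.Module, plan: dict) -> nn.Module:
    """Decide, per submodule name, which (mock) parallel strategy to use.

    A real system would wrap the chosen submodules in FSDP/tensor/expert-parallel
    containers; here we only record the decision to illustrate the decoupling.
    """
    for name, module in model.named_children():
        strategy = plan.get(name, "replicate")
        setattr(module, "_parallel_strategy", strategy)  # placeholder for a real wrapper
        print(f"{name}: {strategy}")
    return model

model = ToyOmniModel()
# The recipe lives outside the model code, so swapping strategies never touches it.
apply_parallel_plan(model, {"backbone": "fsdp+sequence-parallel", "encoders": "replicate"})
```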
Best practices emerging from large-scale experiments emphasize:
- Isolating distributed training strategy from architectural implementation.
- Overlapping communication and computation to maintain hardware efficiency as context and parameter count grow.
- Maintaining pure-text fluency and preventing modality collapse through data and loss balancing.
7. Open Challenges and Future Directions
Despite rapid progress, several open problems persist:
- Entity detection and routing: robust, automatic tagging of arbitrary conceptual entities in open-domain streams remains unsolved. Lightweight router heads and recursive encoding strategies are promising but require further study (Unlu et al., 2023).
- Data scarcity: modalities such as environmental sound, music, haptics, or high-level graph entities lack large, high-quality paired datasets; targeted data creation is necessary.
- Stability of multi-modal RL: reinforcement learning in multimodal and high-heterogeneity settings remains unstable without careful staged curricula and hybrid objectives (e.g., GSPO-DPO (Li et al., 16 Nov 2025), multi-reward decompositions (Yang et al., 26 Jun 2025)).
- Real-time and resource efficiency: inference for large (10B–70B+) parameter Omni-LLMs at sub-second latency, especially with on-device deployment, is an ongoing engineering challenge (Tong et al., 15 Oct 2025).
- Metadata and external knowledge: dynamically updating representations of evolving real-world entities (e.g., corporations, public infrastructure) calls for API/database integration.
- Extensibility: generalizing the “mouth”/decoder paradigm to new output spaces (code, music, video, olfactory) demands standardized, modular interfaces and universal token spaces.
A plausible implication is that advances in modular routing, scalable system recipes, and unified embedding strategies will eventually enable truly open-ended, self-updating Omni-LLMs capable of fluid integration of any structured or unstructured modality, setting the foundation for future machine cognition and human–machine interfaces (Unlu et al., 2023, Luo et al., 8 Jan 2025, Guo et al., 26 Feb 2025, Li et al., 16 Nov 2025).
References
- "Entity Embeddings : Perspectives Towards an Omni-Modality Era for LLMs" (Unlu et al., 2023)
- "OpenOmni: Advancing Open-Source Omnimodal LLMs with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis" (Luo et al., 8 Jan 2025)
- "MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech" (Wang et al., 29 Sep 2025)
- "VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo" (Ma et al., 4 Aug 2025)
- "HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context" (Yang et al., 26 Jun 2025)
- "Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal LLMs" (Ji et al., 10 Apr 2025)
- "M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance" (Guo et al., 26 Feb 2025)
- "Baichuan-Omni Technical Report" (Li et al., 11 Oct 2024)
- "OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs" (Murzaku et al., 27 Mar 2025)
- "Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data" (Li et al., 16 Nov 2025)
- "InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue" (Tong et al., 15 Oct 2025)