
Omni-modal Large Language Models

Updated 6 February 2026
  • Omni-LLMs are unified transformer models that ingest text, images, audio, and video as modality-tagged token sequences, supporting both strong uni-modal and cross-modal reasoning.
  • They integrate diverse data inputs through joint sequence modeling, enabling seamless performance on both standalone perception and complex multi-modal tasks.
  • Benchmarking with MMAO-Bench and the derived compositional law highlights the importance of balanced modality capabilities and deep fusion to overcome bottlenecks.

An omni-modal LLM (Omni-LLM) is a transformer-based model that ingests and jointly reasons over four core modalities—text, static images, audio, and video—by embedding, integrating, and decoding modality-tagged token sequences within a single, unified architecture. Omni-LLMs aim to preserve strong uni-modal competencies (e.g., reading, visual recognition, speech perception) while enabling complex cross-modal reasoning (e.g., answering “What is the person saying in the video, and which object are they referring to?”), establishing them as the next milestone in general-purpose AI beyond uni-modal or bi-modal designs.

1. Definition and Modalities Unification

Omni-LLMs, as formalized in recent research, extend traditional LLMs and vision-LLMs by ingesting sequences from the following modalities: text (T), images (I), video (V), and audio (A). Each input is mapped by a modality-specific encoder to sequences of tokens tagged by modality (e.g., ⟨IMG⟩, ⟨VID⟩, ⟨AUD⟩) and concatenated with standard text tokens as input to a single transformer backbone. This unification mechanism enables the model to perform both standalone (uni-modal) and cross-modal reasoning within any context window, using the same attention and positional encoding scheme for all tokens (Chen et al., 21 Oct 2025).
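As a concrete illustration, this token-tagging scheme can be sketched in a few lines of Python. The tag names follow the ⟨IMG⟩/⟨AUD⟩/⟨VID⟩ convention above, but the function names and placeholder tokens are illustrative, not the paper's actual tokenizer:

```python
# Minimal sketch of omni-modal token unification: each modality's encoder
# output is wrapped in open/close tags and concatenated into one sequence
# that a single transformer backbone consumes. Names are illustrative.

def tag_tokens(tokens, tag):
    """Wrap one modality's token sequence in open/close modality tags."""
    return [f"<{tag}>"] + tokens + [f"</{tag}>"]

def build_unified_stream(text_tokens, image_tokens=None,
                         audio_tokens=None, video_tokens=None):
    """Concatenate modality-tagged tokens into one transformer input."""
    stream = []
    if image_tokens:
        stream += tag_tokens(image_tokens, "IMG")
    if audio_tokens:
        stream += tag_tokens(audio_tokens, "AUD")
    if video_tokens:
        stream += tag_tokens(video_tokens, "VID")
    stream += text_tokens  # plain text tokens carry no modality tag
    return stream

stream = build_unified_stream(
    text_tokens=["What", "is", "shown", "?"],
    image_tokens=["img_0", "img_1"],
    audio_tokens=["aud_0"],
)
# One flat sequence: ['<IMG>', 'img_0', 'img_1', '</IMG>',
#                     '<AUD>', 'aud_0', '</AUD>', 'What', 'is', 'shown', '?']
```

Because all tokens land in one sequence, the backbone's attention and positional encoding apply uniformly across modalities, which is what enables cross-modal co-attention.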

The core requirements for such models are:

  • Preservation of strong uni-modal capacities across all input types.
  • Deep cross-modal integration at the level of token co-attention, enabling arbitrarily complex inference spanning any subset of modalities.

2. Benchmarking: MMAO-Bench and Evaluation Protocols

The MultiModal All-in-One Benchmark (MMAO-Bench) provides a comprehensive suite for evaluating omni-LLMs across both uni-modal and cross-modal tasks (Chen et al., 21 Oct 2025). MMAO-Bench consists of 1,880 human-curated QA pairs, spanning 44 task types and organized along two axes:

  • Perception: including object detection, attribute recognition, and alignment.
  • Reasoning: covering spatial, temporal, general STEM, and multi-step complex reasoning.

A distinctive feature is the inclusion of Multi-Step Open-Ended (MO) questions, in which a complex cross-modal task is decomposed into 2–4 interdependent sub-questions. Models must produce free-form text answers for each, scored across steps (up to 10 points per MO item). This design probes genuine cross-modal composition, and 98% of MMAO-Bench queries require true cross-modal information.

Primary metrics:

  • Uni-modal: accuracy on text-only, image-only, audio-only subsets.
  • Omni-modal:
    • Omni-MC: accuracy on cross-modal multiple-choice items.
    • Omni-MO: point average on multi-step open-ended items.

Scores are reported per model, per task cluster (perception vs reasoning), and overall.
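The two scoring rules above can be sketched as simple helpers, assuming a list-of-dicts representation for MO items (the "points" field name and the toy data are hypothetical, not from the benchmark's code):

```python
# Hypothetical scoring helpers mirroring the metric definitions above.

def mc_accuracy(predictions, answers):
    """Accuracy on multiple-choice items (uni-modal subsets and Omni-MC)."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def omni_mo_average(items):
    """Point average on multi-step open-ended (MO) items, each worth up to 10."""
    return sum(item["points"] for item in items) / len(items)

# Toy report: three MC predictions and two scored MO items.
acc = mc_accuracy(["B", "C", "A"], ["B", "C", "D"])  # 2 of 3 correct
mo = omni_mo_average([{"points": 8}, {"points": 5}])  # (8 + 5) / 2 = 6.5
```

In practice each MO item's points would be accumulated across its 2–4 sub-question steps before averaging.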

3. Compositional Law Between Uni-modal and Omni-modal Performance

A key empirical advance derived from MMAO-Bench is the compositional law relating omni-modal (cross-modal) performance (P_o) to the product of uni-modal visual (P_v) and audio (P_a) accuracies. Evaluated on eight diverse models, the law is:

P_o = C · (P_v × P_a)^γ

where C ≈ 1.0 and γ ≈ 1 for mid-to-high-performance models, but γ < 1 for lower-performing models (R² ≈ 0.95 on log-log regression) (Chen et al., 21 Oct 2025). This relationship implies that cross-modal performance is tightly predicted by the product of visual and audio skills, with the fitted C and γ serving as indicators of a model's integration capability.

  • Bottleneck Effect ("Short-Board" Law): If either P_v or P_a is low (notably < 30%), P_o collapses to the weaker value, indicating no cross-modal synergy.
  • Synergy Effect: For P_v, P_a ≳ 70%, P_o exceeds the product, indicating superadditive cross-modal capacity.
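The fit itself is ordinary least squares on logs, since log P_o = log C + γ log(P_v × P_a). A self-contained sketch, using synthetic scores that follow the law exactly rather than the paper's eight-model data:

```python
import math

def fit_compositional_law(pv, pa, po):
    """Fit P_o = C * (P_v * P_a)^gamma via least squares on logarithms.
    Inputs are per-model accuracies as fractions in (0, 1]."""
    xs = [math.log(v * a) for v, a in zip(pv, pa)]
    ys = [math.log(o) for o in po]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the log-log regression is gamma; the intercept gives log C.
    gamma = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    C = math.exp(my - gamma * mx)
    return C, gamma

# Synthetic data constructed to satisfy the law with C = 1, gamma = 1.
pv = [0.50, 0.60, 0.80]
pa = [0.60, 0.70, 0.80]
po = [v * a for v, a in zip(pv, pa)]
C_fit, gamma_fit = fit_compositional_law(pv, pa, po)  # both ≈ 1.0
```

On real model scores the same regression yields the fitted C and γ used to diagnose integration quality.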

Example quantitative data (from (Chen et al., 21 Oct 2025)):

Model               P_v    P_a    P_v × P_a   P_o (Omni-MC)   Δ_o
MiniCPM-o-2.6       41.9   63.1   26.4        26.8            +0.4
Qwen-2.5-omni-7B    52.8   68.1   36.0        33.2            -2.8
Gemini-2.5-Pro      80.1   78.7   63.0        76.6            +13.6

The residual Δ_o = P_o − (P_v × P_a) quantifies the degree of synergy or bottlenecking.
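The residual can be checked directly against the table above (scores converted from percent to fractions of 1):

```python
def synergy_residual(pv, pa, po):
    """Delta_o = P_o - P_v * P_a, with all scores as fractions in [0, 1]."""
    return po - pv * pa

# Rows from the table above: (P_v, P_a, P_o) per model, as fractions.
rows = {
    "MiniCPM-o-2.6":    (0.419, 0.631, 0.268),
    "Qwen-2.5-omni-7B": (0.528, 0.681, 0.332),
    "Gemini-2.5-Pro":   (0.801, 0.787, 0.766),
}
for name, (pv, pa, po) in rows.items():
    # Rescale back to percentage points to match the Delta_o column.
    print(name, round(100 * synergy_residual(pv, pa, po), 1))
```

The computed residuals reproduce the table's Δ_o column: +0.4, -2.8, and +13.6, placing only Gemini-2.5-Pro clearly in the synergy regime.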

4. Architectural and Training Paradigms

Omni-LLMs rely on architectural designs that ensure both strong modality-specific encoders and deep late fusion for joint multimodal representation, typically via transformers with L ≥ 12 fusion layers (Chen et al., 21 Oct 2025). Common features in recent models include:

  • Unified token stream: simultaneous input of all modality tokens into the transformer, processed identically during attention and sequence modeling.
  • Late fusion and self-supervised cross-modal objectives: masked audio-visual modeling is used for pretraining fusion layers.
  • Intermediate ASR/caption heads: audio reasoning is enhanced by internal generation of speech-to-text alignments (ablation studies demonstrate the value of such mechanisms).

A critical architectural insight is the necessity for balanced uni-modal mastery: to enable synergy and avoid short-board effects, encoder performance for each modality must reach at least 60–70% (Chen et al., 21 Oct 2025).

Recommendations for training and system design:

  • Adaptive Modality Dropout: train models to degrade gracefully if one modality is missing or corrupted.
  • Extend to further modalities: e.g., speech-to-gesture, 3D scenes.
  • Data-efficient sampling: employ clustering-guided compression to scale evaluation without excessive annotation cost.
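The first recommendation, adaptive modality dropout, can be sketched as a simple training-time preprocessing step. The dictionary keys and the drop rate below are assumed for illustration, not taken from the paper:

```python
import random

def modality_dropout(batch, p_drop=0.3, rng=None):
    """Return a copy of `batch` with some non-text modalities nulled out,
    so the model learns to degrade gracefully when a modality is missing.
    Text is never dropped, guaranteeing at least one surviving modality."""
    rng = rng or random.Random()
    out = dict(batch)
    for mod in ("image", "audio", "video"):
        if mod in out and out[mod] is not None and rng.random() < p_drop:
            out[mod] = None  # downstream code would insert a learned null embedding
    return out

batch = {"text": "describe the clip", "image": "img_feats", "audio": "aud_feats"}
# Seeded RNG for a reproducible example: here the image is dropped, audio kept.
dropped = modality_dropout(batch, p_drop=0.5, rng=random.Random(1))
```

Drawing the mask per modality (rather than per token) is what forces the fusion layers to handle entirely absent streams, the failure mode the recommendation targets.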

5. Implications for Model Development and Field Evolution

The identification of the compositional law and bottleneck/synergy effects in omni-modal intelligence provides clear goals for model design:

  • Optimization of the weakest modality is essential to elevate overall cross-modal performance and unlock potential for emergent omni-modal reasoning.
  • Fostering synergy: Sufficient fusion depth and joint representation learning can move models from the bottleneck regime into superadditive performance.
  • Unified evaluation suites: Benchmarks like MMAO-Bench accelerate research by enabling direct, fine-grained comparison of uni-modal, cross-modal, and compositional abilities.

MMAO-Bench serves as both a one-stop evaluation suite and a source for empirical laws of cross-modal intelligence, guiding the progression from shallow integration to deeply synergistic, balanced omni-modal LLMs. The compositional law and associated formalism provide researchers with predictive tools for diagnosing and improving multimodal models.

6. Future Challenges and Research Directions

Key open directions include:

  • Scaling to new modalities: integrating additional sensory streams (3D, haptics, gesture) using the established token-tagging and projection pipeline.
  • Robust reasoning: extending multi-step open-ended (MO) evaluation to STEM, programming, and physical reasoning queries tied to multimodal input.
  • Adaptive context length: maintaining performance across variable context window sizes and streaming settings.
  • Training regimes: exploring joint versus curriculum-based fusion and alignment, as well as reinforcement learning for long-horizon cross-modal tasks.

Extending the compositional law to new modalities and high-level reasoning can inform scaling laws for compositional AGI systems. Moreover, research in adaptive fusion and modality dropout will be necessary to ensure resilient intelligence under partial, noisy, or adversarial input conditions (Chen et al., 21 Oct 2025).


References:

  • "MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels" (Chen et al., 21 Oct 2025)
