Multimodal Large Models (MLMs)
- Multimodal Large Models are unified neural architectures that jointly process heterogeneous data streams such as text, images, audio, and video using modality-specific encoders and fusion mechanisms.
- They employ advanced fusion strategies—early, intermediate, late, or hybrid—along with training paradigms like self-supervised learning, Mixture-of-Experts, and parameter-efficient fine-tuning to boost performance.
- Practical applications of MLMs span text-to-image generation, visual question answering, audio processing, and recommendation tasks while addressing challenges like memory efficiency and semantic alignment.
Multimodal Large Models (MLMs) are advanced neural architectures designed for joint processing, alignment, and reasoning across heterogeneous data modalities. These models extend traditional LLMs by integrating visual, audio, video, or other sensory streams into a unified computational framework. The field has recently seen rapid progress, driven both by advances in scalable training on diverse multimodal corpora and by innovations in model architectures, efficient adaptation, and evaluation methodologies. Contemporary MLM research spans architectural design, optimization for bandwidth and hardware constraints, model composition, task-specific adaptation, and probing the conceptual limitations of cross-modal understanding.
1. Core Definitions and Architectural Principles
MLMs, sometimes termed Multimodal LLMs (MLLMs), are typically defined as large transformer-based models that jointly process input streams from two or more heterogeneous modalities, most commonly text and vision, but increasingly with support for audio, video, and other structured sensory data (Wang et al., 2 Aug 2024, Liu et al., 12 Jun 2025). The canonical architecture consists of:
- Modality encoders: Vision transformers (e.g., CLIP-ViT, ViT), speech encoders (e.g., Whisper, HuBERT), and text LLMs (BERT, LLaMA, GPT variants) encode raw data into modality-aligned embeddings.
- Feature projection: Each modality is projected into a shared latent space via learned linear or MLP adapters.
- Fusion mechanism: Feature vectors are fused via concatenation, cross-modal attention, or mixture-of-experts (MoE) modules, producing a joint sequence or pooled representation (Caffagni et al., 19 Feb 2024, Luo et al., 16 Jul 2025).
- Output decoders: Task-specific heads (language decoders for textual outputs, diffusion or GAN models for generative tasks) produce output in the target modality or modalities.
Fusion strategies are categorized as early, intermediate, late, or hybrid, according to whether and where in the pipeline modality interaction takes place (Wang et al., 2 Aug 2024, Caffagni et al., 19 Feb 2024). Enhanced capacity for long-context and multimodal reasoning is achieved through mechanisms such as hybrid transformer and state-space backbones (e.g., Mamba-based designs) (Zhou et al., 13 Nov 2024, Huang et al., 29 Jul 2024, Qiao et al., 20 Mar 2024), monolithic MoE layers (Luo et al., 16 Jul 2025), or explicit cross-modal alignment heads.
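The following is a minimal PyTorch sketch of the encoder-projector-fusion-decoder pattern described above, using early fusion by token concatenation; the module names, dimensions, and the stand-in backbone are illustrative placeholders rather than any cited model's components.

```python
# Minimal sketch of the encoder -> projector -> fusion pattern described above.
# All dimensions and module choices are illustrative, not tied to any specific MLM.
import torch
import torch.nn as nn


class ToyMultimodalModel(nn.Module):
    def __init__(self, vision_dim=768, text_vocab=32000, d_model=1024, n_layers=2):
        super().__init__()
        # Feature projection: map frozen vision-encoder outputs into the LLM space.
        self.vision_proj = nn.Sequential(
            nn.Linear(vision_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.text_embed = nn.Embedding(text_vocab, d_model)
        # Stand-in for the language backbone (early fusion by token concatenation).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, text_vocab)

    def forward(self, vision_feats, text_ids):
        # vision_feats: (B, N_img_tokens, vision_dim) from a frozen image encoder
        # text_ids:     (B, N_text_tokens) token ids
        img_tokens = self.vision_proj(vision_feats)
        txt_tokens = self.text_embed(text_ids)
        fused = torch.cat([img_tokens, txt_tokens], dim=1)  # joint token sequence
        hidden = self.backbone(fused)
        # Predict over the text portion of the fused sequence only.
        return self.lm_head(hidden[:, img_tokens.size(1):])


model = ToyMultimodalModel()
logits = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 32000])
```

In practice the vision encoder and language backbone would be large pretrained models, with the projector (and optionally lightweight adapters) serving as the main trainable glue between them.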
2. Training Paradigms, Objectives, and Model Specialization
MLMs are pre-trained and then specialized through a combination of objectives (Han et al., 29 May 2025, Caffagni et al., 19 Feb 2024):
- Self-Supervised Learning (SSL): Masked prediction or contrastive alignment between modalities is foundational. For instance, a CLIP-style InfoNCE contrastive loss aligns vision and text in a joint embedding space (Luo et al., 16 Jul 2025, Caffagni et al., 19 Feb 2024); a standard form of this loss is written out after this list.
- Supervised Fine-Tuning: Downstream tasks are optimized using cross-entropy or contrastive objectives, with adapters (LoRA, MWA, etc.) or entire modal backbones fine-tuned (Zhang et al., 6 May 2025, Long et al., 2023).
- Mixture-of-Experts (MoE): MoE architectures increase specialization and computational efficiency. Each expert processes tokens from a specific modality or subtask, optionally with learned dynamic gating (Luo et al., 16 Jul 2025, Li et al., 5 Aug 2024).
- Parameter-Efficient Fine-Tuning (PEFT): Low-rank updates (LoRA), adapters, and prompt tuning allow rapid adaptation to new modalities or tasks without full-model retraining (Long et al., 2023, Chen et al., 20 Feb 2024); a minimal LoRA-style sketch appears at the end of this section.
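For the CLIP-style contrastive objective mentioned in the SSL item above, a standard symmetric InfoNCE formulation (notation chosen here for illustration; the cited works may differ in details such as the similarity function or temperature handling) is

$$
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\big(\mathrm{sim}(v_i,t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_i,t_j)/\tau\big)} + \log\frac{\exp\big(\mathrm{sim}(v_i,t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_j,t_i)/\tau\big)}\right],
$$

where $v_i$ and $t_i$ are the embeddings of the $i$-th paired image and text, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $\tau$ is a (typically learned) temperature, and $N$ is the batch size; matched image-text pairs are pulled together while mismatched pairs in the batch are pushed apart.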
Specialized data regimes (e.g., inclusion of multi-grained concept annotations (Xu et al., 8 Dec 2024), multi-image programmatically generated instructions (Zhang et al., 9 Dec 2024), or modality-specific token compression (Huang et al., 19 Oct 2025)) are increasingly employed to strengthen both breadth and depth of multi-modal concept learning.
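As a concrete illustration of the PEFT idea in the list above, the following is a minimal sketch of a LoRA-style low-rank update wrapped around a frozen linear layer; the class name, rank, and scaling are illustrative choices, not taken from the cited works.

```python
# Minimal LoRA-style wrapper: the frozen base weight W is augmented with a
# trainable low-rank update B @ A, so only r*(d_in + d_out) parameters are tuned.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling


layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 1024 low-rank parameters instead of ~1M frozen ones
```

Only the two low-rank matrices are trainable, which is what keeps adaptation cheap relative to full fine-tuning of the backbone.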
3. Efficiency, Scalability, and Communication Constraints
MLMs often face severe efficiency bottlenecks due to high computational cost and bandwidth demands:
- Token Communication and Compression: In resource-constrained deployments (e.g., edge networks), MLMs must minimize communication overhead. Token communication-driven frameworks propose splitting model inference/fine-tuning between devices and central infrastructure, transmitting only compressed, semantically aligned tokens over the channel. Techniques such as sliding-window token selection and contrastive split fine-tuning are shown to achieve 13.7 percentage point improvements in task accuracy under wireless constraints (Zhang et al., 6 May 2025).
- Linear Complexity Backbones: Large-scale vision-language modeling with traditional transformers scales quadratically in sequence length. Substituting state-space models (e.g., Mamba-2) enables linear complexity in both vision and text, greatly accelerating inference and reducing memory footprint for long sequences or high-resolution images (Huang et al., 29 Jul 2024, Qiao et al., 20 Mar 2024, Zhou et al., 13 Nov 2024).
- Attention/Adapter Pruning and Token Compression: Pruning redundant attention layers and compressing visual tokens adaptively (e.g., MVTC module) substantially reduces inference cost and memory while maintaining accuracy in knowledge graph tasks (Huang et al., 19 Oct 2025).
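As a hedged illustration of adaptive visual-token compression (a simplified stand-in, not the actual MVTC module from the cited work), the sketch below keeps only the top-k visual tokens ranked by a learned importance score before they reach the language backbone.

```python
# Illustrative top-k visual token compression: score each visual token, keep
# the k highest-scoring tokens, and drop the rest before the LLM sees them.
import torch
import torch.nn as nn


class TopKTokenCompressor(nn.Module):
    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # learned per-token importance score
        self.keep = keep

    def forward(self, tokens):  # tokens: (B, N, dim)
        scores = self.scorer(tokens).squeeze(-1)           # (B, N)
        idx = scores.topk(self.keep, dim=1).indices        # (B, keep)
        idx = idx.sort(dim=1).values                       # preserve original order
        batch = torch.arange(tokens.size(0)).unsqueeze(1)  # (B, 1)
        return tokens[batch, idx]                          # (B, keep, dim)


compressor = TopKTokenCompressor(dim=1024, keep=64)
out = compressor(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 64, 1024])
```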
4. Functional Capabilities and Task Domains
MLMs now span a wide diversity of output modalities and task domains (Han et al., 29 May 2025, Wang et al., 2 Aug 2024):
- Text-to-Text, Text-to-Image, and Text-to-Video: Language generation, image captioning, visual question answering (VQA), text-conditioned image/video generation, spatial/temporal concept localization.
- Audio and Music Generation: Speech-to-text, text-to-speech, text-to-music, and more recently text-conditioned human motion generation.
- Graph and Recommendation Tasks: Integration with multimodal knowledge graphs (Huang et al., 19 Oct 2025, Liu et al., 12 Jun 2025) and sequential recommendation (Ye et al., 19 Aug 2024) leverages both visual/textual entity attributes and temporal dynamics.
- Instruction Following and Reasoning: Unified-MLLMs (Li et al., 5 Aug 2024) and models like Mono-InternVL-1.5 (Luo et al., 16 Jul 2025) can serve as task routers, emitting special tokens that direct external expert modules for segmentation, editing, or generative subtasks.
Notably, hybrid approaches allow the LLM output stream to include both natural-language and structured control/grounding tokens, facilitating seamless expert orchestration for compositional, multi-step, and modular tasks.
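A minimal sketch of this control-token routing pattern follows; the token names, registry, and expert callables are hypothetical and meant only to illustrate how decoded special tokens can dispatch work to external expert modules, not to reproduce UnifiedMLLM's actual interface.

```python
# Hypothetical routing of special task tokens emitted by the LLM to expert modules.
# Token names, the registry, and the expert callables are illustrative only.
import re
from typing import Callable, Dict

EXPERTS: Dict[str, Callable[[str], str]] = {}

def register_expert(task_token: str):
    """Register an expert callable under a given task token."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXPERTS[task_token] = fn
        return fn
    return decorator

@register_expert("<SEG>")
def segmentation_expert(grounding: str) -> str:
    return f"[segmentation mask for: {grounding}]"

@register_expert("<EDIT>")
def editing_expert(grounding: str) -> str:
    return f"[edited image following: {grounding}]"

def route(llm_output: str) -> str:
    """Replace '<TASK>{grounding}' spans in the decoded text with expert results."""
    pattern = re.compile(r"(<[A-Z]+>)\{(.*?)\}")
    def dispatch(match: re.Match) -> str:
        token, grounding = match.group(1), match.group(2)
        expert = EXPERTS.get(token)
        return expert(grounding) if expert else match.group(0)
    return pattern.sub(dispatch, llm_output)

print(route("Sure, here it is: <SEG>{the red car on the left}"))
```

The design choice this illustrates is that adding a new capability only requires registering a new token and expert pair, rather than retraining the backbone.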
5. Empirical Evaluation, Performance, and Limitations
Comprehensive benchmarks reveal several key performance trends and limitations:
- Benchmarking Large vs. Small MLMs: Small MLMs (e.g., LLaVA-series, Phi-3-Vision) achieve parity with large MLMs in simple recognition tasks and domain-specific applications but lag on deep reasoning, fine-grained localization, and long-context tasks (e.g., color recognition: GPT-4o 63.6% vs. LLaVA-NeXT 28.6%; temporal ordering: GPT-4o 66.7%, LLaVA-NeXT <30%) (Feng et al., 4 Jan 2025).
- Graph Learning: Joint text–vision fusion outperforms unimodal models by up to 26 percentage points in node-classification accuracy; fine-tuned standalone MLMs attain the highest absolute accuracy, with only marginal (<2 pp) gains from explicit graph structure except in very dense graphs (Liu et al., 12 Jun 2025).
- Token Compression and Pruning: Efficient MLMs using techniques such as MVTC and attention pruning achieve up to 4.5× speedup in wall-clock inference for knowledge graph completion with negligible accuracy tradeoff (Huang et al., 19 Oct 2025).
- Task Adaptation: Deep adapters (MWA) can nearly match full fine-tuning (≤1% retrieval drop), with up to 57% reduction in training time and 20% less memory; ablations confirm that deep alignment enhancements and parameter decoupling are critical (Long et al., 2023, Chen et al., 20 Feb 2024).
- Conceptual Limits: Although SoTA MLMs excel at surface-level visual perception (>90% accuracy), a persistent perception–comprehension gap (e.g., ~16% error in sarcasm inference) remains due to weaknesses in context integration and pragmatic reasoning (Zhang et al., 29 May 2025).
- Instruction Data: Programmatically generated, graph-structured instruction datasets (e.g., ProVision-10M) yield up to 8% performance gains on vision-centric tasks, outperforming instruction data generated purely by large LLMs by guaranteeing symbolic grounding and eliminating hallucination (Zhang et al., 9 Dec 2024).
6. Model Composition, Modularity, and Integration
Model composition represents a paradigm for constructing versatile MLMs by merging pretrained modality-specific encoders and LLMs, facilitating rapid multi-modal extension without expensive retraining (Chen et al., 20 Feb 2024). Approaches such as NaiveMC and DAMC partition the LLM parameters, align them via averaging or decoupled weighted fusion, and empirically outperform ad-hoc modular adapters, especially in multi-domain tasks (e.g., MCUB-4: DAMC 60.08%, NaiveMC 54.03%).
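A hedged sketch of the parameter-averaging idea behind NaiveMC-style composition is given below; the published methods, including DAMC's decoupled weighted fusion, are more involved, and the module names and shapes here are toy placeholders.

```python
# Illustrative naive composition: average the shared-LLM parameters of two
# modality-specialized checkpoints that were fine-tuned from the same base model.
import torch
import torch.nn as nn


def naive_merge(state_a: dict, state_b: dict) -> dict:
    """Element-wise average of two state dicts with identical keys and shapes."""
    assert state_a.keys() == state_b.keys()
    return {k: (state_a[k] + state_b[k]) / 2.0 for k in state_a}


# Two toy "LLM backbones" standing in for vision- and audio-tuned checkpoints.
vision_tuned = nn.Linear(16, 16)
audio_tuned = nn.Linear(16, 16)

merged = nn.Linear(16, 16)
merged.load_state_dict(naive_merge(vision_tuned.state_dict(), audio_tuned.state_dict()))
print(torch.allclose(merged.weight, (vision_tuned.weight + audio_tuned.weight) / 2))
```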
Modular frameworks such as UnifiedMLLM extend this by allowing MLMs to emit task and grounding tokens during decoding, with a lightweight routing mechanism delegating input to the appropriate external expert modules. This design provides strong scalability, as adding new tasks or domains only requires defining and registering new token pairs, removing the need for full model retraining (Li et al., 5 Aug 2024).
Notably, dynamic hybridization of transformer blocks, state-space modules, and mixture-of-experts adds further modularity, supporting efficient scaling to very long contexts and diverse modality mixtures (Zhou et al., 13 Nov 2024, Luo et al., 16 Jul 2025).
7. Open Challenges and Future Directions
Despite rapid progress, several key challenges persist (Zhang et al., 29 May 2025, Han et al., 29 May 2025, Wang et al., 2 Aug 2024):
- Explainability and Interpretability: Opaque fusion mechanisms make it difficult to trace how each input modality contributes to the output, inhibiting trust in safety- and mission-critical deployments.
- Data Efficiency and Coverage: Under-resourced modalities, noisy pairings, and limited scale in available multi-modal instructions slow generalization and multimodal transfer.
- Computational/Memory Demands: Quadratic scaling in transformer architectures strains hardware resources; state-space and MoE models partially address this but require further optimization for very high resolution, video, and temporal tasks.
- Semantic Alignment and Reasoning: MLMs master surface alignment but exhibit persistent deficits in pragmatic inference, context integration, and cross-modal structured reasoning (e.g., understanding sarcasm, abstract scene dynamics).
- Unification and Extensibility: Moving towards seamless any-to-any multimodal architectures and enabling task-agnostic representation without performance regressions remains an open research direction.
Proposed solutions include explicit reasoning and affective modules, dynamic expert specialization and routing, programmatic instruction generation, efficient hybrid architectures (SSM + transformer + MoE), and continual multi-modal curriculum learning. Evaluation protocols are moving towards multi-grained concept and programmatically generated QA datasets to better probe reasoning depth and robustness.
References:
- (Zhang et al., 6 May 2025)
- (Zhang et al., 29 May 2025)
- (Liu et al., 12 Jun 2025)
- (Luo et al., 16 Jul 2025)
- (Han et al., 29 May 2025)
- (Huang et al., 19 Oct 2025)
- (Huang et al., 29 Jul 2024)
- (Chen et al., 20 Feb 2024)
- (Long et al., 2023)
- (Wang et al., 2 Aug 2024)
- (Xu et al., 8 Dec 2024)
- (Zhou et al., 13 Nov 2024)
- (Caffagni et al., 19 Feb 2024)
- (Li et al., 5 Aug 2024)
- (Zhang et al., 9 Dec 2024)
- (Feng et al., 4 Jan 2025)
- (Qiao et al., 20 Mar 2024)
- (Ye et al., 19 Aug 2024)
- (Zhang et al., 29 Jul 2024)