Multimodal Large Language Models (mLLMs)

Updated 18 December 2025
  • Multimodal Large Language Models (mLLMs) are foundation models that fuse language models with modality-specific encoders to process text, images, audio, and more.
  • They leverage cross-modal representation learning through early, intermediate, and late fusion mechanisms, aligning diverse signals into a shared embedding space.
  • Applications range from visual dialogue and captioning to embodied reasoning, while challenges include robustness, interpretability, and efficient multimodal training.

Multimodal LLMs (mLLMs) are foundation models designed to process, align, and generate signals across multiple data modalities, including text, images, audio, video, and 3D structure. By extending the successes of text-only LLMs such as GPT, LLaMA, and Vicuna, mLLMs leverage cross-modal representation learning, reasoning, and large-scale pretraining. This unified approach enables the development of general-purpose, instruction-following agents with capabilities previously unattainable by single-modality systems, supporting a spectrum of applications from visual dialogue and captioning to cross-modal generation and embodied reasoning (Han et al., 29 May 2025, Carolan et al., 2024, Song et al., 2023).

1. Formal Definition, Scope, and Motivations

A Multimodal LLM (mLLM) is a neural architecture that fuses a pretrained LLM with modality-specific encoders (e.g., vision, audio) and corresponding fusion modules, training all or parts of the resulting model such that it can perform cross-modal understanding and generation. The canonical mLLM defines a conditional joint distribution

p(y \mid x_{\text{text}},\, x_{\text{vis}},\, x_{\text{aud}},\, \ldots)

where x_{*} are inputs from various modalities, typically aligned into a shared latent or embedding space, and y may correspond to a response in any target modality (Yin et al., 2023, Zhang et al., 2024, Wang et al., 2024, Han et al., 29 May 2025, Liang et al., 2024).

The motivation for mLLMs is grounded in the ubiquity of multimodal signals in human cognition and real-world data. mLLMs offer the following advantages:

  • Semantic grounding: Visual and acoustic cues provide a basis for language grounding, reducing the risk of unanchored generation (Wang et al., 2024).
  • Contextual enrichment: Modality fusion, e.g., adding video or audio, enhances contextual understanding and robustness (Carolan et al., 2024).
  • Cross-modal transfer: Shared embeddings support few-shot and zero-shot generalization to new tasks and domains (Han et al., 29 May 2025, Yin et al., 2023).
  • End-to-end integration: mLLMs consolidate pipelines previously requiring several task-specific models (e.g., ASR + image captioner + LLM) (Wang et al., 2024).

2. Architectural Paradigms and Core Fusion Mechanisms

The prevailing architecture for mLLMs follows an encoder–fusion–decoder pattern: modality-specific encoders map raw inputs to embeddings, a fusion module (e.g., a linear projector, Q-Former, or cross-attention layers) aligns those embeddings with the LLM's token space, and the LLM decoder generates the output sequence.
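As a concrete illustration, the pattern can be sketched in a few lines of NumPy. All shapes, the linear projector, and the variable names here are illustrative assumptions, not any specific model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_model, n_patches = 512, 768, 16

# 1. Encoder: a (frozen) vision encoder yields per-patch embeddings.
vis_embeds = rng.standard_normal((n_patches, d_vis))

# 2. Fusion: a learned linear projector maps them into the LLM token space
#    (the simplest connector; Q-Former or cross-attention are alternatives).
W_proj = rng.standard_normal((d_vis, d_model)) * 0.02
vis_tokens = vis_embeds @ W_proj              # (n_patches, d_model)

# 3. Decoder: visual tokens are prepended to text token embeddings, and the
#    LLM decodes autoregressively over the concatenated sequence.
text_tokens = rng.standard_normal((8, d_model))
llm_input = np.concatenate([vis_tokens, text_tokens], axis=0)
print(llm_input.shape)                        # (24, 768)
```

The key design choice is that only the connector (and optionally the LLM) is trained, while the modality encoder is typically kept frozen.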

The cross-modal attention operation is formally defined as:

\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where K, V may originate from image, audio, text, or fused modality embeddings (Liang et al., 2024, Fan et al., 3 Jun 2025).
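A minimal NumPy implementation of this operation, as an illustrative sketch (real models use batched, multi-head variants with learned projections):

```python
import numpy as np

def cross_attn(Q, K, V):
    """Scaled dot-product attention; K and V may come from another modality."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
text_q = rng.standard_normal((8, 64))    # queries from text tokens
img_kv = rng.standard_normal((16, 64))   # keys/values from image patches
out = cross_attn(text_q, img_kv, img_kv)
print(out.shape)                         # (8, 64)
```

When Q comes from text and K, V from image patches, each text token produces a convex combination of visual features, which is the core cross-modal fusion step.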

3. Training Paradigms, Objectives, and Optimization

The dominant training pipelines for mLLMs utilize multistage recipes:

  1. Self-supervised pretraining:
    • Contrastive alignment: Aligns paired image–text (or other modalities) in a joint embedding space using CLIP-style losses

    \mathcal{L}_{\mathrm{CLIP}} = -\mathbb{E}_{(x_t, x_i)}\left[\log \frac{\exp(\mathrm{sim}(f_t(x_t), f_i(x_i))/\tau)}{\sum_{x'_i} \exp(\mathrm{sim}(f_t(x_t), f_i(x'_i))/\tau)}\right]

    (Carolan et al., 2024, Han et al., 29 May 2025, Song et al., 2023, Zhang et al., 2024).
    • Generative alignment: Autoregressive loss on the next-token or target output, optionally conditioned on multimodal context

    \mathcal{L}_{\mathrm{LM}} = -\sum_t \log p(y_t \mid y_{<t}, E_{\mathrm{modality}}, E_{\mathrm{text}})

    (Han et al., 29 May 2025, Carolan et al., 2024).
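The contrastive objective above can be sketched as follows. This is a simplified symmetric InfoNCE; actual CLIP training uses a learned temperature, very large batches, and more careful numerical stabilization:

```python
import numpy as np

def clip_loss(text_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE (CLIP-style) loss over a batch of paired embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ i.T / tau                         # cosine similarity / temperature
    n = len(logits)
    diag = logits[np.arange(n), np.arange(n)]      # matched-pair similarities
    lse_rows = np.log(np.exp(logits).sum(axis=1))  # text -> image normalizer
    lse_cols = np.log(np.exp(logits).sum(axis=0))  # image -> text normalizer
    return ((lse_rows - diag).mean() + (lse_cols - diag).mean()) / 2

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((8, 64))
img_emb = text_emb + 0.1 * rng.standard_normal((8, 64))  # near-matched pairs
print(clip_loss(text_emb, img_emb) < clip_loss(text_emb, rng.standard_normal((8, 64))))
# True: matched pairs yield a lower loss than random pairings
```

The loss pulls each matched text–image pair together in the shared embedding space while pushing apart mismatched pairs within the batch.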

  2. Supervised and instruction tuning: fine-tuning on curated multimodal instruction–response pairs so the model follows open-ended, cross-modal prompts.

  3. Reinforcement learning from human feedback (RLHF): preference-based optimization of responses (e.g., reward modeling or direct preference optimization) to improve helpfulness and safety.

Parameter-efficient fine-tuning (PEFT) variants (e.g., LoRA, adapters, prefix-tuning) and retrieval-augmented generation (RAG) are frequently used to enable efficient adaptation or domain transfer (Carolan et al., 2024, Zhang et al., 2024, Zhang et al., 2024).
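As an illustration of the PEFT idea, a LoRA-style low-rank update can be sketched as below. This is a simplification; production libraries add dropout, per-module targeting, and weight merging:

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=16):
    """y = x W + (alpha/r) x A B; only A (d x r) and B (r x d') are trained."""
    r = A.shape[1]
    return x @ W_frozen + (alpha / r) * (x @ A @ B)

rng = np.random.default_rng(0)
d, r = 768, 8
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # small random init
B = np.zeros((r, d))                     # zero init: update starts at exactly zero
x = rng.standard_normal((4, d))

# Before training the adapter, the output equals the frozen model's output.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Because only the rank-r factors are updated, the trainable parameter count drops from d·d to 2·d·r, which is what makes per-domain or per-modality adaptation cheap.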

4. Functional Taxonomy and Coverage of Modalities

mLLMs are classified both by the supported modalities and by the direction of cross-modal mapping. A representative taxonomy includes:

| Input | Output | Examples | Challenges |
|-------|--------|----------|------------|
| Text | Text | GPT-4, LLaMA-2, InstructGPT | Context window constraints, hallucination |
| Text | Image | DALL·E 2/3, Stable Diffusion | Prompt compositionality, bias |
| Text | Audio | MusicLM, AudioLDM | Long-range structure, subjective quality |
| Text | Video | Lumiere, Sora, Make-A-Video | Temporal consistency, data scarcity |
| Text | 3D | DreamFusion, Shap-E, Magic3D | Multi-view geometry, rendering realism |
| Multi-input | Multi-output | NExT-GPT, ModaVerse | Generalization, modality interaction |

(Han et al., 29 May 2025, Zhang et al., 2024, Liang et al., 2024, Caffagni et al., 2024)

Recent models handle not just static image+text, but video, audio, human motion, point clouds, multimodal graphs, and complex document layouts (Wang et al., 2024, Fan et al., 3 Jun 2025, Zhang et al., 2024, Chen et al., 2024).

Hybrid methods leverage model composition, enabling the synthesis of new mLLMs from expert submodels without full retraining, e.g., via parameter decoupling/merging (DAMC) for arbitrary modality sets (Chen et al., 2024). Mixture-of-Experts (MoE), MoE diffusion, and modular adaptation further extend capacity and specialization (Han et al., 29 May 2025, Liang et al., 2024).
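A heavily simplified weight-space view of model composition can be sketched as follows. This is only a linear-interpolation sketch with hypothetical weights, not the actual DAMC decoupling/merging procedure:

```python
import numpy as np

def merge_experts(weights, coeffs):
    """Linearly interpolate matched parameter tensors from expert submodels."""
    return sum(c * w for c, w in zip(coeffs, weights))

# Hypothetical matched weight tensors from a vision expert and an audio expert.
w_vision = np.ones((4, 4)) * 2.0
w_audio = np.ones((4, 4)) * 4.0

merged = merge_experts([w_vision, w_audio], [0.5, 0.5])
print(merged[0, 0])  # 3.0
```

The appeal of such composition is that new modality combinations can be supported without a full multimodal retraining run, at the cost of potential interference between the merged experts.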

5. Empirical Benchmarks, Reasoning Capabilities, and Limitations

mLLMs are evaluated on a range of benchmarks:

  • Image Captioning: COCO Captions (CIDEr, BLEU), NoCaps, Flickr30k.
  • Visual Question Answering: VQA v2, OK-VQA, ScienceQA, GQA, MathVista.
  • Retrieval & Grounding: Recall@k, region mAP on RefCOCO, TextVQA, TextCaps, DocVQA.
  • Audio & Video: AudioCaps (WER, CIDEr), MSRVTT-QA (Acc), LibriSpeech (ASR), MELD/MOSEI (F1).
  • Reasoning: MMMU, MMBench, MM-VET, InfiMM/EmbodiedEval for rich multimodal reasoning (Wang et al., 2024, Wang et al., 2024, Fan et al., 3 Jun 2025).
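As one example of these metrics, retrieval Recall@k can be computed directly from a query–candidate similarity matrix. The scores below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to candidate j; ground truth is j == i."""
    top_k = np.argsort(-sim, axis=1)[:, :k]                     # k best candidates
    hits = (top_k == np.arange(len(sim))[:, None]).any(axis=1)  # truth in top k?
    return hits.mean()

sim = np.array([[0.9, 0.1, 0.2],   # correct candidate ranked 1st
                [0.3, 0.7, 0.8],   # correct candidate ranked 2nd
                [0.1, 0.7, 0.6]])  # correct candidate ranked 2nd
print(recall_at_k(sim, 1), recall_at_k(sim, 2))  # R@1 = 1/3, R@2 = 1.0
```

CIDEr, BLEU, and WER follow the same pattern of scoring model outputs against references, but with task-specific matching rules instead of a similarity argmax.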

Key findings:

  • State-of-the-art generic mLLMs (e.g., BLIP-2, LLaVA, MiniGPT-4) reach 75–85% on VQAv2 and COCO CIDEr ∼130–148, but complex cognitive tasks—reasoning about intent, emotion, or high-level communication—remain <70% even with instruction tuning (Zhang et al., 23 Apr 2025, Wang et al., 2024, Song et al., 2023).
  • On the HVSBench, all mLLMs lag far behind human-level alignment in visual saliency, subitizing, and free-viewing scanpath, exposing fundamental gaps in human-like perception (Lin et al., 2024).
  • Robustness to rare modalities, real-world data domain shift, and multi-hop chain-of-thought reasoning (visual, temporal, analogical, abduction) lags behind single-modality SOTA (Wang et al., 2024, Fan et al., 3 Jun 2025).

6. Applications, Personalization, and Embodiment

mLLMs equip agents for diverse domains:

  • Object detection and scene understanding in transportation, leveraging visual, thermal, and language signals in safety-critical contexts (Ashqar et al., 2024).
  • Scientific and biomedical image analysis, integrating structured omics, code, and microscopy for automation of semantic extraction and pipeline control (Zhang et al., 2024).
  • Personalized AI, using instruction, alignment, and fine-tuning frameworks to adapt representation and interaction to individual users, yielding improvements in recommendation, retrieval, and personalized generation (Wu et al., 2024, Ye et al., 2024).
  • Embodied multisensory reasoning, coupling mLLMs with representation modules for internal and external embodiment, facilitating physically grounded, prosocial, and homeostatic intelligent agents (Kadambi et al., 11 Oct 2025).
  • Real-time, multimodal dialog and accessibility: Automated alt-text, visual dialog, and AR/VR integration (Liang et al., 2024, Carolan et al., 2024).

7. Open Challenges and Future Research Directions

Active research topics include:

  • Robustness to domain shift, rare modalities, and adversarial inputs.
  • Hallucination mitigation and stronger semantic grounding.
  • Interpretability of cross-modal representations and fusion mechanisms.
  • Efficient training and inference (PEFT, MoE, model composition).
  • Unified any-to-any generation and human-aligned multimodal evaluation.

By aligning architectures, optimization, and evaluation with these axes, mLLMs are converging on the ultimate vision of general-purpose, adaptive, safe, and interpretable artificial intelligence able to understand, generate, and interact across the full spectrum of human modalities (Han et al., 29 May 2025, Liang et al., 2024, Wang et al., 2024, Song et al., 2023).
