Multimodal Large Language Models (mLLMs)

Updated 18 December 2025
  • Multimodal Large Language Models (mLLMs) are foundation models that fuse language models with modality-specific encoders to process text, images, audio, and more.
  • They leverage cross-modal representation learning through early, intermediate, and late fusion mechanisms, aligning diverse signals into a shared embedding space.
  • Applications range from visual dialogue and captioning to embodied reasoning, while challenges include robustness, interpretability, and efficient multimodal training.

Multimodal LLMs (mLLMs) are foundation models designed to process, align, and generate signals across multiple data modalities, including text, images, audio, video, and 3D structure. By extending the successes of text-only LLMs such as GPT, LLaMA, and Vicuna, mLLMs leverage cross-modal representation learning, reasoning, and large-scale pretraining. This unified approach enables the development of general-purpose, instruction-following agents with capabilities previously unattainable by single-modality systems, supporting a spectrum of applications from visual dialogue and captioning to cross-modal generation and embodied reasoning (Han et al., 29 May 2025, Carolan et al., 28 Mar 2024, Song et al., 2023).

1. Formal Definition, Scope, and Motivations

A Multimodal LLM (mLLM) is a neural architecture that fuses a pretrained LLM with modality-specific encoders (e.g., vision, audio) and corresponding fusion modules, training all or parts of the resulting model such that it can perform cross-modal understanding and generation. The canonical mLLM defines a conditional joint distribution

p(y \mid x_{\text{text}},\, x_{\text{vis}},\, x_{\text{aud}},\, \ldots)

where x_* are inputs from various modalities, typically aligned into a shared latent or embedding space, and y may correspond to a response in any target modality (Yin et al., 2023, Zhang et al., 24 Jan 2024, Wang et al., 2 Aug 2024, Han et al., 29 May 2025, Liang et al., 9 Nov 2024).
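The following is a minimal, self-contained PyTorch sketch of this conditional formulation: modality features are projected into the LLM's embedding space, concatenated with text token embeddings, and decoded into text logits. The class name, dimensions, toy transformer stack, and the stand-in linear "encoders" are illustrative assumptions, not any specific published architecture.

```python
import torch
import torch.nn as nn

class MiniMLLM(nn.Module):
    """Toy sketch of p(y | x_text, x_vis, x_aud): project non-text features
    into the LLM embedding space, concatenate with text tokens, decode."""
    def __init__(self, vocab_size=32000, d_model=512, d_vis=768, d_aud=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vis_proj = nn.Linear(d_vis, d_model)   # vision features -> shared space
        self.aud_proj = nn.Linear(d_aud, d_model)   # audio features -> shared space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # causal mask omitted for brevity
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vis_feats, aud_feats):
        # Align all modalities into one shared token sequence.
        tokens = torch.cat([
            self.vis_proj(vis_feats),    # (B, N_vis, d_model)
            self.aud_proj(aud_feats),    # (B, N_aud, d_model)
            self.text_embed(text_ids),   # (B, N_txt, d_model)
        ], dim=1)
        hidden = self.decoder(tokens)
        return self.lm_head(hidden)      # logits over the text vocabulary

model = MiniMLLM()
logits = model(torch.randint(0, 32000, (2, 16)),  # text token ids
               torch.randn(2, 49, 768),           # precomputed vision patch features
               torch.randn(2, 10, 128))           # precomputed audio frame features
print(logits.shape)  # torch.Size([2, 75, 32000])
```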

The motivation for mLLMs is grounded in the ubiquity of multimodal signals in human cognition and real-world data. mLLMs offer the following advantages:

  • Semantic grounding: Visual and acoustic cues provide a basis for language grounding, reducing the risk of unanchored generation (Wang et al., 2 Aug 2024).
  • Contextual enrichment: Modality fusion, e.g., adding video or audio, enhances contextual understanding and robustness (Carolan et al., 28 Mar 2024).
  • Cross-modal transfer: Shared embeddings support few-shot and zero-shot generalization to new tasks and domains (Han et al., 29 May 2025, Yin et al., 2023).
  • End-to-end integration: mLLMs consolidate pipelines previously requiring several task-specific models (e.g., ASR + image captioner + LLM) (Wang et al., 2 Aug 2024).

2. Architectural Paradigms and Core Fusion Mechanisms

The prevailing architecture for mLLMs follows an encoder–fusion–decoder pattern: modality-specific encoders (e.g., vision or audio backbones) produce embeddings, a fusion module aligns them with the LLM's token space through early, intermediate, or late fusion, and the LLM decoder generates the response.

The cross-modal attention operation is formally defined as:

\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where K and V may originate from image, audio, text, or fused modality embeddings (Liang et al., 9 Nov 2024, Fan et al., 3 Jun 2025).
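A minimal sketch of this cross-modal attention operation, with queries taken from text hidden states and keys/values from vision patch embeddings; the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q, k, v):
    """Scaled dot-product attention as in the formula above.
    q: (B, N_q, d_k) text-side queries; k, v: (B, N_kv, d_k) keys/values
    drawn from image, audio, text, or fused modality embeddings."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (B, N_q, N_kv)
    weights = F.softmax(scores, dim=-1)            # attention over modality tokens
    return weights @ v                             # (B, N_q, d_k)

text_q = torch.randn(2, 16, 64)   # queries from text hidden states
img_kv = torch.randn(2, 49, 64)   # keys/values from vision patch embeddings
out = cross_modal_attention(text_q, img_kv, img_kv)
print(out.shape)  # torch.Size([2, 16, 64])
```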

3. Training Paradigms, Objectives, and Optimization

The dominant training pipelines for mLLMs utilize multistage recipes:

  1. Self-supervised pretraining:
    • Contrastive alignment: Aligns paired image–text (or other modalities) in a joint embedding space using CLIP-style losses

    \mathcal{L}_{\mathrm{CLIP}} = -\mathbb{E}_{(x_t,x_i)}\left[\log\frac{\exp(\mathrm{sim}(f_t(x_t), f_i(x_i))/\tau)}{\sum_{x'_i}\exp(\mathrm{sim}(f_t(x_t), f_i(x'_i))/\tau)}\right]

    (Carolan et al., 28 Mar 2024, Han et al., 29 May 2025, Song et al., 2023, Zhang et al., 24 Jan 2024).
    • Generative alignment: Autoregressive loss on the next token or target output, optionally conditioned on multimodal context (see the loss sketch after this list)

    \mathcal{L}_{\mathrm{LM}} = -\sum_t \log p(y_t \mid y_{<t}, E_{\mathrm{modality}}, E_{\mathrm{text}})

    (Han et al., 29 May 2025, Carolan et al., 28 Mar 2024).

  2. Supervised and instruction tuning: fine-tuning on curated multimodal instruction–response data so the model follows prompts across tasks and modalities.

  3. Reinforcement learning from human feedback (RLHF): preference-based alignment that optimizes outputs against a learned reward model reflecting human judgments of helpfulness and safety.
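As a concrete reference for the two pretraining objectives above, the sketch below implements a symmetric CLIP-style contrastive loss and a standard autoregressive language-modeling loss in PyTorch. Function names and tensor shapes are illustrative assumptions; the contrastive version shown is the symmetric (text-to-image and image-to-text) variant commonly used in practice.

```python
import torch
import torch.nn.functional as F

def clip_loss(text_emb, image_emb, tau=0.07):
    """Symmetric InfoNCE / CLIP-style loss over a batch of paired
    text and image embeddings of shape (B, d); row i of each tensor
    is a matching pair, so targets lie on the diagonal."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def lm_loss(logits, target_ids):
    """Autoregressive loss: logits (B, T, V) predicted given the
    multimodal prefix, scored against next-token targets (B, T)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))

print(clip_loss(torch.randn(8, 256), torch.randn(8, 256)))
print(lm_loss(torch.randn(8, 16, 1000), torch.randint(0, 1000, (8, 16))))
```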

Parameter-efficient fine-tuning (PEFT) variants (e.g., LoRA, adapters, prefix-tuning) and retrieval-augmented generation (RAG) are frequently used to enable efficient adaptation or domain transfer (Carolan et al., 28 Mar 2024, Zhang et al., 29 Jul 2024, Zhang et al., 24 Jan 2024).
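To illustrate the PEFT idea, here is a minimal LoRA-style adapter layer in PyTorch: the pretrained weight is frozen and only a low-rank update is trained. The rank, scaling, and initialization choices are illustrative, not those of any particular mLLM recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: freeze the pretrained linear layer and train
    a rank-r update B @ A, adding only r * (d_in + d_out) parameters."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base output plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8192 trainable
```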

4. Functional Taxonomy and Coverage of Modalities

mLLMs are classified both by the supported modalities and by the direction of cross-modal mapping. A representative taxonomy includes:

| Input | Output | Examples | Challenges |
| --- | --- | --- | --- |
| Text | Text | GPT-4, LLaMA-2, InstructGPT | Context window constraints, hallucination |
| Text | Image | DALL·E 2/3, Stable Diffusion | Prompt compositionality, bias |
| Text | Audio | MusicLM, AudioLDM | Long-range structure, subjective quality |
| Text | Video | Lumiere, Sora, Make-A-Video | Temporal consistency, data scarcity |
| Text | 3D | DreamFusion, Shap-E, Magic3D | Multi-view geometry, rendering realism |
| Multi-input | Multi-output | NExT-GPT, ModaVerse | Generalization, modality interaction |

(Han et al., 29 May 2025, Zhang et al., 24 Jan 2024, Liang et al., 9 Nov 2024, Caffagni et al., 19 Feb 2024)

Recent models handle not just static image+text, but video, audio, human motion, point clouds, multimodal graphs, and complex document layouts (Wang et al., 2 Aug 2024, Fan et al., 3 Jun 2025, Zhang et al., 29 Jul 2024, Chen et al., 20 Feb 2024).

Hybrid methods leverage model composition, enabling the synthesis of new mLLMs from expert submodels without full retraining, e.g., via parameter decoupling/merging (DAMC) for arbitrary modality sets (Chen et al., 20 Feb 2024). Mixture-of-Experts (MoE), MoE diffusion, and modular adaptation further extend capacity and specialization (Han et al., 29 May 2025, Liang et al., 9 Nov 2024).
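To make the composition idea concrete, the sketch below shows naive parameter averaging of architecturally identical expert modules. This is a generic illustration of weight merging under that assumption, not the actual DAMC decoupling/merging procedure.

```python
import torch
import torch.nn as nn

def merge_state_dicts(models, weights=None):
    """Average corresponding parameters of expert models that share an
    architecture; `weights` optionally sets per-expert mixing coefficients."""
    weights = weights or [1.0 / len(models)] * len(models)
    merged = {k: torch.zeros_like(v) for k, v in models[0].state_dict().items()}
    for w, m in zip(weights, models):
        for k, v in m.state_dict().items():
            merged[k] += w * v
    return merged

experts = [nn.Linear(4, 4) for _ in range(2)]   # stand-ins for modality experts
fused = nn.Linear(4, 4)
fused.load_state_dict(merge_state_dicts(experts))
```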

5. Empirical Benchmarks, Reasoning Capabilities, and Limitations

mLLMs are evaluated on a range of benchmarks:

  • Image Captioning: COCO Captions (CIDEr, BLEU), NoCaps, Flickr30k.
  • Visual Question Answering: VQA v2, OK-VQA, ScienceQA, GQA, MathVista.
  • Retrieval & Grounding: Recall@k (a minimal computation is sketched after this list), region mAP on RefCOCO, TextVQA, TextCaps, DocVQA.
  • Audio & Video: AudioCaps (WER, CIDEr), MSRVTT-QA (Acc), LibriSpeech (ASR), MELD/MOSEI (F1).
  • Reasoning: MMMU, MMBench, MM-VET, InfiMM/EmbodiedEval for rich multimodal reasoning (Wang et al., 2 Aug 2024, Wang et al., 10 Jan 2024, Fan et al., 3 Jun 2025).
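As an example of the retrieval metrics above, a minimal Recall@k computation over a toy query-by-gallery similarity matrix, assuming the ground-truth match for query i is gallery item i:

```python
import numpy as np

def recall_at_k(sim, k=5):
    """sim[i, j] is the similarity between query i (e.g., a caption) and
    gallery item j (e.g., an image); the correct match for query i is item i."""
    topk = np.argsort(-sim, axis=1)[:, :k]                       # top-k gallery indices per query
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

sim = np.random.randn(100, 100)   # toy text-to-image similarity matrix
print(f"R@5 = {recall_at_k(sim, k=5):.3f}")
```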

Key findings:

6. Applications, Personalization, and Embodiment

mLLMs equip agents for diverse domains:

  • Object detection and scene understanding in transportation, leveraging visual, thermal, and language signals in safety-critical contexts (Ashqar et al., 26 Sep 2024).
  • Scientific and biomedical image analysis, integrating structured omics, code, and microscopy for automation of semantic extraction and pipeline control (Zhang et al., 29 Jul 2024).
  • Personalized AI, using instruction, alignment, and fine-tuning frameworks to adapt representation and interaction to individual users, yielding improvements in recommendation, retrieval, and personalized generation (Wu et al., 3 Dec 2024, Ye et al., 19 Aug 2024).
  • Embodied multisensory reasoning, coupling mLLMs with representation modules for internal and external embodiment, facilitating physically grounded, prosocial, and homeostatic intelligent agents (Kadambi et al., 11 Oct 2025).
  • Real-time, multimodal dialog and accessibility: Automated alt-text, visual dialog, and AR/VR integration (Liang et al., 9 Nov 2024, Carolan et al., 28 Mar 2024).

7. Open Challenges and Future Research Directions

Active research topics include:

  • Robustness and hallucination mitigation in cross-modal generation.
  • Interpretability of fused multimodal representations.
  • Efficient multimodal training, adaptation, and inference.
  • Comprehensive evaluation of cross-modal reasoning.

By aligning architectures, optimization, and evaluation with these axes, mLLMs are converging on the ultimate vision of general-purpose, adaptive, safe, and interpretable artificial intelligence able to understand, generate, and interact across the full spectrum of human modalities (Han et al., 29 May 2025, Liang et al., 9 Nov 2024, Wang et al., 10 Jan 2024, Song et al., 2023).
