Multimodal Large Language Models (mLLMs)
- Multimodal Large Language Models (mLLMs) are foundation models that fuse language models with modality-specific encoders to process text, images, audio, and more.
- They leverage cross-modal representation learning through early, intermediate, and late fusion mechanisms, aligning diverse signals into a shared embedding space.
- Applications range from visual dialogue and captioning to embodied reasoning, while challenges include robustness, interpretability, and efficient multimodal training.
Multimodal LLMs (mLLMs) are foundation models designed to process, align, and generate signals across multiple data modalities, including text, images, audio, video, and 3D structure. By extending the successes of text-only LLMs such as GPT, LLaMA, and Vicuna, mLLMs leverage cross-modal representation learning, reasoning, and large-scale pretraining. This unified approach enables the development of general-purpose, instruction-following agents with capabilities previously unattainable by single-modality systems, supporting a spectrum of applications from visual dialogue and captioning to cross-modal generation and embodied reasoning (Han et al., 29 May 2025, Carolan et al., 28 Mar 2024, Song et al., 2023).
1. Formal Definition, Scope, and Motivations
A Multimodal LLM (mLLM) is a neural architecture that fuses a pretrained LLM with modality-specific encoders (e.g., vision, audio) and corresponding fusion modules, training all or parts of the resulting model such that it can perform cross-modal understanding and generation. The canonical mLLM defines a conditional distribution

$$p_\theta\big(y \mid x_1, x_2, \ldots, x_M\big),$$

where $x_1, \ldots, x_M$ are inputs from various modalities, typically aligned into a shared latent or embedding space, and $y$ may correspond to a response in any target modality (Yin et al., 2023, Zhang et al., 24 Jan 2024, Wang et al., 2 Aug 2024, Han et al., 29 May 2025, Liang et al., 9 Nov 2024).
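For an autoregressive backbone, this conditional distribution is usually factorized token by token; the formulation below is an illustrative sketch of that common pattern rather than the definition used by any single cited model, with $f_m$ standing for the encoder (plus projector) of modality $m$ and $y_{<t}$ for previously generated tokens:

$$p_\theta\big(y \mid x_1, \ldots, x_M\big) = \prod_{t=1}^{T} p_\theta\big(y_t \mid y_{<t},\, f_1(x_1), \ldots, f_M(x_M)\big).$$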
The motivation for mLLMs is grounded in the ubiquity of multimodal signals in human cognition and real-world data. mLLMs offer the following advantages:
- Semantic grounding: Visual and acoustic cues provide a basis for language grounding, reducing the risk of unanchored generation (Wang et al., 2 Aug 2024).
- Contextual enrichment: Modality fusion, e.g., adding video or audio, enhances contextual understanding and robustness (Carolan et al., 28 Mar 2024).
- Cross-modal transfer: Shared embeddings support few-shot and zero-shot generalization to new tasks and domains (Han et al., 29 May 2025, Yin et al., 2023).
- End-to-end integration: mLLMs consolidate pipelines previously requiring several task-specific models (e.g., ASR + image captioner + LLM) (Wang et al., 2 Aug 2024).
2. Architectural Paradigms and Core Fusion Mechanisms
The prevailing architecture for mLLMs follows an encoder–fusion–decoder pattern:
- Modality encoders: Specialized neural networks (e.g., CLIP-ViT for images, HuBERT for audio, C-Former for video) map each input to feature vectors (Wang et al., 2 Aug 2024, Zhang et al., 24 Jan 2024, Fan et al., 3 Jun 2025).
- Projectors/connectors: Linear or MLP projections align encoder outputs into the LLM's embedding space (see the projector sketch after this list). Adapter-based variants insert mapping layers or LoRA modules at selected transformer blocks of the backbone (Carolan et al., 28 Mar 2024, Song et al., 2023, Chen et al., 20 Feb 2024).
- Multi-modal fusion: Principal mechanisms include
- Early fusion (concatenation of embeddings),
- Intermediate fusion (inserting cross-modal attention into transformer layers),
- Late fusion (separate streams combined at the classifier head),
- Q-Former/perceiver (learnable query tokens attending to modality-specific sequences via cross-attention) (Song et al., 2023, Wang et al., 2 Aug 2024, Fan et al., 3 Jun 2025).
- Backbone LLM: Autoregressive LLMs such as LLaMA, GPT-3/4, Vicuna, Mistral, or instruction-finetuned derivatives serve as the core for reasoning and generation (Carolan et al., 28 Mar 2024, Caffagni et al., 19 Feb 2024).
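As a concrete illustration of the projector/connector stage, the following is a minimal PyTorch sketch in the spirit of LLaVA-style MLP connectors. The module name, dimensions, and tensors are hypothetical stand-ins, assuming a frozen CLIP-ViT-style encoder producing patch features:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding space.
    Dimensions are illustrative (e.g., 1024-d ViT patch features -> 4096-d LLM embeddings)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from a frozen image encoder
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)

# Usage: prepend projected visual "tokens" to the text token embeddings (early fusion)
batch, num_patches, vision_dim, llm_dim = 2, 256, 1024, 4096
patch_feats = torch.randn(batch, num_patches, vision_dim)   # stand-in for encoder output
text_embeds = torch.randn(batch, 32, llm_dim)               # stand-in for LLM token embeddings

projector = VisionProjector(vision_dim, llm_dim)
visual_tokens = projector(patch_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed to the backbone LLM
print(llm_input.shape)  # torch.Size([2, 288, 4096])
```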
The cross-modal attention operation is formally defined as

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where the queries $Q$, keys $K$, and values $V$ may originate from image, audio, text, or fused modality embeddings (Liang et al., 9 Nov 2024, Fan et al., 3 Jun 2025).
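The sketch below implements this operation directly as a single-head cross-attention layer with queries from text hidden states and keys/values from image features. Dimensions and names are illustrative, not drawn from any specific cited model:

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Single-head cross-attention: softmax(Q K^T / sqrt(d_k)) V."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.d_k = d_model
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, d_model); image_feats: (batch, img_len, d_model)
        q = self.w_q(text_states)
        k = self.w_k(image_feats)
        v = self.w_v(image_feats)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # (batch, text_len, img_len)
        attn = scores.softmax(dim=-1)
        return attn @ v                                          # (batch, text_len, d_model)

attn = CrossModalAttention(d_model=512)
out = attn(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```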
3. Training Paradigms, Objectives, and Optimization
The dominant training pipelines for mLLMs utilize multistage recipes:
- Self-supervised pretraining:
- Contrastive alignment: Aligns paired image–text (or other modality) samples in a joint embedding space using CLIP-style contrastive losses (Carolan et al., 28 Mar 2024, Han et al., 29 May 2025, Song et al., 2023, Zhang et al., 24 Jan 2024); a minimal loss sketch appears after this list.
- Generative alignment: Autoregressive loss on the next token or target output, optionally conditioned on multimodal context.
- Supervised and instruction tuning:
- Training on synthetic or human-curated multimodal instructions and dialogs (e.g., LLaVA-Instruct, LVIS, ShareGPT, LRV-Instructions).
- Chain-of-thought prompting is used for compositional reasoning (Yin et al., 2023, Wang et al., 10 Jan 2024, Han et al., 29 May 2025).
- Reinforcement learning from human feedback (RLHF):
- Reward models are trained with human or domain-specific preference data; alignment objectives refine preference-consistent, safe, and faithful outputs (Han et al., 29 May 2025, Carolan et al., 28 Mar 2024).
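Referring to the contrastive-alignment bullet above, the following is a minimal sketch of a symmetric InfoNCE loss of the kind used in CLIP-style pretraining. Variable names and the batch construction are hypothetical; matched image–text pairs are assumed to share the same index:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matched image-text pairs lie on the diagonal of the
    similarity matrix (positives); all other pairs in the batch act as negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (batch, batch) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```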
Parameter-efficient fine-tuning (PEFT) variants (e.g., LoRA, adapters, prefix-tuning) and retrieval-augmented generation (RAG) are frequently used to enable efficient adaptation or domain transfer (Carolan et al., 28 Mar 2024, Zhang et al., 29 Jul 2024, Zhang et al., 24 Jan 2024).
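As one example of the PEFT variants mentioned above, the sketch below shows the LoRA idea from scratch (a frozen linear layer plus a trainable low-rank update scaled by alpha/r). It is an illustration under assumed hyperparameters, not the API of any particular PEFT library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (B @ A), scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)    # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)   # up-projection B
        nn.init.zeros_(self.lora_b.weight)                          # update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])
```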
4. Functional Taxonomy and Coverage of Modalities
mLLMs are classified both by the supported modalities and by the direction of cross-modal mapping. A representative taxonomy includes:
| Input | Output | Examples | Challenges |
|---|---|---|---|
| Text | Text | GPT-4, LLaMA-2, InstructGPT | Context window constraints, hallucination |
| Text | Image | DALL·E 2/3, Stable Diffusion | Prompt compositionality, bias |
| Text | Audio | MusicLM, AudioLDM | Long-range structure, subjective quality |
| Text | Video | Lumiere, Sora, Make-A-Video | Temporal consistency, data scarcity |
| Text | 3D | DreamFusion, Shap-E, Magic3D | Multi-view geometry, rendering realism |
| Multi-input | Multi-output | NExT-GPT, ModaVerse | Generalization, modality interaction |
(Han et al., 29 May 2025, Zhang et al., 24 Jan 2024, Liang et al., 9 Nov 2024, Caffagni et al., 19 Feb 2024)
Recent models handle not just static image+text, but video, audio, human motion, point clouds, multimodal graphs, and complex document layouts (Wang et al., 2 Aug 2024, Fan et al., 3 Jun 2025, Zhang et al., 29 Jul 2024, Chen et al., 20 Feb 2024).
Hybrid methods leverage model composition, enabling the synthesis of new mLLMs from expert submodels without full retraining, e.g., via parameter decoupling/merging (DAMC) for arbitrary modality sets (Chen et al., 20 Feb 2024). Mixture-of-Experts (MoE), MoE diffusion, and modular adaptation further extend capacity and specialization (Han et al., 29 May 2025, Liang et al., 9 Nov 2024).
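As a toy illustration of weight-space composition, the sketch below linearly merges two expert checkpoints with identical architectures. This is a generic weighted merge under that assumption, not the DAMC decoupling/merging procedure itself:

```python
import torch
import torch.nn as nn

def merge_state_dicts(expert_a: dict, expert_b: dict, weight_a: float = 0.5) -> dict:
    """Linearly interpolate two compatible state dicts, parameter by parameter."""
    merged = {}
    for name, param_a in expert_a.items():
        param_b = expert_b[name]
        merged[name] = weight_a * param_a + (1.0 - weight_a) * param_b
    return merged

# Toy experts with identical architectures (e.g., an audio-tuned and a vision-tuned module)
model_a, model_b = nn.Linear(16, 16), nn.Linear(16, 16)
merged_model = nn.Linear(16, 16)
merged_model.load_state_dict(merge_state_dicts(model_a.state_dict(), model_b.state_dict()))
print(merged_model.weight.shape)  # torch.Size([16, 16])
```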
5. Empirical Benchmarks, Reasoning Capabilities, and Limitations
mLLMs are evaluated on a range of benchmarks:
- Image Captioning: COCO Captions (CIDEr, BLEU), NoCaps, Flickr30k.
- Visual Question Answering: VQA v2, OK-VQA, ScienceQA, GQA, MathVista.
- Retrieval & Grounding: Recall@k (see the sketch after this list), region mAP on RefCOCO, TextVQA, TextCaps, DocVQA.
- Audio & Video: AudioCaps (CIDEr), MSRVTT-QA (accuracy), LibriSpeech ASR (WER), MELD/MOSEI (F1).
- Reasoning: MMMU, MMBench, MM-VET, InfiMM/EmbodiedEval for rich multimodal reasoning (Wang et al., 2 Aug 2024, Wang et al., 10 Jan 2024, Fan et al., 3 Jun 2025).
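For the retrieval metrics above, the following sketch computes Recall@k from a query–candidate similarity matrix. Variable names are hypothetical, and it assumes the ground-truth match for query i is candidate i:

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int = 5) -> float:
    """similarity: (num_queries, num_candidates); ground-truth match for query i is candidate i."""
    topk = similarity.topk(k, dim=-1).indices                      # (num_queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(-1)       # (num_queries, 1)
    hits = (topk == targets).any(dim=-1).float()                   # 1 if the match is in the top k
    return hits.mean().item()

sim = torch.randn(100, 100)  # stand-in for text-to-image similarity scores
print(f"Recall@5: {recall_at_k(sim, k=5):.3f}")
```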
Key findings:
- State-of-the-art generic mLLMs (e.g., BLIP-2, LLaVA, MiniGPT-4) reach 75–85% accuracy on VQAv2 and COCO CIDEr scores of roughly 130–148, but performance on complex cognitive tasks (reasoning about intent, emotion, or high-level communication) remains below 70% even with instruction tuning (Zhang et al., 23 Apr 2025, Wang et al., 2 Aug 2024, Song et al., 2023).
- On HVSBench, all mLLMs lag far behind human-level alignment in visual saliency, subitizing, and free-viewing scanpath prediction, exposing fundamental gaps in human-like perception (Lin et al., 12 Dec 2024).
- Robustness to rare modalities, real-world domain shift, and multi-hop chain-of-thought reasoning (visual, temporal, analogical, abductive) lags behind single-modality state of the art (Wang et al., 10 Jan 2024, Fan et al., 3 Jun 2025).
6. Applications, Personalization, and Embodiment
mLLMs equip agents for diverse domains:
- Object detection and scene understanding in transportation, leveraging visual, thermal, and language signals in safety-critical contexts (Ashqar et al., 26 Sep 2024).
- Scientific and biomedical image analysis, integrating structured omics, code, and microscopy for automation of semantic extraction and pipeline control (Zhang et al., 29 Jul 2024).
- Personalized AI, using instruction, alignment, and fine-tuning frameworks to adapt representation and interaction to individual users, yielding improvements in recommendation, retrieval, and personalized generation (Wu et al., 3 Dec 2024, Ye et al., 19 Aug 2024).
- Embodied multisensory reasoning, coupling mLLMs with representation modules for internal and external embodiment, facilitating physically grounded, prosocial, and homeostatic intelligent agents (Kadambi et al., 11 Oct 2025).
- Real-time, multimodal dialog and accessibility: Automated alt-text, visual dialog, and AR/VR integration (Liang et al., 9 Nov 2024, Carolan et al., 28 Mar 2024).
7. Open Challenges and Future Research Directions
Active research topics include:
- Unified representation and fusion: Generalizing beyond Q-Formers and modality-specific encoders to models natively supporting arbitrary semantic modalities and data types (Song et al., 2023, Han et al., 29 May 2025).
- Robustness and interpretability: Addressing multimodal hallucination, compositionality gaps, explainability of cross-modal decisions, and rigorous calibration of model confidence (Wang et al., 2 Aug 2024, Wang et al., 10 Jan 2024, Lin et al., 12 Dec 2024).
- Efficiency and continual adaptation: LoRA, PEFT, model compression, and modular architectures targeting energy- and memory-efficient inference (Zhang et al., 24 Jan 2024, Liang et al., 9 Nov 2024).
- Benchmarks and principled evaluation: Need for richer, more unified multitask and cross-modal reasoning benchmarks (e.g., InfiMM-Eval, HVSBench, MCUB), with robust open-ended, explainable, and adversarial scenarios (Zhang et al., 23 Apr 2025, Lin et al., 12 Dec 2024, Chen et al., 20 Feb 2024).
- Ethics, privacy, and safety: Mitigating biases, out-of-distribution risks, privacy exposures from pretraining, and deepfake proliferation, with frameworks for certification and regulatory compliance (Liang et al., 9 Nov 2024, Carolan et al., 28 Mar 2024).
- Embodied and prosocial AI: Integrating explicit internal state, homeostasis, and sensorimotor feedback loops for safer, more human-aligned global agents (Kadambi et al., 11 Oct 2025).
By aligning architectures, optimization, and evaluation with these axes, mLLMs are converging on the ultimate vision of general-purpose, adaptive, safe, and interpretable artificial intelligence able to understand, generate, and interact across the full spectrum of human modalities (Han et al., 29 May 2025, Liang et al., 9 Nov 2024, Wang et al., 10 Jan 2024, Song et al., 2023).