
Multimodal Large Language Models

Updated 16 November 2025
  • Multimodal Large Language Models are neural architectures that integrate diverse data modalities using specialized encoders and fusion modules.
  • They employ techniques such as early fusion, cross-attention, and Q-Former mechanisms to align and combine text, image, audio, and video inputs effectively.
  • Applications span visual question answering, document understanding, and generative tasks, while ongoing research addresses challenges in data quality, interpretability, and scalability.

Multimodal LLMs (MLLMs) are neural architectures that extend traditional LLMs by incorporating input and output from multiple data modalities—primarily text, images, audio, and video—and explicitly model their joint or conditional relationships. MLLMs unite the natural language capabilities of LLMs with perception modules, projection bridges, and parameter-efficient adaptation layers, thereby enabling cross-modal reasoning, perception, and generation within a unified computational framework. Their development has fueled significant advances in visual question answering, vision-language navigation, document understanding, multimodal search, assistive agents, and generative systems producing images, music, video, 3D objects, and beyond.

1. Architectural Foundations and Modal Fusion

At the core of MLLMs is a modular composition of modality-specific encoders (e.g., a Vision Transformer for images, HuBERT/Whisper for audio, text token embedding layers), fusion or alignment modules (such as learned projections, Q-Formers, or cross-attention blocks), and a (typically frozen) autoregressive LLM backbone (Wu et al., 2023, Liang et al., 9 Nov 2024, Carolan et al., 28 Mar 2024, Zhang et al., 24 Jan 2024). The dominant strategies for integrating multi-modal inputs involve:

  • Early Fusion (Single-stream): Concatenate text tokens and projected visual/auditory tokens into a joint sequence, processed by the LLM's native self-attention mechanism. This approach is used in models such as LLaVA (Carolan et al., 28 Mar 2024), where projected image patches are prepended to the LLM input and processed jointly.
  • Late Fusion (Dual-Stream): Separate towers for each modality; cross-modal information is exchanged only at the prediction or decision layer, e.g., via dot-product similarity or shallow cross-attention.
  • Hybrid and Query-based Fusion: BLIP-2 introduces the Q-Former, a lightweight cross-attention transformer with learnable queries, to “compress” high-dimensional visual features into a fixed-length token set that can be mapped into the LLM embedding space (Song et al., 2023, Tran et al., 11 Apr 2024). This Q-Former motif is widely reused across strong MLLMs; a minimal sketch appears at the end of this section.
  • Cross-modal Attention: Models such as Flamingo or GPT-4V insert cross-attention blocks at each LLM layer, where the textual hidden states attend over projected modality representations via

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

(Liang et al., 9 Nov 2024, Wu et al., 2023).
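
As a concrete illustration of this pattern, the following is a minimal PyTorch sketch in which textual hidden states attend over projected visual features. It is a generic sketch rather than the implementation of Flamingo, GPT-4V, or any cited model; the module name, dimensions, and the plain residual/feed-forward wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Minimal sketch of a cross-attention fusion block: textual hidden states
    (queries) attend over projected visual tokens (keys/values), i.e.
    softmax(Q K^T / sqrt(d)) V, followed by a residual feed-forward step."""

    def __init__(self, d_model: int = 768, d_visual: int = 1024, n_heads: int = 8):
        super().__init__()
        # Project visual features (e.g., ViT patch embeddings) to the LLM width.
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text_hidden: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (B, T_text, d_model); visual_feats: (B, T_vis, d_visual)
        vis = self.visual_proj(visual_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=vis, value=vis)
        x = self.norm(text_hidden + attended)  # residual + norm
        return x + self.ffn(x)                 # residual feed-forward


# Usage sketch: 16 text tokens attend over 196 (14x14) image patch tokens.
block = CrossModalAttentionBlock()
fused = block(torch.randn(2, 16, 768), torch.randn(2, 196, 1024))  # -> (2, 16, 768)
```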

Variants such as the Visual Merger Module (used in MammothModa (She et al., 26 Jun 2024)) reduce computational burden by spatially pooling or token-pruning the vision features before fusion. Shared or per-frame position encodings address scalability for high-resolution or video inputs.
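
The Q-Former motif referenced above, and the token-reduction goal of visual-merger modules, can both be pictured with a short sketch: a small set of learnable queries cross-attends to the many visual patch tokens and yields a fixed-length set of tokens projected into the LLM embedding space. This is a schematic approximation under assumed dimensions and a single attention layer, not BLIP-2's or MammothModa's actual code.

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """Sketch of a Q-Former-style bridge: a fixed number of learnable queries
    cross-attend to variable-length visual features and are then projected
    into the LLM embedding space as a fixed-length 'visual prompt'."""

    def __init__(self, n_queries: int = 32, d_visual: int = 1024,
                 d_model: int = 768, d_llm: int = 4096, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_llm = nn.Linear(d_model, d_llm)  # map into the LLM embedding space

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, T_vis, d_visual) -> (B, n_queries, d_llm)
        batch = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        kv = self.visual_proj(visual_feats)
        out, _ = self.attn(query=q, key=kv, value=kv)
        return self.to_llm(out)


# A visual-merger-style alternative would instead pool neighbouring patch
# tokens (e.g., average 2x2 windows) to cut the token count before fusion.
compressor = QueryCompressor()
visual_prompt = compressor(torch.randn(2, 196, 1024))  # -> (2, 32, 4096)
```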

2. Alignment, Training Paradigms, and Objective Functions

The core challenge in MLLM design is modality alignment: bridging the representational gap between discrete, variable-length text tokens and continuous visual/audio features. The principal alignment modules and training strategies are:

  • Linear/MLP Projection: Mapping modality features into the LLM's embedding space with minimal parameter count. Typical in LLaVA and LaVy (Tran et al., 11 Apr 2024).
  • Cross-Attention / Q-Former: Cross-attend learnable queries to modality features (BLIP-2 (Wu et al., 2023), MiniGPT-4).
  • Adapters and LoRA: Parameter-efficient fine-tuning (PEFT) inserts adapters or low-rank matrix updates into the LLM backbone to support new modalities without catastrophic forgetting (Zhang et al., 24 Jan 2024, Carolan et al., 28 Mar 2024); a minimal LoRA sketch appears at the end of this section.
  • Contrastive Pre-training: Modality alignment is often supervised via contrastive InfoNCE or similar loss functions, e.g.,

$$\mathcal{L}_\text{contrast} = - \sum_{i=1}^N \log \frac{\exp(\mathrm{sim}(x_i, y_i)/\tau)}{\sum_{j=1}^N \exp(\mathrm{sim}(x_i, y_j)/\tau)}$$

forcing semantically matched modality pairs close together in a shared latent space (Carolan et al., 28 Mar 2024, Wu et al., 2023); a code sketch of this loss appears after this list.

  • Auto-regressive and Generative Losses: For text (or image, audio) generation, language modeling cross-entropy is applied to predict tokens conditional on multimodal context.
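
The contrastive objective above translates almost directly into code. The sketch below is a generic InfoNCE implementation over a batch of paired embeddings (e.g., image and text), assuming L2-normalized features and a fixed temperature; it is not the exact loss of any cited model.

```python
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE: matched pairs (x_i, y_i) are pulled together and
    mismatched pairs (x_i, y_j) pushed apart in the shared latent space."""
    x = F.normalize(x, dim=-1)            # (N, d), e.g. image embeddings
    y = F.normalize(y, dim=-1)            # (N, d), e.g. text embeddings
    logits = x @ y.t() / temperature      # sim(x_i, y_j) / tau for all pairs
    targets = torch.arange(x.size(0), device=x.device)
    # Row-wise cross-entropy equals -log softmax of each positive pair.
    return F.cross_entropy(logits, targets)

# Symmetric (CLIP-style) training averages both directions:
# loss = 0.5 * (info_nce(img_emb, txt_emb) + info_nce(txt_emb, img_emb))
```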

Multi-stage training pipelines are now standard: an initial cross-modal alignment stage on large paired datasets (web-scraped CC3M, LAION, DataComp, etc.) is followed by curated multimodal instruction fine-tuning, which may incorporate human- or GPT-generated dialogues, OCR-rich image instructions, or domain-specialized data (Caffagni et al., 19 Feb 2024, Wu et al., 2023).

  • Instruction Tuning and RLHF: Alignment and SFT losses can be refined with reward modeling and RLHF, which are well established in text-only LLMs and increasingly applied to multimodal reward modeling and preference optimization (ImageReward, VideoReward, DPO) (Han et al., 29 May 2025).
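
To make the parameter-efficient adaptation from the list above concrete, here is a minimal LoRA-style wrapper: the frozen base weight W is augmented with a trainable low-rank update BA scaled by alpha/r, and only A and B receive gradients. This is a generic sketch, not the peft library's API or any specific MLLM's recipe; the rank, scaling, and initialization choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with only A and B updated."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep the backbone frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


# Wrapping, say, the attention projections of a frozen LLM with LoRALinear keeps
# the trainable parameters to a small fraction (often <2%) of the full model.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 16, 4096))  # -> (2, 16, 4096)
```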

3. Modalities, Tasks, and Representative Models

MLLMs now routinely span text, images, audio, video, 3D point clouds, motion, and even physiological data (Wang et al., 2 Aug 2024). Key model exemplars and capabilities:

  • Text–Image: VQA (VQA-v2, OKVQA), captioning (COCO, scored with CIDEr), retrieval (Recall@K), and grounding (RefCOCO) (Song et al., 2023, Carolan et al., 28 Mar 2024). Representative models: LLaVA, BLIP-2, MiniGPT-4, Qwen-VL.
  • Text–Video: Video-LLaVA, X-InstructBLIP, Valley; benchmarks: MSR-VTT, WebVid2M, VideoInstruct100K (Wang et al., 2 Aug 2024).
  • Text–Audio: Qwen-Audio, SALMONN, SpeechGPT; dataset examples: AudioCaps, LibriSpeech (Wang et al., 2 Aug 2024).
  • 3D/Point Cloud: 3D-LLM, PointLLM, HEGALE; tasks include object classification, retrieval, and VQA (Chen et al., 20 Feb 2024).
  • Motion and Music: MotionGPT-2, MusicGen, Jukebox (Han et al., 29 May 2025).
  • Graphs: MLaGA extends MLLM capabilities to reasoning on multimodal attributed graphs through structure-aware encoders and instruction-coded prompting (Fan et al., 3 Jun 2025).

Prominent architectures are summarized in the table below.

| Model | Modalities | Alignment Module | Key Tasks |
|---|---|---|---|
| LLaVA | T, I | Linear projection | VQA, captioning, dialogue |
| BLIP-2 | T, I | Q-Former | VQA, captioning, retrieval |
| SALMONN | T, A | Q-Former | Audio QA, captioning |
| 3D-LLM | T, P | Q-Former | 3D object QA |
| Video-LLaVA | T, I, V | Q-Former | Video QA |
| MLaGA | T, I, G | Cross-attention, structure-aware encoders | Graph QA, link prediction |

T: text, I: image, A: audio, V: video, P: point-cloud, G: graphs.

4. Scaling Laws, Efficiency, and Model Composition

Scaling MLLMs necessitates attention to compute efficiency, memory footprint, and extensibility across modalities.

  • Compute Efficiency: Self-attention across joint text–vision tokens incurs quadratic cost; innovations include composite attention mechanisms (EE-MLLM (Ma et al., 21 Aug 2024)) and visual mergers that reduce spatial token counts (She et al., 26 Jun 2024). Efficient methods achieve over 3× speedups in KV-cache prefilling (EE-MLLM: 79 ms vs. LLaVA: 277 ms at high-resolution input).
  • Parameter Efficiency: By freezing large backbones and only updating projection or adapter parameters (<2% of total), modern MLLMs permit rapid domain adaptation and minimize GPU requirements (Zhang et al., 24 Jan 2024).
  • Model Composition: Instead of costly joint re-training for new modalities, composing pre-trained modality experts by merging modality-specific encoders and adaptively combining LLM weights can yield universal MLLMs (“DAMC,” “NaiveMC”; (Chen et al., 20 Feb 2024)); see the sketch after this list. Dual-stream decoupling mitigates performance loss due to cross-modal parameter interference.
  • Mobile/Edge Deployment: Models under 1 B parameters (Vintern-1B, LaVy) demonstrate feasible on-device deployment with language and vision support while maintaining competitive accuracy on regional benchmarks (Doan et al., 22 Aug 2024, Tran et al., 11 Apr 2024).
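
As a rough picture of the composition idea, the sketch below merges the LLM backbones of two single-modality MLLMs by simple parameter averaging while keeping each modality's encoder and projector intact, loosely in the spirit of the NaiveMC baseline as summarized above. The actual methods (and DAMC's decoupled dual-stream variant in particular) are more involved; the function name, checkpoint filenames, and plain averaging rule are assumptions for illustration.

```python
import torch
from typing import Dict, List, Optional

def merge_llm_backbones(state_dicts: List[Dict[str, torch.Tensor]],
                        weights: Optional[List[float]] = None) -> Dict[str, torch.Tensor]:
    """Average the (architecturally identical) LLM backbone parameters of
    several single-modality MLLMs; modality-specific encoders and projectors
    are kept per-modality and are deliberately NOT merged."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged: Dict[str, torch.Tensor] = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Usage sketch with hypothetical checkpoint files:
# vision_llm = torch.load("vision_mllm_backbone.pt")   # assumed filename
# audio_llm  = torch.load("audio_mllm_backbone.pt")    # assumed filename
# universal  = merge_llm_backbones([vision_llm, audio_llm])
```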

5. Benchmarks, Metrics, and Performance Synthesis

Benchmarking MLLMs requires multi-faceted evaluation, spanning task-specific datasets (e.g., VQA-v2, OKVQA, RefCOCO, MSR-VTT, MUSIC-AVQA, OpenViVQA, Fashion200K) and broader suites such as MMBench, with metrics including accuracy, CIDEr for captioning, and Recall@K for retrieval. On these benchmarks, recent MLLMs outperform prior art:

  • On MUSIC-AVQA (Audio+Visual), DAMC achieves up to 57.32% (V+I+A), +19% over baseline.
  • On OpenViVQA (Vietnamese), LaVy achieves 35.2% accuracy, +7.3 pts over BLOOMZ-7B (Tran et al., 11 Apr 2024).
  • On Fashion200K, the multimodal retrieval model of (Barbany et al., 24 Apr 2024) reports large gains in recall.
  • On MMBench, EE-MLLM outpaces LLaVA-v1.6 by 3.9 pts in English (Ma et al., 21 Aug 2024).

6. Open Problems, Limitations, and Directions

Despite rapid progress, MLLMs face unresolved challenges:

  • Modality Alignment: Semantic “gaps” between modalities can yield hallucination, especially when continuous perceptual features (pixels, waveforms) are mapped to discrete token embedding spaces (Song et al., 2023, Carolan et al., 28 Mar 2024). Cross-modal contrast and self-supervised objectives mitigate but do not solve this fully.
  • Data Scarcity and Quality: High-quality multimodal data (especially for underrepresented domains or non-English contexts) remains a bottleneck; noisy web-crawled datasets can induce bias or degrade performance (Wang et al., 2 Aug 2024).
  • Interpretability and Debuggability: Multilayer fusion and deep attention produce non-transparent inference. Region-centric modules (MedRegA (Wang et al., 24 Oct 2024)) and region-based evaluation offer a promising path for visual grounding and explanation.
  • Scaling Laws and Robustness: Model and data scaling empirically improve downstream performance but with diminishing returns beyond certain thresholds. Improved data curation and self-supervision may yield greater dividends (Zhang et al., 24 Jan 2024, Xia et al., 2023).
  • Ethical/Privacy Considerations: Multimodal systems inherit data bias and privacy issues; hallucination, interpretability, deepfake generation, and model misuse necessitate robust filters, human-in-the-loop auditing, and differential privacy mechanisms (Song et al., 2023, Liang et al., 9 Nov 2024, Wang et al., 2 Aug 2024).

Promising avenues for research include self-supervised learning for video/motion/3D, mixture-of-experts architectures for scalable fusion, graph-centric alignment methods, retrieval-augmented generation (RAG) for grounded outputs, richer instruction-tuning pipelines, and continual/lifelong multimodal learning.

7. Impact, Applications, and Generalization

MLLMs underpin an expanding suite of AI capabilities:

  • Visual dialog, AI tutoring, content and document understanding, translation, voice assistants, interactive navigation and robotics, creative generation (images, music, video, 3D), and medical AI (reporting, region detection) (Wang et al., 24 Oct 2024, Barbany et al., 24 Apr 2024, Chen et al., 10 Jan 2025).
  • Specialized and regional models (LaVy, Vintern-1B) demonstrate strong performance in low-resource or non-English environments, broadening global accessibility.
  • Modular and compositional approaches (DAMC, NaiveMC) facilitate rapid scaling to new modalities with minimal retraining, allowing the field to evolve toward “any-to-any” multimodal agents (Chen et al., 20 Feb 2024).

MLLMs thus represent a significant advancement in AI, enabling integrated perception, multimodal reasoning, and cross-domain generalization. Their continuing evolution will be shaped by advances in scalable architectures, interpretability, data curation, and a deeper understanding of multimodal alignment and reasoning mechanisms.
