
Multimodal Large Language Models

Updated 11 November 2025
  • Multimodal Large Language Models are neural architectures that process text, images, video, and audio to enable unified cross-modal understanding.
  • They combine modality-specific encoders with a frozen LLM backbone using adapter modules and fusion layers for efficient data integration.
  • Advancements leverage contrastive pre-training, instruction tuning, and scalable architectures to support tasks like visual Q&A and text-to-video synthesis.

Multimodal LLMs (MM-LLMs) are a class of neural architectures that extend large-scale LLMs to incorporate non-textual modalities—including images, video, audio, and others—enabling cross-modal understanding, reasoning, and generation. MM-LLMs represent a convergence of progress in contrastive and generative modeling, transformer architectures, parameter-efficient transfer, and large-scale multimodal corpus curation. The field has rapidly evolved to support diverse applications, ranging from visual question answering and chart interpretation to text-conditioned video and audio synthesis, all within a unified, instruction-following framework.

1. Definitions, Scope, and Taxonomy

MM-LLMs generalize pre-trained text-only LLMs by equipping them with additional modality-specific encoders (e.g., vision transformers, audio encoders) and adapters. The defining characteristic is the ability to jointly process, align, and generate across heterogeneous input/output streams—including but not limited to:

  • Text (natural language tokens)
  • Images (patch-based embeddings or region descriptors)
  • Video (spatio-temporal or frame sequences)
  • Audio (raw waveform, spectrogram, or discrete tokens)
  • Higher-dimensional data (motion capture, 3D objects, physiological signals)

A taxonomy of MM-LLMs can be organized along functional and architectural axes:

| Functional Branch | Modalities | Representative Examples |
| --- | --- | --- |
| MM to Text (VL) | I+T→T, V+T→T, ... | BLIP-2, LLaVA, VideoChat, Qwen-Audio |
| MM to MM (generation) | T→I, T→V, ... | Emu, GILL, Kosmos-2, NExT-GPT, CoDi-2 |
| Any-to-Any | All combinations | GPT-4(V), Gemini, ModaVerse |
| Tool-using Agents | MM + API/Tool | HuggingGPT, EmbodiedGPT, Visual-ChatGPT |

(She et al., 26 Jun 2024, Han et al., 29 May 2025, Caffagni et al., 19 Feb 2024, Carolan et al., 28 Mar 2024, Zhang et al., 24 Jan 2024)

2. Core Architectural Principles

MM-LLMs consistently adopt a modular architecture composed of:

  • Frozen pre-trained backbone LLM: typically a large transformer (e.g., LLaMA, Vicuna, Flan-T5, Qwen), which remains mostly unchanged, with lightweight adapters inserted as required.
  • Modality encoders: vision transformers (ViT, EVA-CLIP), audio encoders (wav2vec), or video-specific backbones generate intermediate features from raw modal input. These are often frozen to preserve localization and semantic structure acquired during upstream contrastive pre-training.
  • Adapter/projection modules: linear layers, MLPs, or Q-Formers project modality-specific encodings into the LLM’s embedding space (a minimal projection-adapter sketch follows this list).
  • Fusion mechanism: cross-modal attention blocks, interleaved with self-attention, permit weighted integration of visual/audio and text tokens. Adapter design variants include single-layer projectors (MammothModa), Q-Formers, or more elaborate cross-attention interleaving (Flamingo-type).
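
A minimal sketch of the projection adapter referenced above, assuming a frozen vision encoder emitting 1024-dimensional patch features and an LLM with a 4096-dimensional embedding space; the dimensions and the two-layer MLP design are illustrative assumptions, not a specific published implementation.

```python
import torch
import torch.nn as nn

class ProjectionAdapter(nn.Module):
    """Two-layer MLP mapping frozen vision features into the LLM token space.
    Dimensions are illustrative (e.g., a ViT-style encoder -> a 4096-d LLM)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

# The projected "visual tokens" are concatenated with text token embeddings
# before being fed to the (mostly frozen) LLM backbone.
visual_tokens = ProjectionAdapter()(torch.randn(2, 256, 1024))
```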

For image and video tasks, context window extension is a major consideration. Techniques such as Visual Merger (mean-pooling to reduce visual token count) and Frame Position IDs (shared position embedding per frame) are adopted to allow scalability to high-resolution and long-duration inputs, without overburdening positional encoding tables or incurring prohibitive sequence lengths (She et al., 26 Jun 2024, Zou et al., 27 Sep 2024).
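
A hedged sketch of the two token-reduction ideas just described, operating on per-frame patch features; the pooling window and the one-ID-per-frame scheme are illustrative choices, not the exact MammothModa implementation.

```python
import torch

def visual_merger(frame_tokens: torch.Tensor, merge: int = 4) -> torch.Tensor:
    """Mean-pool groups of `merge` adjacent visual tokens to shrink sequence length.
    frame_tokens: (num_frames, num_patches, dim); num_patches divisible by merge."""
    f, p, d = frame_tokens.shape
    return frame_tokens.view(f, p // merge, merge, d).mean(dim=2)

def frame_position_ids(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Assign one shared position ID per frame, so the positional table grows with
    the number of frames rather than with the total number of visual tokens."""
    return torch.arange(num_frames).repeat_interleave(tokens_per_frame)

tokens = visual_merger(torch.randn(8, 256, 1024))   # (8, 64, 1024)
pos_ids = frame_position_ids(8, tokens.shape[1])    # 512 IDs, values 0..7
```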

| Component | Example Realization |
| --- | --- |
| Vision Encoder | CLIP-ViT-L/14 (frozen), GLHR tiling |
| Adapter | Linear projection, Q-Former, 2-layer MLP |
| LLM Backbone | LLaMA, Vicuna, Flan-T5, Qwen |
| Fusion Layer | Cross-modal self-attention, Visual Experts |
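
The fusion layer listed above can be sketched as a cross-modal attention block in which text hidden states attend to projected visual tokens; the residual-plus-LayerNorm layout shown here is a common Flamingo-style pattern and is an assumption, not a description of any single model.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text tokens (queries) attend to visual tokens (keys/values)."""
    def __init__(self, dim: int = 4096, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (batch, text_len, dim); visual: (batch, vis_len, dim)
        fused, _ = self.attn(self.norm(text), visual, visual)
        return text + fused  # residual keeps the frozen LLM's text stream intact

out = CrossModalBlock()(torch.randn(2, 32, 4096), torch.randn(2, 64, 4096))
```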

3. Training Methodologies and Data Curation

MM-LLMs are typically trained by a multi-phase pipeline designed to scale across both generalization and fine-grained cross-modal alignment:

  1. Contrastive Pre-training: The modality encoders are (pre-)trained using contrastive losses (e.g., CLIP’s InfoNCE). For vision, this involves aligning image patches or tiles with corresponding text descriptions.
  2. Vision-Language Alignment Phase: The LLM’s fusion modules (adapter/projection + cross-modal attention) are trained with cross-entropy objectives over large image–text and/or video–text datasets. The visual encoder is mostly frozen, with layer-wise decayed learning rates if unfrozen.
  3. Multi-task/Instruction Tuning: The LLM, adapters, and MoE components are further tuned on diverse, instruction-formatted multimodal corpora. This phase involves both task-agnostic instructions (captioning, VQA) and specialized tasks (object grounding, chart VQA, visual math).
  4. Supervised Fine-Tuning: High-quality, manually-annotated or filtered (bilingual, hallucination-reduced) datasets are used to refine task-specific capabilities and reduce modal hallucinations.
  5. Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, or LayerNorm tuning is preferred for practical scalability, allowing most LLM parameters to remain frozen (a minimal LoRA sketch follows this list).
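
A minimal from-scratch LoRA layer, as referenced in step 5; the rank, scaling factor, and the choice of wrapping a single linear layer are illustrative assumptions, and practical pipelines would typically rely on an existing PEFT library instead.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a low-rank trainable update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the backbone weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(2, 16, 4096))              # only lora_a / lora_b receive gradients
```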

Data curation emphasizes automatic language identification, coherence filtering, and hallucination screening (via manual spot checks and automated metrics) to minimize instances where visual content and accompanying text are misaligned. High-quality bilingual (EN/CN) curation is used to improve generalization while keeping hallucination rates low (She et al., 26 Jun 2024).

4. Foundational Learning Techniques and Modalities

MM-LLM training incorporates four foundational paradigms (Han et al., 29 May 2025):

  • Self-Supervised Learning (SSL): Masked modeling and contrastive losses. For masked modeling, $L_{\text{mask}} = \mathbb{E}_{x \sim D}\left[\lVert f_\theta(\text{mask}(x)) - x \rVert^2\right]$; for contrastive learning (as in CLIP), $L_{\text{contrastive}} = -\sum_{i=1}^N \log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(z_i, z_j)/\tau)}$ (a code sketch follows this list).
  • Mixture-of-Experts (MoE): Task/adaptor-specialized routing via gated attention, easing modular extension.
  • Reinforcement Learning from Human Feedback (RLHF): Trains a reward model on human preference and optimizes via policy gradients (PPO).
  • Chain-of-Thought Prompting (CoT): Explicit stepwise reasoning in both text and non-text outputs, encouraging intermediate representations such as sketches, skeletons, or latent plans.
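
A minimal sketch of the CLIP-style contrastive objective written out above, assuming a batch of N matching image-text embedding pairs; the symmetric image-to-text plus text-to-image form used here is the common CLIP variant.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img: torch.Tensor, txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img, txt: (N, d) embeddings of N matching image-text pairs."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / tau                   # sim(z_i, z_j) / tau
    targets = torch.arange(img.shape[0])         # positives lie on the diagonal
    # symmetric InfoNCE: image-to-text and text-to-image cross-entropy
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = clip_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
```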

Generative Modalities extend far beyond text and images:

| Modality | Example Task | Model Mechanisms |
| --- | --- | --- |
| Text-to-Image (T2I) | Image synthesis from text prompt | Diffusion, autoregressive transformer |
| Text-to-Video (T2V) | Video synthesis with temporal consistency | Spatio-temporal transformer, Video Q-Former |
| Text-to-Music (T2M) | Waveform/MIDI generation | Transformer, latent diffusion |
| Text-to-3D (T2-3D) | NeRF/mesh/point cloud generation | 3D-aware diffusion, spec. distillation |
| Text-to-Human Motion | Synthetic motion capture, animation | Diffusion/tokenizer-based models |

(Han et al., 29 May 2025, Caffagni et al., 19 Feb 2024)

5. Benchmarking and Empirical Performance

MM-LLMs are evaluated on a wide spectrum of multimodal benchmarks encompassing varied input length, complexity, and modality mix. Key tasks include:

  • Visual Question Answering (VQAv2, OKVQA, GQA, TextVQA, MMBench, HallucinationBench, POPE)
  • Image Captioning (MS COCO, NoCaps, Flickr30k; CIDEr, BLEU-4, CLIPScore)
  • Visual Grounding and OCR (RefCOCO, RefCOCO+, RefCOCOg, OCRBench)
  • Video QA and Generation (MSR-VTT-QA, VideoVista, LongVideoBench)
  • Chart and Math VQA (ChartQA, MathVista)
  • Multimodal generation (text→video, text→audio; AudioCaps, AudioGPT)

Performance of leading MM-LLMs (including MammothModa) on vision-language benchmarks is summarized as follows (She et al., 26 Jun 2024):

| Method | Avg | MMBench | MMStar | MMMU | MathVista | Hall.Bench | AI2D | OCRBench | MMVet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o (2024-05) | 69.9 | 82.2 | 63.9 | 69.2 | 61.3 | 55.0 | 84.6 | 736 | 69.1 |
| MammothModa | 61.2 | 81.04 | 56.27 | 54.4 | 54.7 | 44.57 | 81.0 | 614 | 56.06 |

MammothModa places in the top four on virtually all core multimodal leaderboards, without the need for elaborate architectural or training "bells and whistles". No explicit confidence intervals or statistical significance are reported (She et al., 26 Jun 2024).

6. Scalability, Limitations, and Open Challenges

Key architectural innovations (e.g., Visual Merger, Frame Position IDs) enable MM-LLMs such as MammothModa to scale to high-resolution images and long-duration videos by compressing spatial and temporal redundancy in the input. The shared Frame Position ID strategy avoids costly positional interpolation, reducing computational requirements while accommodating the longer context windows that multimodal data demand.

Identified limitations include:

  • Trade-off between fine-grained spatial/temporal discrimination and model scalability, as evidenced by small performance dips on tasks requiring high positional precision (She et al., 26 Jun 2024).
  • Simple pooling/merging techniques may discard task-critical local detail; future directions include adaptive or learned token merging.
  • Vision backbone is typically frozen; joint training or fine-tuned adapters could further enhance modal alignment and downstream flexibility.
  • Current bilingual strategies are limited to English/Chinese; extension to additional languages and code-switching scenarios is necessary for broader coverage.
  • Universal benchmarks for multimodal coherence (across image, video, audio, etc.) and robust evaluation metrics remain open challenges, as current scores (e.g., FID, CIDEr) do not reliably track cross-modal quality (Han et al., 29 May 2025, Zou et al., 27 Sep 2024).

7. Future Directions and Prospects

Advancements in MM-LLMs are contingent on progress in several domains:

  • Data and Corpus Expansion: Curation of high-quality, multilingual, and richly annotated multimodal corpora, including audio, 3D, and long-form video (She et al., 26 Jun 2024, Han et al., 29 May 2025).
  • Adaptive Modular Architectures: Increased exploration of mixture-of-experts backbones, modular expert specialization for domain or modality, and scalable expert routing (Li et al., 5 Aug 2024, Han et al., 29 May 2025).
  • Improved Alignment and Reasoning: New self-supervised objectives for underrepresented modalities, explicit chain-of-thought reasoning over non-text outputs, multi-grained region alignment (e.g., MMGiC), and pretext tasks targeted at cognitive-level semantics (Xu et al., 8 Dec 2024, Zhang et al., 23 Apr 2025).
  • Robustness, Hallucination, and Trustworthiness: Enhanced filtering, instruction tuning, and RLHF with a focus on reducing hallucinations and improving self-awareness in perception (Wang et al., 15 Jan 2024).
  • Efficient Inference, Serving, and Deployment: Adaptive serving paradigms (e.g., Elastic Multimodal Parallelism), parameter-efficient tuning via LoRA/QLoRA, scalable, low-latency pipelines for real-world mixed-modality applications (Liu et al., 14 Jul 2025).

A plausible implication is that as MM-LLMs continue to diversify modality coverage and reasoning capability—paired with robust, architecture-agnostic evaluation suites and cost-effective deployment—they will form the basis for universal, context-aware, and generative AI systems across language, vision, audio, and emergent data modalities.
