Multimodal Language Models Overview
- Multimodal language models are computational systems that integrate multiple data modalities by fusing the outputs of modality-specific encoders via attention or concatenation.
- They employ staged training objectives, including contrastive pretraining and supervised fine-tuning, to align modalities and generate coherent multimodal outputs.
- Current models excel in tasks like VQA and image captioning yet face challenges in grounding relational and logical expressions.
A multimodal LLM (MLM) is a computational system that jointly ingests, represents, and generates data over two or more modalities—most frequently text, images, and audio, but increasingly video, human motion, and 3D representations as well. MLMs extend LLMs to integrate modality-specific encoders via fusion or attention mechanisms, enabling richer context and greater versatility in linguistic, perceptual, and reasoning tasks. These models are built on tightly coupled architectural, optimization, and data-alignment frameworks that allow for intra- and cross-modal representation learning and serve as the backbone of current advances in general-purpose AI.
1. Definition and Architectural Foundations
A multimodal LLM is formally defined as a parameterized function
$$ f_\theta : (X_V, X_A, X_T) \mapsto Y, $$
where $X_V$ are visual inputs (image patches or frames), $X_A$ are audio features (spectrogram or waveform tokens), $X_T$ are language tokens, $Y$ is the generated output sequence, and $\theta$ spans all encoders, adapters, and the LLM component (Wu et al., 2023, Yin et al., 2023).
The dominant design paradigm for MLMs is modular: modality-specific encoders (e.g., CLIP for images (Wu et al., 2023), Whisper for audio (Koska et al., 8 Nov 2024)) extract high-dimensional feature sequences which are projected via trainable connectors into a common embedding space and fused, typically by concatenation or cross-attention, before or within a shared transformer or state-space backbone (e.g., LLaMA, Vicuna (Yin et al., 2023), Mamba (Qiao et al., 20 Mar 2024)).
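This connector-plus-backbone design can be made concrete in a few lines of PyTorch. A minimal early-fusion sketch, assuming a frozen patch-level vision encoder and an LLM that accepts precomputed input embeddings (both passed in as placeholders; all names here are illustrative rather than any specific model's API):

```python
import torch
import torch.nn as nn

class EarlyFusionMLM(nn.Module):
    """Minimal modular MLM: frozen encoder, trainable connector, shared LLM backbone."""

    def __init__(self, vision_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g., a CLIP-style ViT (kept frozen)
        self.llm = llm                                  # e.g., a decoder-only LLM taking input embeddings
        self.connector = nn.Linear(vis_dim, llm_dim)    # trainable projection into the LLM token space

    def forward(self, images, text_embeds):
        # 1. Encode the visual modality with its own (frozen) encoder.
        with torch.no_grad():
            vis_feats = self.vision_encoder(images)             # (B, N_patches, vis_dim)
        # 2. Project visual features into the LLM embedding space.
        vis_tokens = self.connector(vis_feats)                  # (B, N_patches, llm_dim)
        # 3. Early fusion: concatenate visual "proxy tokens" with the text embeddings.
        fused = torch.cat([vis_tokens, text_embeds], dim=1)     # (B, N_patches + T, llm_dim)
        # 4. Run the shared backbone over the fused sequence.
        return self.llm(inputs_embeds=fused)

# Toy usage with stand-ins: an identity "encoder" over pre-extracted patch features
# and a lambda in place of the LLM forward pass.
model = EarlyFusionMLM(vision_encoder=nn.Identity(),
                       llm=lambda inputs_embeds: inputs_embeds,
                       vis_dim=1024, llm_dim=4096)
out = model(torch.randn(2, 16, 1024), torch.randn(2, 8, 4096))  # -> (2, 24, 4096)
```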
Key fusion strategies include:
- Early fusion: concatenation or linear projection of modality embeddings before the backbone (Wu et al., 2023).
- Intermediate fusion: Intra-transformer cross-modal attention at selected layers (Yin et al., 2023, Koska et al., 8 Nov 2024).
- Adapter-based fusion: Modality-specific learned queries attend to encoded tokens, producing a fixed set of proxy tokens for the LLM (e.g., Q-Former (Garg et al., 2023)); a sketch follows this list.
- MoE-based fusion: Modality-aware routers dynamically dispatch tokens by modality (Xia et al., 6 Jun 2025).
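Adapter-based fusion is the least self-explanatory of these strategies: a small set of learned query vectors cross-attends to the encoder's output and compresses it into a fixed number of proxy tokens for the LLM. A minimal Q-Former-style sketch (dimensions, layer structure, and names are illustrative assumptions, not the BLIP-2 implementation):

```python
import torch
import torch.nn as nn

class QueryAdapter(nn.Module):
    """Q-Former-style adapter: learned queries cross-attend to encoded modality tokens."""

    def __init__(self, num_queries=32, dim=768, num_heads=12, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.to_llm = nn.Linear(dim, llm_dim)   # project proxy tokens into the LLM embedding space

    def forward(self, modality_tokens):
        # modality_tokens: (B, N, dim) output of a frozen image/audio encoder
        q = self.queries.expand(modality_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, modality_tokens, modality_tokens)
        proxy = attended + self.ffn(attended)    # (B, num_queries, dim)
        return self.to_llm(proxy)                # fixed-length proxy tokens for the LLM

# Usage: 257 encoder patch tokens are compressed into 32 proxy tokens.
adapter = QueryAdapter()
proxy_tokens = adapter(torch.randn(2, 257, 768))   # -> (2, 32, 4096)
```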
2. Training Objectives and Model Optimization
The optimization of MLMs is typically staged:
- Pretraining (Multimodal Alignment):
- Contrastive objectives (e.g., CLIP-style InfoNCE) align paired samples by maximizing the similarity of matched image-text or audio-text pairs relative to in-batch negatives, via a symmetric cross-entropy over the similarity matrix (Wu et al., 2023, Yin et al., 2023); see the sketch after this list.
- Masked or autoregressive language modeling, extended to cover interleaved multimodal token streams, imposes generative pressure and cross-modal context integration (Zhang et al., 24 Jan 2024, Koska et al., 8 Nov 2024).
- For generative modalities (T2I, T2M, T2V), diffusion objectives or GAN losses minimize reconstruction or denoising error conditioned on text-derived embeddings (Han et al., 29 May 2025).
- Instruction and Supervised Fine-Tuning:
- Supervised next-token cross-entropy over multimodal dialogue and QA pairs (either image-conditioned or interleaved) (Garg et al., 2023, Lyu et al., 2023, Ye et al., 19 Aug 2024).
- Multimodal instruction corpora (e.g., LLaVA, ShareGPT4V, ALLaVA) enable open-ended, compositional tasks beyond pure captioning or classification (Yin et al., 2023, Koska et al., 8 Nov 2024).
- Alignment Tuning:
- Reward-based strategies such as RLHF or DPO align model outputs to human ratings or preferences (Yin et al., 2023, Han et al., 29 May 2025).
- Auxiliary losses such as visual representation alignment (e.g., VisionLoss) ensure semantic fidelity at patch or region level (Ghatkesar et al., 8 May 2025).
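The CLIP-style contrastive objective referenced above reduces to a symmetric cross-entropy over the batch similarity matrix. A minimal sketch, assuming `img_emb` and `txt_emb` are already pooled per-sample embeddings from the two encoders:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)           # (B, D) unit-norm image embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)           # (B, D) unit-norm text embeddings
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = matched pairs
    # Cross-entropy in both directions: image-to-text and text-to-image retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: a batch of 8 paired embeddings.
loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))
```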
Advanced optimization includes parameter-efficient fine-tuning (LoRA (Koska et al., 8 Nov 2024)), quantization-aware training for on-device deployment (Koska et al., 8 Nov 2024), and training-free composition by weight merging (Chen et al., 20 Feb 2024).
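As an example of the parameter-efficient route, LoRA keeps pretrained projections frozen and learns only a low-rank residual. A minimal sketch of the idea (not the `peft` library API; names and defaults are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze the pretrained projection
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())

# Usage: wrapping a 4096x4096 projection leaves only the ~65K low-rank parameters trainable.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 10, 4096))
```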
3. Core Modalities and Generative Capabilities
Current MLMs span a range of generative and discriminative tasks, tightly coupled to their input-output modality space (Han et al., 29 May 2025):
| Generation Type | Representative Architectures | Key Techniques |
|---|---|---|
| Text-to-Text (T2T) | GPT, LLaMA, Mixtral, Vicuna | Autoregressive Transformer, MoE |
| Text-to-Image (T2I) | DALL·E 2, Stable Diffusion, GILL | CLIP embedding, Diffusion, CrossAttn |
| Text-to-Audio/Music | AudioLM, Jukebox, EAGLE-A | CLAP embedding, Latent Diffusion |
| Text-to-Video (T2V) | Make-A-Video, Sora, Emu-Video | Spatiotemporal Diffusion, Recaptioning |
| Text-to-Human-Motion | MotionDiffuse, MotionGPT, GenM3 | Motion VQ-VAE, Diffusion, CrossAttn |
| Text-to-3D | DreamFusion, Shap-E, VolumeDiffusion | 2D diffusion prior, consistency loss |
Text and vision remain the best-aligned modalities, but state-of-the-art models (e.g., EAGLE, Macaw-LLM) process text, images, audio, and video, and support multi-turn, interleaved input/output sequences (Koska et al., 8 Nov 2024, Lyu et al., 2023). Mixture-of-Experts architectures further enable scalable, specialization- or modality-driven routing (Xia et al., 6 Jun 2025, Han et al., 29 May 2025).
4. Empirical Performance and Limitations
MLMs set new records on a range of standardized benchmarks:
- VQA (Visual QA): InstructBLIP ≈80%, LLaVA-1.5 ≈80%, Qwen-VL-Chat 78% (Wang et al., 2 Aug 2024, Yin et al., 2023).
- Image Captioning (COCO CIDEr): InstructBLIP ≈110, MiniGPT-4 ≈105 (Wang et al., 2 Aug 2024).
- ScienceQA: EAGLE (4.3B) 94.6%, Gemini 1.5 Pro 96.1% (Koska et al., 8 Nov 2024).
- Audio ASR: EAGLE 2.6% WER, Qwen-Audio ≈5% (Koska et al., 8 Nov 2024, Wang et al., 2 Aug 2024).
However, studies directly interrogating the neurocognitive fidelity and experiential grounding of MLMs have found unexpected limitations. Bavaresco and colleagues (Bavaresco et al., 1 Apr 2025) show that language-only models (BERT, SimCSE) outperform contrastively trained multimodal models (MCSE, VisualBERT, CLAP) in capturing both human-experiential concept structure (Exp48 semantic norms) and fMRI semantic-network patterns. All language-only vs. multimodal differences in their study are significant after Bonferroni correction, with BERT showing the highest alignment to both the normed experiential vectors and the human brain RDMs. Notably, LM representations explain more unique, brain-relevant semantic variance than any of the evaluated multimodal models.
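Brain alignment in such studies is typically measured with representational similarity analysis: a representational dissimilarity matrix (RDM) over concept embeddings is correlated with the fMRI-derived RDM. A minimal sketch of that generic procedure (the distance metrics and preprocessing in the cited study may differ):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings):
    """Representational dissimilarity matrix as condensed pairwise cosine distances."""
    return pdist(embeddings, metric="cosine")   # shape: (n_concepts * (n_concepts - 1) / 2,)

def rsa_score(model_embeddings, brain_rdm):
    """Spearman correlation between a model RDM and a brain (or behavioral-norm) RDM."""
    rho, _ = spearmanr(rdm(model_embeddings), brain_rdm)
    return rho

# Usage with toy data: 48 concepts, 768-dim model embeddings, a precomputed brain RDM.
model_emb = np.random.rand(48, 768)
brain_rdm = pdist(np.random.rand(48, 20), metric="cosine")
print(rsa_score(model_emb, brain_rdm))
```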
Qualitative analysis reveals that MLMs ground only a few function-word categories (subject and possessive pronouns, relative wh-words) in text-to-image generation, failing on quantifiers, spatial prepositions, negation, and logical connectives (Sonkar et al., 2022). This indicates a gap in relational, set-theoretic, and logical reasoning, especially for word classes that lack a direct perceptual correlate.
5. Cross-Modal Synergies, Model Composition, and Scalability
Model composition enables the construction of versatile MLMs from a set of independently trained unimodal or multimodal models. The NaiveMC and DAMC approaches aggregate modality-specific encoders and merge shared LLM weights (either by simple averaging or with task-driven adaptive weighting), producing a single model capable of zero-shot inference across any union of input modalities without further joint training (Chen et al., 20 Feb 2024). DAMC’s decoupled architecture maintains modality-specific and shared weights, supporting extensibility and robust cross-modal generalization. On MCUB, DAMC achieves up to 60% accuracy on four-modality tasks, outperforming single-modality and non-composed baselines.
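The core merging step can be sketched as follows, under the assumption that parameters are keyed by name and shared-LLM weights carry a common prefix (naming conventions here are illustrative, not the paper's code):

```python
import torch

def merge_models(state_dicts, shared_prefix="llm."):
    """Compose unimodal MLMs: average shared-LLM weights, keep modality-specific weights intact."""
    merged = {}
    all_keys = set().union(*[sd.keys() for sd in state_dicts])
    for key in all_keys:
        owners = [sd for sd in state_dicts if key in sd]
        if key.startswith(shared_prefix):
            # Shared backbone parameter: simple average (DAMC would learn adaptive coefficients).
            merged[key] = torch.stack([sd[key].float() for sd in owners]).mean(dim=0)
        else:
            # Modality-specific encoder/connector parameter: only one model owns it.
            merged[key] = owners[0][key]
    return merged

# Toy usage: two models share "llm.w" but own different modality connectors.
m1 = {"llm.w": torch.ones(2, 2), "vision_proj.w": torch.ones(2, 2)}
m2 = {"llm.w": torch.zeros(2, 2), "audio_proj.w": torch.ones(2, 2)}
merged = merge_models([m1, m2])   # llm.w becomes the element-wise average
```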
Mixture-of-Experts (MoE) systems such as SMAR (Soft Modality-Aware Routing) deploy a learned, symmetric KL-divergence penalty on expert router distributions to flexibly dissociate or overlap expert utilization by modality, preserving language-only task performance (86.6% retention at only 2.5% pure-text data) while retaining strong multimodal abilities (Xia et al., 6 Jun 2025). These innovations allow scalable, efficient, and modular expansion to new modalities.
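The routing regularizer can be paraphrased as a symmetric KL divergence between the average expert-routing distributions of text and non-text tokens, added to the task loss with a small weight. A minimal sketch of that idea (the exact SMAR formulation and hyperparameters may differ):

```python
import torch
import torch.nn.functional as F

def symmetric_kl_routing_penalty(router_logits, is_text_token):
    """Symmetric KL between mean expert-routing distributions of text vs. non-text tokens."""
    probs = F.softmax(router_logits, dim=-1)              # (num_tokens, num_experts)
    p_text = probs[is_text_token].mean(dim=0)             # average routing of text tokens
    p_other = probs[~is_text_token].mean(dim=0)           # average routing of image/audio tokens
    kl = lambda p, q: torch.sum(p * (torch.log(p + 1e-8) - torch.log(q + 1e-8)))
    return 0.5 * (kl(p_text, p_other) + kl(p_other, p_text))

# Usage: 16 tokens routed over 8 experts, first half text, second half image.
logits = torch.randn(16, 8)
mask = torch.tensor([True] * 8 + [False] * 8)
penalty = symmetric_kl_routing_penalty(logits, mask)   # added to the task loss with a small weight
```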
6. Persistent Challenges and Research Frontiers
Despite rapid progress, several critical limitations remain:
- Data quality and scaling: High-quality, diverse multimodal data—especially for less-resourced modalities—is scarce. Overfitting to language priors, hallucinations, and lack of explicit modality grounding remain widespread (Ghatkesar et al., 8 May 2025, Wang et al., 2 Aug 2024).
- Alignment to human cognition: Current contrastively-trained VLMs exhibit inferior alignment to experiential grounding and brain activation patterns relative to supervised LMs (Bavaresco et al., 1 Apr 2025).
- Relational/functional word semantics: MLMs fail to robustly ground spatial, logical, and quantificational function words, which is vital for compositional reasoning (Sonkar et al., 2022).
- Scalable, modular, and continual learning: Efficient expansion to new modalities and tasks, combined with preservation of earlier capabilities, requires new parameter-isolation, merging, and continual learning strategies (Chen et al., 20 Feb 2024, Zhang et al., 24 Jan 2024).
Research priorities include integrating objectives that target brain-relevant experiential structure and cross-modal alignment, advancing model composition frameworks, and constructing datasets that explicitly probe multimodal reasoning beyond surface alignment (e.g., spatial, logical, 3D categories) (Bavaresco et al., 1 Apr 2025, Ghatkesar et al., 8 May 2025, Sonkar et al., 2022).
References:
- (Wu et al., 2023) Multimodal LLMs: A Survey
- (Yin et al., 2023) A Survey on Multimodal LLMs
- (Koska et al., 8 Nov 2024) Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small LLM
- (Bavaresco et al., 1 Apr 2025) Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than LLMs?
- (Chen et al., 20 Feb 2024) Model Composition for Multimodal LLMs
- (Xia et al., 6 Jun 2025) SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal LLMs Preserving Language Capabilities
- (Sonkar et al., 2022) A Visual Tour Of Current Challenges In Multimodal LLMs
- (Ghatkesar et al., 8 May 2025) Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
- (Qiao et al., 20 Mar 2024) VL-Mamba: Exploring State Space Models for Multimodal Learning
- (Lyu et al., 2023) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
- (Garg et al., 2023) On the Performance of Multimodal LLMs