
Multimodal Language Models Overview

Updated 13 December 2025
  • Multimodal language models are computational systems that integrate various data modalities using modality-specific encoders fused via attention or concatenation strategies.
  • They employ staged training objectives, including contrastive pretraining and supervised fine-tuning, to align and generate coherent multimodal outputs.
  • Current models excel in tasks like VQA and image captioning yet face challenges in grounding relational and logical expressions.

A multimodal LLM (MLM) is a computational system that jointly ingests, represents, and generates data over two or more modalities—most frequently text, images, and audio, but increasingly video, human motion, and 3D representations as well. MLMs extend LLMs to integrate modality-specific encoders via fusion or attention mechanisms, enabling richer context and greater versatility in linguistic, perceptual, and reasoning tasks. These models are built on tightly coupled architectural, optimization, and data-alignment frameworks that allow for intra- and cross-modal representation learning and serve as the backbone of current advances in general-purpose AI.

1. Definition and Architectural Foundations

A multimodal LLM $f$ is formally defined as a parameterized function

$$\hat{Y} = f(V, A, T; \theta)$$

where $V = \{v_1, \dots, v_{n_v}\}$ are visual inputs (image patches or frames), $A = \{a_1, \dots, a_{n_a}\}$ are audio features (spectrogram or waveform tokens), $T = \{t_1, \dots, t_{n_t}\}$ are language tokens, and $\theta$ spans all encoders, adapters, and the LLM component (Wu et al., 2023, Yin et al., 2023).

The dominant design paradigm for MLMs is modular: modality-specific encoders (e.g., CLIP for images (Wu et al., 2023), Whisper for audio (Koska et al., 8 Nov 2024)) extract high-dimensional feature sequences, which trainable connectors project into a common embedding space. The projected features are then fused, typically by concatenation or cross-attention, before or within a shared transformer or state-space backbone (e.g., LLaMA, Vicuna (Yin et al., 2023), Mamba (Qiao et al., 20 Mar 2024)).

Key fusion strategies include:

  • Concatenation (early fusion): projected modality tokens are prepended to or interleaved with the language token sequence and processed jointly by the backbone.
  • Cross-attention (deep fusion): language hidden states attend to projected modality features through dedicated attention layers inserted into the backbone.
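
Below is a minimal PyTorch sketch of these two fusion strategies; the encoder outputs, dimensions, and module names are illustrative assumptions rather than the configuration of any cited model.

```python
# A minimal sketch of the two dominant fusion strategies, assuming PyTorch.
# Encoder outputs, dimensions, and module names are illustrative placeholders.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Early fusion: project modality features into the LLM embedding space
    and concatenate them with the text token embeddings."""
    def __init__(self, d_vision=1024, d_audio=768, d_model=4096):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_model)  # trainable connector
        self.audio_proj = nn.Linear(d_audio, d_model)    # trainable connector

    def forward(self, vision_feats, audio_feats, text_embeds):
        # vision_feats: (B, n_v, d_vision), audio_feats: (B, n_a, d_audio),
        # text_embeds: (B, n_t, d_model) from the LLM's embedding table
        v = self.vision_proj(vision_feats)
        a = self.audio_proj(audio_feats)
        return torch.cat([v, a, text_embeds], dim=1)  # sequence fed to the backbone

class CrossAttnFusion(nn.Module):
    """Deep fusion: text hidden states attend to projected modality features
    inside the backbone, with a residual connection."""
    def __init__(self, d_model=4096, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_hidden, modality_feats):
        fused, _ = self.attn(query=text_hidden, key=modality_feats, value=modality_feats)
        return text_hidden + fused
```
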
2. Training Objectives and Model Optimization

The optimization of MLMs is typically staged:

  1. Pretraining (Multimodal Alignment):
    • Contrastive objectives (e.g., CLIP-style InfoNCE) align paired samples, maximizing cosine similarity or minimizing cross-entropy between matched image-text or audio-text pairs (Wu et al., 2023, Yin et al., 2023); a minimal loss sketch follows this list.
    • Masked or autoregressive language modeling, extended to cover interleaved multimodal token streams, imposes generative pressure and cross-modal context integration (Zhang et al., 24 Jan 2024, Koska et al., 8 Nov 2024).
    • For generative modalities (T2I, T2M, T2V), diffusion objectives or GAN losses minimize reconstruction or denoising error conditioned on text-derived embeddings (Han et al., 29 May 2025).
  2. Instruction and Supervised Fine-Tuning: the aligned model is typically fine-tuned on curated multimodal instruction-response data (e.g., visual question answering, captioning, and dialogue pairs) so that it follows task instructions across modalities.
  3. Alignment Tuning: preference-based optimization (e.g., RLHF or DPO) is typically applied to further align generations with human judgments and mitigate hallucination.
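
As an illustration of the contrastive pretraining objective above, here is a minimal sketch of a CLIP-style InfoNCE loss; the temperature and symmetric formulation are common choices, not necessarily those of the cited works.

```python
# A minimal CLIP-style InfoNCE loss, assuming PyTorch; the temperature and
# symmetric formulation are common choices, not those of any specific cited model.
import torch
import torch.nn.functional as F

def infonce_loss(image_embeds, text_embeds, temperature=0.07):
    # image_embeds, text_embeds: (B, d), paired row-wise
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own caption (and vice versa): symmetric cross-entropy
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```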

Advanced optimization includes parameter-efficient fine-tuning (LoRA (Koska et al., 8 Nov 2024)), quantization-aware training for on-device deployment (Koska et al., 8 Nov 2024), and training-free composition by weight merging (Chen et al., 20 Feb 2024).
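
A minimal sketch of a LoRA-style low-rank adapter wrapped around a frozen linear layer follows; the rank, scaling, and initialization here are illustrative assumptions.

```python
# A minimal LoRA-style low-rank adapter around a frozen linear layer, assuming
# PyTorch; rank, scaling, and initialization are illustrative assumptions.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        # Low-rank update added to the frozen projection
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```
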

3. Core Modalities and Generative Capabilities

Current MLMs span a range of generative and discriminative tasks, tightly coupled to their input-output modality space (Han et al., 29 May 2025):

Generation Type        Representative Architectures            Key Techniques
Text-to-Text (T2T)     GPT, LLaMA, Mixtral, Vicuna             Autoregressive Transformer, MoE
Text-to-Image (T2I)    DALL·E 2, Stable Diffusion, GILL        CLIP embedding, Diffusion, CrossAttn
Text-to-Audio/Music    AudioLM, Jukebox, EAGLE-A               CLAP embedding, Latent Diffusion
Text-to-Video (T2V)    Make-A-Video, Sora, Emu-Video           Spatiotemporal Diffusion, Recaptioning
Text-to-Human-Motion   MotionDiffuse, MotionGPT, GenM3         Motion VQ-VAE, Diffusion, CrossAttn
Text-to-3D             DreamFusion, Shap-E, VolumeDiffusion    2D diffusion prior, consistency loss

Text and vision remain the best-aligned modalities, but state-of-the-art models (e.g., EAGLE, Macaw-LLM) process text, images, audio, and video, and can produce multi-turn, interleaved input/output sequences (Koska et al., 8 Nov 2024, Lyu et al., 2023). Mixture-of-Experts architectures further enable scalable, specialization- or modality-driven routing (Xia et al., 6 Jun 2025, Han et al., 29 May 2025).

4. Empirical Performance and Limitations

MLMs report strong, often state-of-the-art results across a range of standardized multimodal benchmarks, including visual question answering (VQA) and image captioning.

However, studies directly interrogating the neurocognitive fidelity and experiential grounding of MLMs have found unexpected limitations. Bavaresco and Fernández (Bavaresco et al., 1 Apr 2025) show that language-only models (BERT, SimCSE) outperform contrastive vision-language models (MCSE, VisualBERT, CLAP) in capturing both human-experiential concept structure (Exp48 semantic norms) and fMRI semantic network patterns. All language-only vs. multimodal differences in their paper are significant ($p < .05$ after Bonferroni correction), with BERT demonstrating the highest alignment to both normed experiential vectors ($\rho = 0.53$) and human brain RDMs ($\rho = 0.23$). Notably, LM representations explain more unique, brain-relevant semantic variance than any of the evaluated multimodal models.
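
A simplified sketch of the representational-similarity comparison underlying such alignment scores: build a representational dissimilarity matrix (RDM) from model embeddings and correlate it with a reference RDM (e.g., experiential norms or fMRI-derived dissimilarities) via Spearman's rho. Function names, shapes, and the distance metric are illustrative assumptions, not the exact pipeline of the cited study.

```python
# A simplified representational-similarity comparison: correlate a model RDM
# with a reference RDM (experiential norms or fMRI-derived dissimilarities).
# Function names, shapes, and distance metric are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    # embeddings: (n_concepts, d) -> condensed vector of pairwise cosine distances
    return pdist(embeddings, metric="cosine")

def rsa_score(model_embeddings: np.ndarray, reference_rdm: np.ndarray) -> float:
    # Spearman rho between the model's RDM and the reference RDM
    rho, _ = spearmanr(rdm(model_embeddings), reference_rdm)
    return rho
```
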

Qualitative analysis reveals that MLMs only model select function word categories (subject and possessive pronouns, relative wh-words) in grounded text-to-image generation, failing on quantifiers, spatial prepositions, negation, and logical connectives (Sonkar et al., 2022). This indicates a gap in their ability to reason over relational, set-theoretic, or logical aspects of language, especially for function words without direct perceptual correlates.

5. Cross-Modal Synergies, Model Composition, and Scalability

Model composition enables the construction of versatile MLMs from a set of independently trained unimodal or multimodal models. The NaiveMC and DAMC approaches aggregate modality-specific encoders and merge shared LLM weights (either by simple averaging or with task-driven adaptive weighting), producing a single model capable of zero-shot inference across any union of input modalities without further joint training (Chen et al., 20 Feb 2024). DAMC’s decoupled architecture maintains modality-specific and shared weights, supporting extensibility and robust cross-modal generalization. On MCUB, DAMC achieves up to 60% accuracy on four-modality tasks, outperforming single-modality and non-composed baselines.
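
A toy sketch of the weight-merging step in this style of composition follows: shared LLM parameters from independently trained checkpoints are averaged (optionally with adaptive weights, as in DAMC) while modality-specific encoders are kept separate. Key matching and weighting are simplified assumptions.

```python
# A toy sketch of NaiveMC-style weight merging, assuming PyTorch: shared LLM
# parameters from independently trained checkpoints are averaged (optionally
# with adaptive weights, as in DAMC); key matching is simplified here.
import torch

def merge_shared_llm_weights(state_dicts, weights=None):
    # state_dicts: list of LLM state_dicts with identical keys and shapes
    n = len(state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n  # uniform averaging by default
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```
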

Mixture-of-Experts (MoE) systems such as SMAR (Soft Modality-Aware Routing) deploy a learned, symmetric KL-divergence penalty on expert router distributions to flexibly dissociate or overlap expert utilization by modality, preserving language-only task performance (86.6% retention at only 2.5% pure-text data) while retaining strong multimodal abilities (Xia et al., 6 Jun 2025). These innovations allow scalable, efficient, and modular expansion to new modalities.
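
An illustrative symmetric-KL penalty between per-modality expert routing distributions, sketched from the description above; the exact SMAR formulation (and any learned coefficients) may differ.

```python
# An illustrative symmetric-KL penalty between per-modality expert routing
# distributions, sketched from the description above; the exact SMAR
# formulation (and any learned coefficients) may differ.
import torch
import torch.nn.functional as F

def router_modality_penalty(text_router_logits, image_router_logits, eps=1e-8):
    # *_router_logits: (n_tokens, n_experts) gating logits for each modality
    p = F.softmax(text_router_logits, dim=-1).mean(dim=0)   # avg routing dist. (text)
    q = F.softmax(image_router_logits, dim=-1).mean(dim=0)  # avg routing dist. (image)
    kl_pq = torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)))
    kl_qp = torch.sum(q * (torch.log(q + eps) - torch.log(p + eps)))
    return 0.5 * (kl_pq + kl_qp)  # added to the task loss with a tunable coefficient
```
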

6. Persistent Challenges and Research Frontiers

Despite rapid progress, several critical limitations remain:

  • Data quality and scaling: High-quality, diverse multimodal data—especially for less-resourced modalities—is scarce. Overfitting to language priors, hallucinations, and lack of explicit modality grounding remain widespread (Ghatkesar et al., 8 May 2025, Wang et al., 2 Aug 2024).
  • Alignment to human cognition: Current contrastively-trained VLMs exhibit inferior alignment to experiential grounding and brain activation patterns relative to supervised LMs (Bavaresco et al., 1 Apr 2025).
  • Relational/functional word semantics: MLMs fail to robustly ground spatial, logical, and quantificational function words, which is vital for compositional reasoning (Sonkar et al., 2022).
  • Scalable, modular, and continual learning: Efficient expansion to new modalities and tasks, combined with preservation of earlier capabilities, requires new parameter-isolation, merging, and continual learning strategies (Chen et al., 20 Feb 2024, Zhang et al., 24 Jan 2024).

Research priorities include integrating objectives that target brain-relevant experiential structure and cross-modal alignment, advancing model composition frameworks, and constructing datasets that explicitly probe multimodal reasoning beyond surface alignment (e.g., spatial, logical, 3D categories) (Bavaresco et al., 1 Apr 2025, Ghatkesar et al., 8 May 2025, Sonkar et al., 2022).

