Multimodal Large Language Models
- Multimodal Large Language Models are architectures that process and generate data across diverse modalities such as text, images, audio, and video.
- They combine pretrained modality encoders with large language models via learnable connectors to achieve tasks like generation, retrieval, and visual grounding.
- Emergent capabilities include zero-shot reasoning, in-context learning, and domain-specific applications, positioning MLLMs as key to embodied AI systems.
A multimodal LLM (MLLM) is a model architecture designed to process and generate data across multiple modalities, such as text, images, audio, video, and structured data, by building upon the powerful world knowledge and reasoning abilities of LLMs. MLLMs unify disparate sensory and linguistic representations, leveraging modular architectures and novel training paradigms to perform highly varied downstream tasks including generation, retrieval, visual understanding, grounding, reasoning, and interaction across domains. These models exhibit emergent capabilities and form a foundational research direction towards more general-purpose and embodied artificial intelligence systems.
1. Architectural Foundations and Model Formulation
A typical MLLM architecture consists of three core modules:
- Pretrained Modality Encoder: Specialized encoders for modalities such as images, video, or audio transform raw sensory input into dense feature representations.
- Pretrained LLM: An LLM, extensively trained on textual corpora, provides the reasoning, world knowledge, and generation backbone.
- Modality Interface (Connector): A learnable module aligns modality-specific features to the input space of the LLM. Connectors may implement token-level fusion (such as query tokens or projection layers) or feature-level fusion (such as inserting cross-attention layers directly into the LLM transformer stack).
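For concreteness, the following is a minimal PyTorch-style sketch of token-level fusion, combining learnable query tokens with a linear projection into the LLM embedding space; the module name, dimensions, and head count are illustrative assumptions rather than the design of any specific system.

```python
import torch
import torch.nn as nn

class ProjectionConnector(nn.Module):
    """Token-level fusion: map frozen vision features into the LLM's embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096, num_query_tokens=32):
        super().__init__()
        # Learnable query tokens that summarize visual features (a simplified Q-Former-like idea).
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        # Linear projection into the LLM token embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats):            # (B, num_patches, vision_dim)
        B = vision_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.cross_attn(queries, vision_feats, vision_feats)
        return self.proj(fused)                 # (B, num_query_tokens, llm_dim)

# Usage: the projected "visual tokens" are concatenated with text embeddings and fed to the
# (usually frozen or partially frozen) LLM, e.g.
#   visual_tokens = connector(vision_encoder(images))
#   inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
```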
The overall computational flow is often formalized as

$$\mathcal{R} = f(\mathcal{I}, \mathcal{M};\, \theta),$$

with training objective

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p\!\left(\mathcal{R}_i \mid \mathcal{I}, \mathcal{M}, \mathcal{R}_{<i};\, \theta\right),$$

where $\mathcal{I}$ represents instructions (if any), $\mathcal{M}$ the multimodal input, and $\mathcal{R}$ the response or caption.
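In practice this objective is realized as next-token cross-entropy computed only over response tokens; below is a minimal sketch of that masking, assuming logits produced by the LLM and a binary response mask (the function name and shapes are illustrative).

```python
import torch.nn.functional as F

def mllm_loss(logits, input_ids, response_mask):
    """Autoregressive loss -sum_i log p(R_i | I, M, R_<i).

    logits:        (B, T, V) next-token predictions from the LLM
    input_ids:     (B, T) full sequence (instruction + visual placeholders + response)
    response_mask: (B, T) 1 where the token belongs to the response, else 0
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:].bool()

    # Tokens outside the response (instruction, image tokens) are ignored in the loss.
    labels = shift_labels.masked_fill(~shift_mask, -100)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```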
2. Training Paradigms and Methodologies
MLLMs are typically trained in distinct sequential stages, each targeting different aspects of alignment and reasoning:
- Pre-training: The model is exposed to large-scale, paired datasets (e.g., image–text pairs from CC or LAION) to learn initial cross-modal associations through next-token prediction.
- Instruction Tuning: The model is fine-tuned using datasets structured as (instruction, multimodal input, response) triplets (see the data-format sketch at the end of this subsection). This enhances the model’s ability to follow user commands in both narrowly and broadly defined tasks.
- Alignment Tuning: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), or similar strategies are used to further tune model outputs for factual correctness, alignment with human intent, and a reduced hallucination rate.
These stages have proven critical to unlocking “emergent” model behaviors that surpass traditional vision-language models (2306.13549) and support advanced compositional, generative, and interactive tasks.
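As a concrete illustration of the instruction-tuning data format, the following shows a hypothetical serialization of one (instruction, multimodal input, response) triplet into a training prompt; the field names and template are assumptions for illustration, since concrete systems define their own chat formats.

```python
# One (instruction, multimodal input, response) triplet, here for an image.
sample = {
    "image": "images/000123.jpg",
    "instruction": "Describe the chart and state the largest category.",
    "response": "The bar chart compares quarterly sales; Q3 is the largest category.",
}

# A simple prompt template: the <image> placeholder is later replaced by the
# connector's visual tokens before the sequence reaches the LLM.
PROMPT_TEMPLATE = "USER: <image>\n{instruction}\nASSISTANT: {response}"

def build_prompt(sample):
    return PROMPT_TEMPLATE.format(
        instruction=sample["instruction"],
        response=sample["response"],
    )

print(build_prompt(sample))
```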
3. Emergent Capabilities and Expanding Modalities
MLLMs demonstrate several emergent capabilities arising from the integration of large-scale LLMs and cross-modal alignments:
- Generative Reasoning: The ability to produce extended narratives, perform zero-shot, OCR-free mathematical reasoning over images, and synthesize new outputs given multimodal cues.
- Fine-grained Control and Grounding: Recent designs enable region-level grounding, phrase-level alignment (e.g., via Markdown-style links of text spans to image bounding boxes (2306.14824)), and pixel/region specificity in visual responses.
- Support for Additional Modalities and Languages: Architectures now extend to video, audio, 3D data, and multilingual understanding, utilizing adaptable connectors and unified embedding spaces. Some systems are designed for multi-to-multi generation (i.e., any modality to any modality) (2401.13601).
- Domain and Scenario Extensions: MLLMs find application in specialized areas such as medical imaging, document parsing, GUI interactions, and embodied robotics, often by leveraging techniques like multimodal in-context learning (M-ICL), chain-of-thought (M-CoT), or LLM-aided visual reasoning (LAVR).
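To make the prompting techniques above concrete, here is a hedged sketch of how multimodal in-context examples (M-ICL) and an M-CoT reasoning cue might be assembled at inference time; the placeholder syntax and helper function are illustrative assumptions, not a fixed specification.

```python
# Multimodal in-context learning (M-ICL): prepend a few solved examples,
# then append the query; an M-CoT cue asks for intermediate reasoning steps.
few_shot_examples = [
    {"image": "demo1.jpg", "question": "How many dogs are shown?", "answer": "Two."},
    {"image": "demo2.jpg", "question": "What color is the car?", "answer": "Red."},
]

def build_micl_prompt(examples, query_image, query_question, use_cot=True):
    parts = []
    for ex in examples:
        parts.append(f"<image:{ex['image']}>\nQ: {ex['question']}\nA: {ex['answer']}")
    cot_cue = "Let's reason step by step." if use_cot else ""
    parts.append(f"<image:{query_image}>\nQ: {query_question}\nA: {cot_cue}")
    return "\n\n".join(parts)

print(build_micl_prompt(few_shot_examples, "query.jpg",
                        "Which object is closest to the camera?"))
```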
4. Evaluation, Benchmarks, and Performance Considerations
Assessment of MLLMs is conducted across a diverse landscape of tasks and datasets, reflecting their broad utility:
Task Domain | Representative Benchmarks | Typical Metrics |
---|---|---|
Visual Understanding | VQAv2, GQA, VizWiz, TextVQA | Accuracy, VQA score
Visual Grounding | RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities | Recall@K, Acc@IoU
Generation and Editing | COCO | FID, CIDEr, CLIP/DINO similarity
Reasoning/Instruction | ScienceQA, MathVista, SEED-Bench | Accuracy
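For visual grounding, Acc@IoU scores a prediction as correct when its box overlaps the ground-truth box above a threshold (commonly 0.5); the following is a minimal sketch, assuming boxes in (x1, y1, x2, y2) pixel format.

```python
def iou(box_a, box_b):
    """Intersection-over-union for two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def acc_at_iou(predictions, ground_truths, threshold=0.5):
    """Acc@IoU: fraction of predicted boxes whose IoU with the ground truth exceeds the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / max(len(predictions), 1)

# Example: one hit (IoU > 0.5) out of two predictions -> 0.5
print(acc_at_iou([[10, 10, 50, 50], [0, 0, 5, 5]],
                 [[12, 12, 48, 52], [100, 100, 120, 120]]))
```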
The design trade-offs between performance and computational requirements are significant. For instance, heavier models such as Flamingo, PaLI, or IDEFICS may demand orders of magnitude more GPU hours than lighter models (e.g., LLaVA, BLIP-2), while parameter-efficient fine-tuning schemes (LoRA, Prefix-Tuning) and high-resolution processing recipes offer practical paths to improved performance under tight resource budgets (2401.13601, 2402.12451).
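As an illustration of parameter-efficient fine-tuning, here is a minimal sketch of a LoRA-style adapter wrapping a frozen linear layer; the rank and scaling values are illustrative assumptions.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank trainable update: W x + (B A) x * scale."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # the update starts at zero, preserving the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale

# Usage: wrap selected projection layers of the LLM or connector and train only the
# low-rank matrices, a small fraction of the full parameter count.
layer = LoRALinear(nn.Linear(4096, 4096))
```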
5. System Extensions, Challenges, and Model Composition
The field has advanced towards several system-level innovations:
- Model Composition and Merging: Instead of exclusive joint training, some approaches merge multiple expert MLLMs (with different modalities or domains) using parameter decoupling and adaptive adjustment. Techniques such as DAMC (Decoupled and Adjusted Model Composition) (2402.12750) and optimization-based weight merging (2505.19892) enable “zero-shot” integration of new modalities, fostering more flexible system expansion.
- Extended Reasoning: MLLMs support inference-time prompt engineering, including multimodal in-context learning and explicit reasoning trace generation (M-CoT). LLM-aided visual reasoning allows dynamic invocation of external tools (e.g., segmentation, OCR) at generation time.
- Grounding and Spatial Alignment: Methods like those in Kosmos-2 (2306.14824) discretize spatial features (bounding boxes) as tokens, linking natural language expressions directly to image regions and facilitating visually grounded applications.
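The following is a hedged sketch of the general idea of discretizing box coordinates into location tokens that a language model can emit inline with text; the bin count and token naming are illustrative assumptions rather than the exact Kosmos-2 scheme.

```python
def box_to_location_tokens(box, image_w, image_h, num_bins=32):
    """Quantize a (x1, y1, x2, y2) box into discrete location tokens.

    Each corner coordinate is mapped to one of `num_bins` bins per axis, so a box
    becomes a short token sequence the language model can generate or attend to.
    """
    x1, y1, x2, y2 = box

    def bin_of(value, size):
        return min(int(value / size * num_bins), num_bins - 1)

    return [
        f"<loc_{bin_of(x1, image_w)}>", f"<loc_{bin_of(y1, image_h)}>",
        f"<loc_{bin_of(x2, image_w)}>", f"<loc_{bin_of(y2, image_h)}>",
    ]

# e.g. ['<loc_6>', '<loc_10>', '<loc_20>', '<loc_28>'], which can be spliced into
# text such as "A dog <loc_6><loc_10><loc_20><loc_28> is lying on the grass."
print(box_to_location_tokens((120, 260, 410, 700), image_w=640, image_h=800))
```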
However, open challenges remain, including:
- Long-context handling (processing videos or documents with many visual components)
- Robustness against multimodal hallucinations (incorrect or unfounded content)
- Instruction following across complex, multi-step tasks
- Efficient continual learning and catastrophic forgetting avoidance
- Realization of safe, deployable systems—especially for high-stakes and resource-constrained contexts
6. Applications and Future Research Directions
Practical use cases for MLLMs are rapidly expanding:
- Assistive Tools: Visual dialog systems, document and image understanding, multimodal search interfaces, and smart digital avatars
- Specialized Reasoning: Medical imaging, bioimage analysis (2407.19778), and cross-modal recommendation systems (2408.09698)
- Interactive Agents and Robotics: Embodied intelligence with perception, planning, and control grounded in both language and the environment
The research community anticipates continued advances towards:
- Integration of more diverse modalities (e.g., sensor data, speech, 3D, physiological signals)
- More robust multimodal retrieval-augmented generation frameworks
- Efficient, modular architectures suited for edge deployment and continual model evolution
- Improved alignment, safety, and fairness through advanced instruction tuning, RLHF, and model monitoring
- The development of universal “Omni-language” or “OMM” models capable of seamless cross-modal understanding and communication (2505.19892).
The field maintains collaborative growth via frequently updated open-source resources, including comprehensive benchmarks and tracking repositories (e.g., https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models (2306.13549), https://mm-LLMs.github.io (2401.13601)).
7. Summary Table: Representative MLLM Components and Innovations
Component | Typical Choices | Notable Innovations |
---|---|---|
Modality Encoder | Pretrained ViT, CLIP, specialized audio/video encoders | Mixture-of-Expert routing, Q-Former |
Connector/Adapter | Linear projection, cross-attention, query tokens | Transformer Q-Former, Visual Merger |
LLM Backbone | Text models (LLaMA, OPT, T5), sometimes partially frozen | MoE specialization, language retention |
Output Decoder | LLM generative head, diffusion model, task-specific heads | Bilingual/multilingual data, OCR handling |
Training Strategy | MM pre-training, instruction/alignment tuning, RLHF | Data blending, instruction augmentation |
Benchmark/Scenario | VQA, grounding, captioning, retrieval, document parsing, robotics | Unified evaluation, model merging |
MLLMs combine foundational LLM capabilities with cross-modal awareness, driving advances towards human-level multimodal understanding, flexible task adaptation, and scalable deployment in both academic and industrial settings.