Multimodal Large Language Models (MLLMs)

Last updated: June 11, 2025

This article covers Multimodal LLMs (MLLMs), with all statements sourced from the survey "A Survey on Multimodal Large Language Models" (Yin et al., 2023).


Multimodal LLMs (MLLMs): Architecture, Capabilities, Training, and Future Directions

1. Basic Formulation and Concepts

Architecture:

A canonical Multimodal LLM (MLLM) employs a modular design, integrating three foundational components:

  • Modality Encoder: Converts raw data from each modality (e.g., images, audio) into compact, semantically rich representations. Vision encoders typically use pre-trained models such as CLIP or ConvNeXt, and analogous state-of-the-art encoders are used for other modalities such as audio and video.
  • LLM: Functions as the central reasoning engine, with prevalent architectures including LLaMA, Flan-T5, Vicuna, and Qwen. The LLM interprets the instruction and the multimodal context and generates language-centric outputs.
  • Modality Connector (“Connector”): Acts as an interface bridging each encoder’s output to the LLM input space. It can be implemented at the token level (e.g., Q-Former, MLP projection) or the feature level (e.g., cross-attention modules).

Architecture Flow Diagram:

[Input Modality] → [Modality Encoder] → [Connector] → [LLM] 
       → [Output (text or other modalities)]
                     ↓ (optional)
             [Modality Generator]
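
The minimal sketch below illustrates this encoder → connector → LLM flow in code. The dimensions, module names (e.g., `MLPConnector`), and the random tensor standing in for a frozen CLIP-style vision encoder are illustrative assumptions, not components specified in the survey.

```python
# Sketch of the encoder -> connector -> LLM flow; all sizes are illustrative.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Token-level connector: projects visual tokens into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_tokens)

# Stand-ins for a frozen vision encoder output and an LLM embedding table.
vision_dim, llm_dim, vocab = 1024, 4096, 32000
connector = MLPConnector(vision_dim, llm_dim)
text_embed = nn.Embedding(vocab, llm_dim)

visual_tokens = torch.randn(1, 256, vision_dim)      # 256 patch tokens for one image
instruction_ids = torch.randint(0, vocab, (1, 32))   # tokenized instruction

# Project visual tokens and prepend them to the instruction embeddings;
# the concatenated sequence is what the LLM consumes.
inputs_embeds = torch.cat(
    [connector(visual_tokens), text_embed(instruction_ids)], dim=1
)
print(inputs_embeds.shape)  # torch.Size([1, 288, 4096])
```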

Core Mathematical Formulation:

The MLLM instruction-following stage is modeled as

$$\mathcal{A} = f(\mathcal{I}, \mathcal{M}; \theta)$$

where:

  • $\mathcal{I}$: input instruction,
  • $\mathcal{M}$: multimodal input,
  • $\mathcal{A}$: predicted answer,
  • $\theta$: model parameters.

The autoregressive training objective for generating output tokens is

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(\mathcal{R}_i \mid \mathcal{I}, \mathcal{R}_{<i}; \theta)$$

where $\mathcal{R}_i$ is the $i$-th response token and $N$ is the response length.
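
In code, this objective is an ordinary next-token cross-entropy taken only over response tokens, with instruction positions masked out. The sketch below is illustrative; the shapes and the `-100` ignore value are conventional assumptions, not prescribed by the survey.

```python
# Loss over response tokens only, matching the objective above.
import torch
import torch.nn.functional as F

vocab = 32000
logits = torch.randn(1, 10, vocab)          # model outputs for a 10-token sequence
labels = torch.randint(0, vocab, (1, 10))   # ground-truth token ids
labels[:, :4] = -100                        # first 4 positions are instruction tokens: excluded

# Standard next-token shift: position t predicts token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = labels[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss.item())  # mean negative log-likelihood over response tokens only
```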

Training Strategy:

Typical MLLM training proceeds in three stages (a parameter-freezing sketch follows this list):

  • Pre-training: Modality encoders and the LLM are aligned on very large, generally weakly supervised multimodal datasets (e.g., LAION-5B, COYO-700M). The goal is to learn robust cross-modal representations.
  • Instruction Tuning: The model is further trained on curated multimodal instruction datasets—covering tasks such as visual question answering (VQA), captioning, and step-wise reasoning—so that it robustly follows instructions formatted with structured templates.
  • Alignment Tuning: To improve helpfulness and reduce hallucinations, this stage uses techniques such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), optimizing the model for factuality, safety, and user intent.
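
As a rough illustration of how the stages differ in practice, the sketch below encodes one common parameter-freezing schedule (connector-only alignment during pre-training, LLM unfrozen afterwards). The specific freeze/unfreeze choices are an assumption and vary across models; they are not a prescription from the survey.

```python
# Hedged sketch: which parameter groups are trainable in each stage.
import torch.nn as nn

STAGES = {
    "pretraining":        {"encoder": False, "connector": True, "llm": False},
    "instruction_tuning": {"encoder": False, "connector": True, "llm": True},
    "alignment_tuning":   {"encoder": False, "connector": True, "llm": True},
}

def set_trainable(parts: dict[str, nn.Module], stage: str) -> None:
    """Freeze or unfreeze each component according to the stage table."""
    for name, module in parts.items():
        for p in module.parameters():
            p.requires_grad = STAGES[stage][name]

# Toy stand-ins for the three components.
parts = {"encoder": nn.Linear(8, 8), "connector": nn.Linear(8, 8), "llm": nn.Linear(8, 8)}
set_trainable(parts, "pretraining")
print({k: next(m.parameters()).requires_grad for k, m in parts.items()})
# {'encoder': False, 'connector': True, 'llm': False}
```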

Data:

  • Pre-training datasets: Broad, massive, and potentially noisy paired multimodal data (e.g., LAION-5B) for cross-modal alignment.
  • Instruction-tuning datasets: Curated, diverse, and high-quality, often covering VQA, visual reasoning, and captioning, structured via templates (see the sketch after this list).
  • Alignment-tuning datasets: Output pairs scored by human or model feedback (e.g., from GPT-4V), capturing aspects such as helpfulness, factuality, and safety.
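
To make "structured via templates" concrete, here is a hedged sketch of how a VQA pair might be converted into an instruction-tuning record. The LLaVA-style `USER:`/`ASSISTANT:` format and the `<image>` placeholder are illustrative assumptions; exact templates differ across models.

```python
# Illustrative instruction-tuning template; the format string is an assumption.
def build_sample(question: str, answer: str, image_token: str = "<image>") -> dict:
    prompt = f"USER: {image_token}\n{question}\nASSISTANT:"
    return {"prompt": prompt, "response": f" {answer}"}

sample = build_sample(
    question="How many people are in the picture?",
    answer="There are three people standing near the fountain.",
)
print(sample["prompt"] + sample["response"])
```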

Evaluation:

MLLMs are benchmarked on both closed-set tasks (fixed datasets such as MME, MMBench, and MathVista) and open-set, dialogue-based interactive tasks, measuring accuracy as well as broader emergent abilities.


2. Emergent Capabilities

MLLMs demonstrate novel, emergent capabilities that distinguish them from previous multimodal systems, including:

  • Open-ended image-based storytelling and description.
  • Mathematical reasoning on images without explicit OCR.
  • Interpretation and explanation of visual humor, memes, and abstract relations.
  • Direct code or web layout generation from scene images.
  • Multi-turn, coherent dialogue grounded in complex visual scenes; multi-step, factual, and compositional reasoning ° across modalities.

Comparison to Traditional Multimodal Models:

Prior approaches (e.g., CLIP, OFA, BLIP) typically used discriminative or specialized generative architectures for specific tasks (classification, retrieval, captioning), lacked large-scale instruction alignment, and struggled with generality, compositionality, and creative cross-task adaptation. MLLMs leverage large-scale LLM reasoning and multi-stage cross-modal alignment, enabling zero-shot/few-shot learning, open-set task adaptation, and creative synthesis not previously achievable.


3. Research Topics and Extensions

Granularity:

MLLMs are moving from global (whole-image) features to region-level (bounding-box) and even pixel-level (point, mask, sketch) grounding. This granularity enables more localized, precise, and context-sensitive visual reasoning, as seen in models like Shikra, Osprey, Ferret, and LISA.

Modalities:

Beyond vision-language, new research integrates video, audio, 3D point clouds, and other data types. Models like NExT-GPT demonstrate flexible multi-modal input/output, handling arbitrary mixes such as image-text-video-audio.

Languages:

Efforts focus on multilingual support (e.g., Qwen-VL for English/Chinese), enabling MLLMs to generalize or transfer across languages even with limited non-English data.

Scenarios/Extensions:

MLLMs are being adapted beyond general vision-language tasks to specialized real-world scenarios and domain-specific applications.

Key Advanced Techniques:

  • Multimodal In-Context Learning (M-ICL): Enables adaptation to new tasks from a handful of demonstrations at inference time, via templated prompts (see the prompt-assembly sketch after this list).
  • Multimodal Chain-of-Thought (M-CoT): Explicit stepwise reasoning in multimodal contexts; enhances trustworthiness and interpretability.
  • LLM-Aided Visual Reasoning (LAVR): LLMs orchestrate specialized visual tools or submodels for complex compositional reasoning, as seen in HuggingGPT, MM-REACT, Chameleon, and VisProg.
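
For illustration, the sketch below assembles a multimodal in-context prompt from a few demonstrations, as referenced in the M-ICL item above. The `<image>` placeholder and the exact formatting are illustrative assumptions rather than a template from the survey.

```python
# Assemble an M-ICL prompt: a few (question, answer) demonstrations, then the query.
def build_micl_prompt(demos: list[tuple[str, str]], query_question: str) -> str:
    parts = []
    for question, answer in demos:
        parts.append(f"<image> Question: {question} Answer: {answer}")
    parts.append(f"<image> Question: {query_question} Answer:")
    return "\n\n".join(parts)

prompt = build_micl_prompt(
    demos=[
        ("What color is the bus?", "The bus is red."),
        ("Is the street wet?", "Yes, it appears to have rained recently."),
    ],
    query_question="How many traffic lights are visible?",
)
print(prompt)
```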

4. Challenges and Future Directions

Open Problems:

  • Scalability: MLLMs struggle with long-context reasoning (e.g., long documents, videos), limiting holistic analysis.
  • Instruction Generalization: Open-source models lag GPT-4V in following nuanced, diverse instructions.
  • Reasoning Mechanisms: While M-ICL and M-CoT show promise, deep, robust multimodal reasoning requires further research in data, scale, and architecture.
  • Embodied AI: Achieving robust, real-world interactive or control agents demands advances across perception, reasoning, planning, and execution.
  • Robustness: Models remain vulnerable to adversarial prompts and hallucinations; safe deployment requires better alignment and validation across diverse real-world contexts.

5. Additional Resources

A curated, publicly maintained repository tracking the latest MLLM research, datasets, and benchmarks is available at Awesome-Multimodal-Large-Language-Models. This resource is regularly updated and serves as a valuable hub for practitioners and researchers alike.


Conclusion

The surveyed literature presents Multimodal LLMs as a transformative foundation for future AI. By combining flexible, modular architectures, advanced cross-modal alignment, and large-scale instruction tuning, MLLMs achieve emergent abilities that represent a step change over traditional multimodal methods. Systematic benchmarking and ongoing improvements in training strategy, architecture, and safety will be critical as the field advances toward general, robust, and trustworthy multimodal intelligence (Yin et al., 2023).


[Reference: Yin et al., 2023, "A Survey on Multimodal Large Language Models", arXiv preprint.]