Multi-modal Large Language Models
- Multi-modal LLMs are models that combine natural language with visual data using unified architectures like Transformer-based language models paired with visual encoders.
- They employ techniques such as Q-Former, cross-attention, and adapter alignment to merge modality-specific features into a common embedding space for joint reasoning.
- Research shows that advanced prompting, including chain-of-thought, improves performance especially in closed-source models, while open-source variants still face significant alignment challenges.
Multi-modal LLMs (MLLMs) are foundation models designed to jointly process natural language and other sensory modalities—most commonly visual information such as images or video—within a unified architecture. By integrating language and perception, MLLMs aim to exhibit emergent abilities in complex reasoning, scene understanding, and instruction following that extend the capabilities of text-only LLMs. This article reviews the architectural foundations, alignment and training techniques, evaluated abilities and limitations, applications, performance bottlenecks, and emerging research directions for MLLMs, drawing on recent empirical, methodological, and survey literature.
1. Foundational Principles and Model Architectures
Modern MLLMs are built atop large pre-trained LLMs (typically Transformer-based architectures such as LLaMA, Vicuna, or GPT-4), paired with powerful visual encoders (e.g., CLIP ViT, EVA ViT, or other Vision Transformer and CNN backbones) to extract modality-specific features. A vision-to-language adapter bridges the visual and textual representations, projecting dense visual embeddings into the LM’s embedding space for cross-modal fusion (Caffagni et al., 19 Feb 2024). Several adapter designs are in common use:
- Linear projections/MLPs: Simple mappings to align feature spaces.
- Q-Former: A Transformer-based module with learnable query tokens and cross-attention to extract visual content suitable for language reasoning (as in BLIP-2).
- Cross-attention modules: Directly augmenting the LM with vision-aware attention, sometimes with gating to interpolate between unimodal and multimodal paths (as in Flamingo).
- Perceiver-based compression: Downsampling visual tokens for efficient processing.
The canonical MLLM autoregressively generates tokens conditioned jointly on preceding text and visual inputs, factorizing the output sequence as $P(\mathbf{y} \mid \mathbf{V}, \mathbf{T}) = \prod_{t} P(y_t \mid y_{<t}, \mathbf{V}, \mathbf{T})$, where $\mathbf{V}$ is the appropriately aligned visual embedding and $\mathbf{T}$ the textual input.
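As a concrete illustration of this conditioning, the following is a minimal sketch of a linear-projection adapter that maps frozen vision-encoder features into the LM embedding space and prepends them to the text token embeddings. All module names, dimensions, and shapes are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

class VisionToLanguageAdapter(nn.Module):
    """Minimal linear-projection adapter (hypothetical dimensions)."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # A single linear map; an MLP or Q-Former variant would be a drop-in replacement.
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim)
        return self.proj(vision_feats)  # -> (batch, num_visual_tokens, lm_dim)

def build_multimodal_input(vision_feats, text_token_embeds, adapter):
    """Prepend aligned visual embeddings V to text embeddings T, so the LM can
    model P(y_t | y_<t, V, T) autoregressively over the fused sequence."""
    visual_embeds = adapter(vision_feats)
    return torch.cat([visual_embeds, text_token_embeds], dim=1)

# Illustrative usage with random tensors standing in for real encoder outputs.
adapter = VisionToLanguageAdapter()
vision_feats = torch.randn(2, 256, 1024)   # e.g., ViT patch features
text_embeds = torch.randn(2, 32, 4096)     # LM token embeddings for the prompt
fused = build_multimodal_input(vision_feats, text_embeds, adapter)
print(fused.shape)  # torch.Size([2, 288, 4096])
```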
Recent advances in context handling extend the input window for long images or videos via pooling, merging, or position ID sharing (She et al., 26 Jun 2024). Architectural modularity—e.g., Mixture of Experts (MoE) layers specialized for different modalities—is increasingly used for both scaling efficiency and specialization (Han et al., 29 May 2025). For high-resolution images or videos, dedicated token compression modules such as FOLDER and VisToG are employed to reduce computational burdens by aggressively merging visual tokens post-encoding while preserving semantic detail (Huang et al., 26 Nov 2024, Wang et al., 5 Jan 2025).
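The token-compression idea can be sketched as simple average-merging of adjacent visual tokens after encoding. This is only a schematic stand-in for methods such as FOLDER or VisToG, whose actual merging criteria (e.g., semantic grouping) are more sophisticated; group size and shapes here are assumptions.

```python
import torch

def merge_visual_tokens(visual_tokens: torch.Tensor, group_size: int = 4) -> torch.Tensor:
    """Naive post-encoding compression: average every `group_size` adjacent
    visual tokens, shrinking the sequence the LM must attend over.
    visual_tokens: (batch, num_tokens, dim); num_tokens must divide evenly."""
    b, n, d = visual_tokens.shape
    assert n % group_size == 0, "pad or crop tokens so they divide evenly"
    return visual_tokens.view(b, n // group_size, group_size, d).mean(dim=2)

tokens = torch.randn(2, 256, 1024)           # e.g., a 16x16 patch grid from a ViT
compressed = merge_visual_tokens(tokens, 4)  # 256 -> 64 tokens fed to the LM
print(compressed.shape)                      # torch.Size([2, 64, 1024])
```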
2. Alignment and Training Paradigms
A central challenge in MLLMs is aligning heterogeneous and high-dimensional visual features with the text-centric latent space of the LLM (Caffagni et al., 19 Feb 2024). Key strategies:
- Contrastively pre-trained vision encoders: CLIP-style objectives to encourage image-text pairing in a joint embedding space (Carolan et al., 28 Mar 2024); see the loss sketch after this list.
- Adapter alignment: Tuning adapters (linear, Q-Former, or cross-attention blocks) to project vision features into the LM space.
- Instruction tuning: Fine-tuning on curated vision-language instructions (e.g., LLaVA-Instruct, LRV-Instruct) using autoregressive cross-entropy loss, with or without parameter-efficient fine-tuning methods such as LoRA or prompt tuning.
- Two-stage training: Adapter alignment with large-scale image-text pairs followed by instruction tuning with higher quality conversation data (Caffagni et al., 19 Feb 2024).
- Chain-of-Thought (CoT) prompting: Encouraging the model to explicitly decompose complex tasks (e.g., “Let’s think step by step”) improves multi-hop reasoning in closed-source models (Ahrabian et al., 22 Jan 2024, Han et al., 29 May 2025).
- Self-supervised and cross-modal masked modeling: Masked prediction objectives over both text and visual tokens to align representational spaces (Liang et al., 9 Nov 2024).
- Reinforcement learning from human feedback (RLHF) and variants (including Direct Preference Optimization) are applied in some domains to encourage human-aligned generation across modalities (Han et al., 29 May 2025).
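To make the contrastive-alignment bullet concrete, the sketch below implements a symmetric CLIP-style contrastive objective over a batch of paired image and text embeddings. The temperature value and embedding shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature: float = 0.07):
    """Symmetric InfoNCE loss: matched image-text pairs (the diagonal of the
    similarity matrix) are pulled together, mismatched pairs pushed apart."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image
    return (loss_i2t + loss_t2i) / 2

# Illustrative usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```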
3. Evaluated Abilities and Limitations
3.1 Nonverbal and Abstract Reasoning
A systematic and quantitative evaluation on nonverbal abstract reasoning tasks, such as variants of Raven’s Progressive Matrices (RPM), reveals that most current MLLMs—both pre-trained and instruction-tuned—struggle to surpass random or majority guessing baselines (Ahrabian et al., 22 Jan 2024). For example, on the IQ50 RPM variant, open-source MLLMs generally score within about ±8 percentage points of random chance. Even when closed-source models such as GPT-4V demonstrate modestly higher performance, only about 26% of their answers are coherently reasoned and logically sound.
A representative mathematical heuristic for RPM completion involves additive and subtractive compositionality of visual features, e.g., predicting the missing panel's features as $f(x_{\text{missing}}) \approx w_1 f(x_A) + w_2 f(x_B) - w_3 f(x_C)$ with weights $w_1$, $w_2$, $w_3$; current models fail to robustly implement this sort of visual algebra.
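A toy implementation of this heuristic might represent each panel as a feature vector, predict the missing panel as a weighted additive/subtractive combination of context panels, and rank candidate answers by distance to that prediction. The specific panels, weights, and distance metric below are illustrative assumptions, not the cited paper's exact formulation.

```python
import torch

def score_rpm_candidates(panel_feats, candidate_feats, weights=(1.0, 1.0, 1.0)):
    """Toy RPM heuristic: predict the missing panel's features as a weighted
    additive/subtractive combination of context panels, then pick the candidate
    closest to that prediction.
    panel_feats: dict of feature vectors for three context panels (assumed keys).
    candidate_feats: (num_candidates, dim) tensor of answer-option features."""
    w1, w2, w3 = weights
    predicted = w1 * panel_feats["row_a"] + w2 * panel_feats["row_b"] - w3 * panel_feats["col_c"]
    distances = torch.norm(candidate_feats - predicted, dim=-1)
    return distances.argmin().item()  # index of the best-matching candidate

# Illustrative usage with random feature vectors.
panels = {k: torch.randn(128) for k in ("row_a", "row_b", "col_c")}
candidates = torch.randn(6, 128)
print(score_rpm_candidates(panels, candidates))
```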
CoT and explicit guided prompts can double closed-source model accuracy, and in some settings raise it even further, but open-source models do not reliably benefit from additional context or CoT demonstrations.
3.2 Visual and Textual Perception
MLLMs often hallucinate about rotation, shading, and fine shape distinctions, indicating brittle visual parsing. When the puzzles are instead rendered as human-written text descriptions (i.e., presented purely in the language modality), MLLMs (especially open-source ones) still show almost no nonverbal reasoning, underscoring how errors propagate from visual parsing into language-level reasoning.
Manual inspection shows verbose, descriptive outputs rather than logically structured reasoning; “unfaithful” explanations (incorrect rationales for correct answers) are common.
3.3 Scaling Laws and Model Size
Empirical scaling curves—benchmarking accuracy against model parameter count—indicate nonlinear returns: larger models do not uniformly outperform smaller ones for visual abstract reasoning. Superior alignment and multimodal instruction tuning in closed-source models have greater impact than raw scale (Ahrabian et al., 22 Jan 2024).
4. Performance Differentials: Open-Source vs. Closed-Source Models
Closed-source MLLMs (e.g., GPT-4V, Gemini-Pro-Vision) exhibit distinctly stronger multi-modal reasoning abilities, generate more logically coherent rationales, and show marked improvements with CoT and corrective hints, while most open-source models stagnate near baseline (Ahrabian et al., 22 Jan 2024). In detailed manual evaluation, open-source models rarely produce coherent, correct answers with faithful justifications even after instruction tuning. Nevertheless, absolute performance levels in all models remain well below those of skilled humans.
5. Methods for Improvement: Prompting, Correction, and In-Context Learning
Several methods for boosting MLLM performance on challenging tasks have been systematically evaluated (Ahrabian et al., 22 Jan 2024):
- Chain-of-Thought (CoT) prompting: In settings where prompts or demonstration shots explicitly decompose the problem (“Let’s think step by step”), closed-source models exhibit up to 100% improvement in some nonverbal tasks. Open-source models do not reliably respond to the same treatment.
- Guided and Corrective Prompting: General and sample-specific hints, or feedback after an initial response, yield notable gains—especially in closed-source models.
- Symmetrical and Asymmetrical In-Context Learning: Presenting coherent multi-modal or text-only CoT demonstrations boosts closed-source models, though open-source models fail to consistently utilize such contexts.
These prompting strategies also improve reasoning “faithfulness”—alignment between the answer and its rationale.
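The strategies above can be expressed as simple prompt templates, sketched below. The wording is illustrative and not taken verbatim from the cited evaluations.

```python
# Illustrative prompt templates for CoT, guided, and corrective prompting
# (wording is assumed, not the exact prompts used in the cited work).

COT_PROMPT = (
    "Here is an abstract reasoning puzzle with a missing panel and six options.\n"
    "Let's think step by step: describe the pattern along each row and column, "
    "then state which option completes it and why."
)

GUIDED_PROMPT = (
    "Hint: focus on how shape count and shading change across each row before answering."
)

def corrective_prompt(previous_answer: str) -> str:
    """Second-turn corrective feedback after an initial (possibly wrong) response."""
    return (
        f"Your previous answer was: {previous_answer}. "
        "Re-examine the row-wise pattern, check whether your rationale actually "
        "supports that choice, and revise the answer if needed."
    )

print(corrective_prompt("Option 3"))
```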
6. Challenges, Bottlenecks, and Current Research Directions
6.1 Visual and Textual Alignment Limits
Serious bottlenecks remain in the perceptual granularity and logical integration of visual features with textual reasoning (Ahrabian et al., 22 Jan 2024). Error sources include poor feature extraction, lack of compositional inductive bias, and propagation of ambiguous or verbose textual outputs that restate but do not solve perceptual puzzles.
6.2 Benchmark Coverage
Contemporary benchmarks (RPM-style matrices, IQ50, RAVEN-S) expose salient defects in perceptual reasoning and semantic alignment—motivating broader coverage (e.g., scene depth, spatial transformations, and multi-hop hybrid reasoning) in future evaluations.
6.3 Scalability and Efficiency
Scaling up visual inputs (high-resolution images, video) or long contexts quickly leads to computational bottlenecks. Solutions include visual token merging, efficient adapters, and context-aware compression to manage hardware and memory constraints while minimizing information loss (Huang et al., 26 Nov 2024, Wang et al., 5 Jan 2025).
7. Outlook: Future Directions and Implications
Findings to date suggest several critical research trajectories:
- Understanding Error Sources: Analysis of attention weights, internal representations, and generation biases to locate weak reasoning steps (Ahrabian et al., 22 Jan 2024).
- Alignment and Prompt Engineering: Advanced prompt orchestration (including “self-talk” and multi-step correction) to close reasoning gaps.
- Comprehensive, Holistic Benchmarking: Broadened evaluations encompassing abstract, compositional, and real-world visual reasoning abilities.
- Closing the Open/Closed-Source Gap: Open-source MLLMs require improved perceptual alignment and richer instruction tuning; simply scaling model size or dataset volume is insufficient.
- Faithfulness and Trust: Focusing on both accuracy and the consistency/faithfulness of generated rationales is paramount for the development of robust, trustworthy MLLMs (Chou et al., 7 Oct 2024).
Overall, current MLLMs—especially open-source variants—struggle with tasks that require precise, abstract, and integrated use of visual and language modalities. Closed-source models benefit disproportionately from advanced prompting and guided learning strategies but have not yet achieved parity with human-level nonverbal abstract reasoning. Near-term progress will depend on improved perceptual alignment, new evaluation paradigms, targeted architectural innovation, and holistic approaches to model introspection and error mitigation.