Large Multimodal Models (LMMs) Overview

Updated 4 September 2025
  • Large Multimodal Models (LMMs) are AI systems that jointly process vision, language, and audio cues to enable unified perception and reasoning.
  • They fuse large pretrained language models with visual encoders, excelling in tasks like visual Q&A, spatial reasoning, and context-aware dialogue.
  • Emerging research focuses on unified evaluation, tool-augmented architectures, and model compression to overcome integration and efficiency challenges.

Large Multimodal Models (LMMs) are a class of AI models that integrate and jointly process information across multiple modalities—primarily vision and language but also audio and structured sensory inputs—to enable generalist perception, reasoning, and instruction-following capabilities. Modern LMMs are built by fusing large pretrained LLMs with powerful visual encoders, and are typically aligned for instruction following via large-scale multimodal instruction tuning. They have demonstrated proficiency in complex tasks such as visual question answering, image/text-based reasoning, open-domain dialogue, visual joke explanation, and mathematical problem solving, among others. The field has rapidly advanced, with emerging research focusing on integrated capability evaluation, architecture generalization, multilingual reasoning, retrieval-augmented generation, model efficiency, and robust real-world deployment.

1. Core Vision–Language Capabilities and Their Integration

A fundamental lens for understanding LMMs is their decomposition into core vision–language (VL) capabilities and the requirement for their integration on real-world tasks. MM-Vet (Yu et al., 2023) identifies six basic VL capabilities:

  • Recognition (Rec): Visual identification and classification of objects, scenes, and attributes, including counting.
  • Knowledge (Know): Leveraging both visual commonsense and encyclopedic (including time-sensitive) knowledge.
  • Optical Character Recognition (OCR): Reading and interpreting scene or document text.
  • Spatial Awareness (Spat): Understanding spatial relationships and orientations among visual elements.
  • Language Generation (Gen): Producing coherent and contextually appropriate textual responses, including long-form generation.
  • Math: Solving arithmetic and numerically driven problems by interpreting visual cues.

Crucially, advanced benchmarks examine not isolated capabilities, but their integration—modeling up to 16 fused skill combinations relevant to real-world open-ended tasks. Examples include explaining visual jokes (Recognition + Knowledge + Generation), solving visually grounded math problems requiring OCR and spatial awareness, and reading diagrams for reasoning over spatial and mathematical context.
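To make capability integration concrete, the short Python sketch below tags a few hypothetical samples with the capabilities they require and counts the distinct combinations; the questions and tag names are invented for illustration and do not reproduce MM-Vet's actual annotation format.

```python
from collections import Counter

# Hypothetical samples, each tagged with the VL capabilities it requires.
# Tags mirror the six MM-Vet categories; the questions are invented.
samples = [
    {"question": "Explain why this meme is funny.",        "capabilities": {"rec", "know", "gen"}},
    {"question": "What is the total on this receipt?",     "capabilities": {"ocr", "math"}},
    {"question": "Which object is left of the red chair?", "capabilities": {"rec", "spat"}},
]

# Each distinct capability set defines one integrated skill combination.
combo_counts = Counter(frozenset(s["capabilities"]) for s in samples)
for combo, count in combo_counts.items():
    print(sorted(combo), "->", count, "sample(s)")
```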

2. Unified Evaluation Frameworks and Methodologies

The diverse nature of LMM outputs—ranging from short factual answers to free-form explanations—has motivated the development of unified evaluation metrics. MM-Vet introduces an LLM-based evaluator (implemented with GPT-4) that uses a few-shot prompt to calibrate scoring across answer styles. Evaluator details include:

  • Scoring (0–1) is assigned per sample based on semantic alignment with ground truth, using criteria such as full or partial element coverage (“<AND>”, “<OR>” parsing).
  • Overall and capability-specific scores are computed as:

S = \frac{1}{N} \sum_{i=1}^{N} s_i \times 100\%, \qquad S_c = \frac{1}{N_c} \sum_{i \in C} s_i \times 100\%

where s_i is the evaluator score and N, N_c are the sample counts overall and for capability (combination) C, respectively.

  • This LLM-based approach provides a consistent metric that accommodates open-ended, variable-length answers—necessary for fair benchmarking of complex, integrated multimodal outputs; a minimal sketch of the score aggregation follows this list.
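As a concrete illustration of the aggregation above, the sketch below assumes that per-sample scores in [0, 1] have already been produced by an LLM judge (the judge call itself is omitted); the record layout and function names are hypothetical.

```python
from typing import Iterable

def overall_score(scores: Iterable[float]) -> float:
    """S = (1/N) * sum(s_i) * 100%, for per-sample scores s_i in [0, 1]."""
    scores = list(scores)
    return 100.0 * sum(scores) / len(scores)

def capability_score(samples: list[dict], capability: str) -> float:
    """S_c, averaged over the subset of samples requiring one capability."""
    subset = [s["score"] for s in samples if capability in s["capabilities"]]
    return 100.0 * sum(subset) / len(subset)

# Hypothetical per-sample records: an LLM judge (e.g. GPT-4 with a few-shot
# rubric) is assumed to have already assigned each score in [0, 1].
samples = [
    {"score": 1.0, "capabilities": {"rec", "know", "gen"}},
    {"score": 0.4, "capabilities": {"ocr", "math"}},
    {"score": 0.0, "capabilities": {"rec", "spat"}},
]

print(f"Overall S = {overall_score(s['score'] for s in samples):.1f}%")  # 46.7%
print(f"OCR S_c   = {capability_score(samples, 'ocr'):.1f}%")            # 40.0%
```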

Additionally, capability-wise and ablation-based analyses on public benchmarks (e.g., ScienceQA, COCO, RefCOCOg) allow the dissection of strengths and weaknesses by modality and task integration.

3. System Paradigms and Comparative Insights

Two main LMM paradigms dominate current literature:

  • End-to-End Tuned LMMs: Models like LLaVA, BLIP-2, InstructBLIP, and MiniGPT-4 rely on dense joint alignment of a vision encoder with an LLM. They excel at recognition and language generation but are typically less adept at leveraging specialized tools for tasks such as OCR or mathematical computation.
  • LLM-Tool-Using (Agent-Based) Systems: Methods like MM-ReAct and Transformers Agent incorporate explicit tool use and external APIs (e.g., OCR engines, dense captioning, math solvers) at inference time, facilitating strong performance in OCR-intensive or numerically grounded tasks.
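The schematic Python sketch below illustrates the general tool-using pattern; the tools, their canned outputs, and the routing prompt are placeholders, and real agent systems such as MM-ReAct let the LLM drive tool selection and fold observations back into its reasoning far more flexibly.

```python
# Schematic tool-augmented inference loop. The tools and the routing prompt
# are placeholders, not the actual MM-ReAct implementation.

def ocr_tool(image) -> str:
    return "TOTAL: $42.17"                 # stand-in for a real OCR engine

def caption_tool(image) -> str:
    return "a receipt on a wooden table"   # stand-in for dense captioning

TOOLS = {"ocr": ocr_tool, "caption": caption_tool}

def answer(question: str, image, llm) -> str:
    # 1. Ask the LLM which tool (if any) would help with this question.
    tool_name = llm(f"Which tool in {list(TOOLS)} helps answer: {question}? "
                    "Reply with one name or 'none'.").strip()
    # 2. Run the chosen tool and record its observation.
    observation = TOOLS[tool_name](image) if tool_name in TOOLS else ""
    # 3. Let the LLM produce the final answer conditioned on the observation.
    return llm(f"Question: {question}\nTool output: {observation}\nAnswer:")

if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:      # stand-in for a real LLM call
        return "ocr" if "Which tool" in prompt else "The total is $42.17."
    print(answer("What is the total on this receipt?", image=None, llm=fake_llm))
```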

Empirical findings include:

  • The strength of the LLM backbone is a principal determinant of integrated performance, especially for long-form generation and complex reasoning.
  • High-resolution vision encoders and large-scale pretraining data significantly boost recognition and spatial reasoning.
  • Tool-augmented approaches outperform monolithic (end-to-end) models on integration-reliant tasks (notably, spatial reasoning and math), while end-to-end models retain relative efficiency.
  • Even state-of-the-art models such as GPT-4V reach only ~68% overall on MM-Vet, underscoring persistent challenges in integrated capability fusion.

4. Specialized Domains: Multilingual, Mathematical, and Open-World Reasoning

Recent research expands LMM applicability to challenging domains:

  • Multilingual Multimodal Reasoning: Benchmarks like M4U (Wang et al., 24 May 2024) reveal that leading LMMs (GPT-4o, InstructBLIP) achieve only modest accuracy (max ~47.6%) on rigorous, discipline-diverse, and cross-lingual question sets containing both image and text. Models exhibit strong language preferences and marked degradation in cross-lingual cases (e.g., visual content in Chinese, question in German).
  • Multimodal Math (CMM-Math): Datasets such as CMM-Math (Liu et al., 4 Sep 2024) unveil limitations in current LMMs, especially on high school-level and geometry-heavy problems. Custom models (Math-LMMs) using interleaved image–text fusion and multi-stage math-specific fine-tuning significantly outperform standard LMMs on both Chinese and English multimodal math benchmarks.
  • Open-World Image Classification: LMMs can natively perform classification beyond closed-label taxonomies (Conti et al., 27 Mar 2025), outputting natural language class names. However, detailed metric analyses (Text Inclusion, Llama Inclusion, Semantic/Concept Similarity) indicate that LMMs are often generic in their predictions, struggle with fine-grained distinctions, and benefit from tailored domain prompts and chain-of-thought reasoning.
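As a rough illustration of the inclusion- and similarity-style metrics mentioned above, the sketch below uses exact substring matching and a surface-level string similarity as stand-ins; the paper's Llama Inclusion and Semantic/Concept Similarity metrics instead rely on an LLM judge and embedding models, respectively.

```python
import difflib

def text_inclusion(prediction: str, ground_truth: str) -> bool:
    """Is the ground-truth class name contained (case-insensitively)
    in the model's free-form prediction?"""
    return ground_truth.lower() in prediction.lower()

def surface_similarity(prediction: str, ground_truth: str) -> float:
    """Crude character-level similarity, a stand-in for the embedding-based
    semantic/concept similarity used in the cited evaluation."""
    return difflib.SequenceMatcher(None, prediction.lower(),
                                   ground_truth.lower()).ratio()

print(text_inclusion("a photo of a golden retriever puppy", "golden retriever"))   # True
print(round(surface_similarity("a small terrier-like dog", "yorkshire terrier"), 2))
```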

5. Architectural Innovations and Model Adaptation

Addressing core bottlenecks and scaling requirements, LMM research has developed:

  • Context-Aware and Retrieval-Augmented Components: CaMML (Chen et al., 6 Jan 2024) integrates a scalable hierarchical context perceiver for efficient retrieval and encoding of lengthy, multimodal context examples, substantially improving grounding, disambiguation, and reducing hallucination.
  • Decoupled Perception/Decoding: Lumen (Jiao et al., 12 Mar 2024) decouples perception into a task-agnostic stage (shared fine-grained vision–language alignment) followed by task-specific decoding, markedly boosting performance on dense vision-centric tasks while offering efficient adaptation across object detection, segmentation, and pose estimation.
  • Object Detection and Spatial Reasoning: LMM-Det (Li et al., 24 Jul 2025) shifts from reliance on external detectors to data-driven label augmentation and prompt engineering, enabling LMMs to attain competitive recall and average precision (AP) on detection benchmarks without peripheral modules.
  • Model Compression and Adaptation: For edge deployment, adaptive layer-wise sparsity and KV-cache quantization methods (Zhang et al., 28 Jul 2025) deliver high compression ratios with minimal performance drop. Continual instruction-tuning protocols (He et al., 2023) mitigate catastrophic forgetting through data replay, model expansion, and task-similarity-informed regularization schemes—key for lifelong LMM deployment.
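As one concrete (and heavily simplified) example of the compression direction, the sketch below applies magnitude pruning with a per-layer sparsity schedule; the layer names and ratios are invented, and the cited work selects its layer-wise sparsity adaptively rather than by hand.

```python
import numpy as np

def prune_layer(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights in one layer."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Hypothetical per-layer sparsity schedule: prune the vision tower harder
# than the language backbone (adaptive methods choose such ratios per layer).
model = {"vision.block0": np.random.randn(256, 256),
         "llm.block0":    np.random.randn(256, 256)}
schedule = {"vision.block0": 0.6, "llm.block0": 0.3}

for name, weights in model.items():
    pruned = prune_layer(weights, schedule[name])
    print(name, "achieved sparsity:", round(float((pruned == 0).mean()), 2))
```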

6. Current Limitations, Evaluation Challenges, and Open Problems

Despite rapid progress, several technical challenges persist:

  • Incomplete Integration: Overall multimodal integration remains suboptimal, with persistent gaps between specialized tool-enabled models and monolithic LMMs, especially in reasoning and math-intensive scenarios (Yu et al., 2023).
  • Catastrophic Forgetting: Continual instruction learning is vulnerable to rapid forgetting of earlier tasks, especially when not preceded by multi-task joint tuning (He et al., 2023).
  • Efficiency and Scalability: Compressing LMMs for edge and real-time applications is hampered by memory and compute constraints, driving advances in adaptive compression algorithms (Zhang et al., 28 Jul 2025).
  • Evaluation Coverage: Many benchmarks inadequately reflect the complexity of downstream use, especially for tasks requiring robust grounding, spatial reasoning, multi-image inference, or multilingual context (cf. MMR (Chen et al., 26 Aug 2024), CoCoT (Zhang et al., 5 Jan 2024), MMKC-Bench (2505.19509)).
  • Trust and Explainability: The opaque internal representations of LMMs motivate ongoing development of concept-based dictionary learning and grounding approaches for interpretability (Parekh et al., 12 Jun 2024).
  • Uncertainty Quantification: Model-agnostic frameworks such as Uncertainty-o (Zhang et al., 9 Jun 2025) are emerging to measure and leverage semantic uncertainty in LMM responses, relevant for downstream hallucination detection and safe AI deployment.
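A minimal sketch of sampling-based uncertainty estimation is shown below; it clusters sampled answers by naive string normalization and reports the entropy over clusters, whereas frameworks such as Uncertainty-o assess semantic equivalence with far stronger machinery, so this is only a schematic of the idea.

```python
import math
from collections import Counter

def semantic_entropy(responses: list[str]) -> float:
    """Entropy over clusters of sampled answers (clusters formed here by
    naive normalization; real methods cluster by semantic equivalence)."""
    clusters = Counter(r.strip().lower().rstrip(".") for r in responses)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

# Hypothetical answers sampled from an LMM at nonzero temperature.
confident = ["A red stop sign.", "a red stop sign", "A red stop sign."]
uncertain = ["A stop sign.", "A yield sign.", "Some kind of traffic light."]

print(round(semantic_entropy(confident), 3))  # 0.0    -> low uncertainty
print(round(semantic_entropy(uncertain), 3))  # ~1.099 -> high uncertainty
```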

7. Future Directions and Open Resources

Key directions include:

  • More Expressive, Integrated Architectures: Expanding hierarchical, modular, and decoupled perception layers to better handle densely interleaved, long-context, and cross-modal tasks.
  • Generalist–Specialist Bridging: Fine-tuning the balance between generalist reasoning, efficient integration of external tools, and retrieval-augmented grounding.
  • Robust Lifelong and Multilingual Learning: Enhancing anti-forgetting strategies and better leveraging cross-lingual and domain adaptation to widen LMM accessibility (Lupascu et al., 8 Feb 2025).
  • Real-World Conflict and Safety: Addressing multimodal knowledge conflicts, prioritizing external evidence integration, and standardizing conflict benchmarks to bolster RAG system reliability (2505.19509).
  • Resource and Code Availability: Comprehensive open-sourcing of models, code, benchmarks, and data curation pipelines (e.g., xGen-MM/BLIP-3 (Xue et al., 16 Aug 2024), CaMML, MMKC-Bench) accelerates reproducibility and iterative research across the discipline.

In conclusion, LMMs represent the convergence of vision, language, and reasoning components into highly capable, integrative AI systems. While their baseline abilities are now well established, targeted research into integrated evaluation, efficient scaling, robust adaptation, and trustworthy deployment remains at the forefront of the field.