This paper, "Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models" (Li et al., 8 May 2025 ), provides a comprehensive overview of the evolution and current state of Large Multimodal Reasoning Models (LMRMs). It structures the field's development into a four-stage roadmap, emphasizing the shift from perception-driven, modular systems to unified, language-centric frameworks, and finally, towards a vision of natively multimodal, agentic models. The survey analyzes over 540 publications, discussing key technical challenges, capabilities, and future directions, grounded in experimental insights.
The paper proposes a four-stage roadmap for LMRMs:
- Perception-Driven Modular Reasoning: Early approaches (pre-Transformer era) decomposed multimodal tasks into distinct modules for representation, alignment, fusion, and reasoning. Models like Neural Module Networks (NMN) [andreas2016neural] and those using variations of attention (e.g., MAC [hudson2018compositional], MCAN [yu2019deep]) and memory networks (e.g., DMN [xiong2016dynamic], HeteroMemory [DBLP:conf/cvpr/FanZZW0H19]) focused on task-specific visual reasoning. With the advent of Transformers and large-scale pretraining, Vision-Language Models (VLMs) emerged, unifying representation, alignment, and fusion and improving implicit reasoning on tasks like VQA [DBLP:conf/cvpr/GoyalKSBP17] and visual grounding [lai2024lisa]. These VLMs included dual-encoder architectures (e.g., CLIP [DBLP:conf/icml/RadfordKHRGASAM21]), single-transformer backbones (e.g., UNITER [DBLP:conf/eccv/ChenLYK0G0020], OFA [wang2022ofa]), and Vision-Encoder-LLM models (e.g., LLaVA [DBLP:conf/nips/LiuLWL23a]). While these models improved perception and feature integration, their reasoning remained largely implicit, limited in depth and generalization, and often tied to a classification-based paradigm.
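To make the dual-encoder paradigm concrete, the following is a minimal sketch of CLIP-style contrastive alignment between image and text features. The tiny linear "encoders", feature dimensions, and temperature initialization are illustrative assumptions, not the configuration of any cited model.

```python
# Minimal sketch of dual-encoder contrastive alignment (CLIP-style).
# Encoder stubs and dimensions are placeholders, not a specific model's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for a vision backbone and a text backbone.
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature, initialized near log(1/0.07) as in contrastive setups.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        # Project both modalities into a shared space and L2-normalize.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise cosine similarities scaled by temperature.
        return self.logit_scale.exp() * img @ txt.t()

def contrastive_loss(logits):
    # Matched image/text pairs lie on the diagonal of the logit matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    model = DualEncoder()
    images = torch.randn(8, 512)  # placeholder image features
    texts = torch.randn(8, 512)   # placeholder text features
    print(contrastive_loss(model(images, texts)).item())
```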
- Language-Centric Short Reasoning (System-1 Reasoning): The rise of Multimodal LLMs (MLLMs) shifted the paradigm to end-to-end, language-centric frameworks. This stage is characterized by the emergence of Multimodal Chain-of-Thought (MCoT), which transforms implicit reasoning into explicit, intermediate steps generated by the LLM component.
- Prompt-based MCoT: These methods use carefully crafted prompts (few-shot or zero-shot) to elicit step-by-step reasoning. Examples include "see-think-confirm" for visual reasoning [chen2023see], temporal reasoning prompts for videos [himakunthala2023let], and task-specific prompts for domains like autonomous driving [luo2024pkrd] or GUI navigation [tang2023cotdet]. These offer interpretability with minimal training; a minimal prompt sketch appears after this stage's items below.
- Structural Reasoning: This involves supervised training to learn explicit procedural structures for multimodal reasoning. Approaches include learning rationales (e.g., Multimodal-CoT [zhang2023multimodal]), adapting defined textual procedures (e.g., breaking tasks into perception and decision stages [gao2024cantor], or multi-phase processes [luan2024textcot]), and incorporating modality-specific structures (e.g., region-based grounding [liu2024chain], text-guided semantic enrichment [chen2023shikra], or embodied reasoning chains [zawalski2024robotic]). These methods aim for more standardized and reliable reasoning.
- Externally Augmented Reasoning: This paradigm enhances reasoning by integrating external algorithms, tools, or expert modules. This includes search algorithm-enhanced MCoT (e.g., using DFS/BFS [gomez2023mmtot] or graph-based search [yao2023HoT]) to explore reasoning paths, leveraging external textual tools (e.g., program generation [gupta2023visual], dynamic tool orchestration [ke2024hydra], or generating intermediate images [meng2023chain]), incorporating Retrieval-Augmented Generation (RAG) (e.g., for fact-checking [khaliq2024ragar], information retrieval [pan2024chain], or using knowledge graphs [mondal2024kam]), and employing specialized multimodal tools or experts (e.g., visual encoders [yao2023beyond] or scene graph generators [mitra2024compositional]). These methods enable more flexible and scalable reasoning, especially for tasks requiring external knowledge or precise grounding. However, reasoning at this stage often remains short and reactive, characteristic of System-1 processing, struggling with long-horizon planning and abstract compositionality.
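The sketch below illustrates the prompt-based MCoT idea from this stage, with an optional retrieval hook in the spirit of the externally augmented methods above. The `call_mllm` and `retrieve_facts` callables are hypothetical stand-ins for an MLLM API and a retriever, and the see-think-confirm template is a loose paraphrase rather than the prompt used by any cited work.

```python
# Illustrative sketch of prompt-based MCoT with an optional RAG-style hook.
# `call_mllm` and `retrieve_facts` are hypothetical stand-ins, not real library calls.
from typing import Callable, List, Optional

MCOT_TEMPLATE = (
    "You are answering a question about the attached image.\n"
    "Follow a see-think-confirm style:\n"
    "1. See: describe the relevant visual evidence.\n"
    "2. Think: reason step by step over that evidence.\n"
    "3. Confirm: state the final answer on a line starting with 'Answer:'.\n\n"
    "Question: {question}\n"
)

def build_mcot_prompt(question: str, retrieved: Optional[List[str]] = None) -> str:
    """Zero-shot MCoT prompt; optionally prepend retrieved facts (RAG-style)."""
    context = ""
    if retrieved:
        context = "Background facts:\n" + "\n".join(f"- {f}" for f in retrieved) + "\n\n"
    return context + MCOT_TEMPLATE.format(question=question)

def answer_with_mcot(question: str, image_path: str,
                     call_mllm: Callable[[str, str], str],
                     retrieve_facts: Optional[Callable[[str], List[str]]] = None) -> str:
    facts = retrieve_facts(question) if retrieve_facts else None
    prompt = build_mcot_prompt(question, facts)
    raw = call_mllm(prompt, image_path)  # model returns a see/think/confirm trace
    # Extract the confirmed answer from the generated reasoning trace.
    for line in raw.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return raw.strip()
```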
- Language-Centric Long Reasoning (System-2 Thinking and Planning): To address the limitations of short, reactive reasoning, research has moved towards more deliberate, compositional, System-2-inspired processes.
- Cross-Modal Reasoning: This focuses on dynamic integration and reasoning across multiple modalities. Methods utilize external tools and algorithms (e.g., program execution [choudhury2023zero], dynamic tool orchestration [gao2023assistgpt], interleaving visual/textual steps based on algorithms [sun2024visual]), or enhance model-intrinsic capabilities to generate/infer multimodal information (e.g., fine-tuning on multimodal CoT data [DBLP:conf/aaai/WangHHXLLS24], refining visual-textual representations [li2025imagine]).
- Multimodal-O1: Inspired by models like OpenAI's o1, these models enhance reasoning through CoT fine-tuning and test-time scaling strategies. They often adopt multi-stage reasoning structures (e.g., Summary, Caption, Thinking, Answer) and use planning algorithms such as Beam Search or Monte Carlo Tree Search (MCTS) (e.g., Marco-o1 [DBLP:journals/corr/abs-2411-14405], llamaberry [zhang2024llamaberry]) to explore better reasoning paths.
- Multimodal-R1: This emerging paradigm leverages reinforcement learning, particularly DPO [yu2024rlhf] and GRPO [DBLP:journals/corr/abs-2501-12948], to improve the reasoning capability of MLLMs. By training models with preference data or multimodal feedback, these methods enhance reasoning depth, coherence, and domain adaptability (e.g., for mathematical problems [chen2025r1v], visual grounding/detection [shen2025vlmr1], or video/audio tasks [feng2025video, zhao2025r1omni]).
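As a concrete illustration of the GRPO-style training signal mentioned above, the sketch below computes group-relative advantages from per-response rewards and applies a clipped policy-gradient surrogate. It is a minimal sketch: the KL regularization to a reference policy and sequence-level log-probability aggregation used in practice are omitted, and rewards and shapes are illustrative.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style RL:
# rewards are normalized within each group of sampled responses per prompt.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective with per-response advantages (KL term omitted)."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize -> minimize negative

if __name__ == "__main__":
    # Two prompts, four sampled responses each; rewards could be answer correctness.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])
    adv = group_relative_advantages(rewards)
    logp_old = torch.randn(2, 4)
    logp_new = logp_old + 0.05 * torch.randn(2, 4)
    print(clipped_surrogate(logp_new, logp_old, adv))
```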
- Towards Native Large Multimodal Reasoning Model (Prospect): The paper envisions a future paradigm shift towards N-LMRMs, which are natively designed for unified multimodal understanding, generation, and agentic reasoning across any modality. Current LMRMs, despite their progress, remain limited by language-centric retrofitting, a predominant focus on vision and language, and underdeveloped interactive and long-horizon reasoning in dynamic environments.
- Experimental Findings: Evaluation on omni-modal benchmarks (e.g., OmniMMI [wang2025omnimmi]) and agent benchmarks (e.g., OSWorld [DBLP:conf/nips/XieZCLZCHCSLLXZ24], VisualWebArena [koh2024visualwebarena]) reveals significant limitations in effectively processing diverse modalities and performing complex, interactive tasks, even for models like GPT-4o and Gemini-1.5-Pro. Case studies with OpenAI o3 and o4-mini [openai2025o3o4] show improved multimodal CoT and tool use but highlight issues like interference between modalities, struggles with file/multimedia handling, and fabricated reasoning.
- Capabilities of N-LMRMs: N-LMRMs are envisioned with Multimodal Agentic Reasoning (proactive, goal-driven interaction, hierarchical planning, real-time adaptation, embodied learning) and Omni-Modal Understanding and Generative Reasoning (unified representation for heterogeneous data fusion, contextual multimodal generation, modality-agnostic inference). This involves exploring agentic models (e.g., R1-Searcher [song2025r1], Magma [yang2025magma]) and omni-modal models (e.g., M2-omni [guo2025m2], MiniCPM-o [team2025minicpm]), though current models only explore subsets of these capabilities.
- Technical Prospects: Building N-LMRMs requires addressing fundamental challenges: designing unified architectures for seamless cross-modal fusion (potentially using MoE [chen2024octavius]), enabling interleaved multimodal long CoT for dynamic test-time scaling across modalities, and developing systems that learn and evolve from real-world experiences through continuous interaction and potentially online reinforcement learning [qin2025ui]. High-quality data synthesis pipelines are also crucial for training, extending beyond current unimodal/cross-modal focuses to multimodal interaction chains and multi-tool coordination.
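To illustrate the MoE-based fusion idea raised in the technical prospects, below is a minimal top-k token-routing mixture-of-experts layer. The expert count, dimensions, and the dense loop-over-experts dispatch are simplifying assumptions for readability, not the design of any cited system.

```python
# Minimal sketch of a top-k token-routing MoE layer, as sometimes proposed
# for cross-modal fusion. Configuration values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, k=2, hidden=512):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens):  # tokens: (batch, seq, dim), any modality's tokens
        gate_logits = self.router(tokens)                 # (B, S, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Dense loop over experts for clarity; real systems dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                                  # where expert e was selected
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True) # routing weight for expert e
                out = out + w * expert(tokens)
        return out

if __name__ == "__main__":
    layer = TopKMoE()
    fused = layer(torch.randn(2, 10, 256))  # e.g., concatenated image+text tokens
    print(fused.shape)
```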
The survey also includes a detailed categorization of multimodal datasets and benchmarks, organized by the capabilities they evaluate: Understanding (Visual-centric, Audio-centric), Generation (Cross-modal, Joint Multimodal), Reasoning (General Visual, Domain-Specific), and Planning (GUI, Embodied/Simulated). It also summarizes common evaluation methods, including Exact/Fuzzy Match, Option Matching, LLM/MLLM Scoring, and Agentic Evaluation.
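For the simpler evaluation modes listed above (exact match, fuzzy match, option matching), a few illustrative helpers are sketched below; the normalization rules and the 0.8 fuzzy threshold are assumptions, not the survey's official protocol.

```python
# Illustrative helpers for exact match, fuzzy match, and option matching.
# Normalization and thresholds are assumptions for demonstration only.
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

def option_match(pred: str, options: dict) -> str:
    """Map a free-form prediction to a multiple-choice letter (e.g. 'A'-'D')."""
    m = re.search(r"\b([A-D])\b", pred.upper())
    if m and m.group(1) in options:
        return m.group(1)
    for letter, text in options.items():  # fall back to matching the option text
        if fuzzy_match(pred, text):
            return letter
    return ""

if __name__ == "__main__":
    print(exact_match("Eiffel Tower", "eiffel tower."))                       # True
    print(option_match("I think the answer is (B).", {"A": "a cat", "B": "a dog"}))  # B
```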
In conclusion, the paper charts the progress of LMRMs from modular perception to language-centric reasoning, highlighting the increasing importance of explicit, structured, and externally augmented reasoning. It identifies key remaining challenges, particularly in visual-centric long reasoning and interactive multimodal reasoning. The survey proposes Native LMRMs as a future direction, emphasizing unified omni-modal perception, interactive generative reasoning, and agentic behavior learned from world experiences, pointing towards foundational research needed to achieve truly adaptive and comprehensive AI systems.