This paper, "Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models" (Li et al., 8 May 2025 ), provides a comprehensive overview of the evolution and current state of Large Multimodal Reasoning Models (LMRMs). It structures the field's development into a four-stage roadmap, emphasizing the shift from perception-driven, modular systems to unified, language-centric frameworks, and finally, towards a vision of natively multimodal, agentic models. The survey analyzes over 540 publications, discussing key technical challenges, capabilities, and future directions, grounded in experimental insights.
The paper proposes a four-stage roadmap for LMRMs:
- Perception-Driven Modular Reasoning: Early approaches (pre-Transformer era) decomposed multimodal tasks into distinct modules for representation, alignment, fusion, and reasoning. Models like Neural Module Networks (NMN) [andreas2016neural] and those using variations of attention (e.g., MAC [hudson2018compositional], MCAN [yu2019deep]) and memory networks (e.g., DMN [xiong2016dynamic], HeteroMemory [DBLP:conf/cvpr/FanZZW0H19]) focused on task-specific visual reasoning. With the advent of Transformers and large-scale pretraining, Vision-Language Models (VLMs) emerged, unifying representation, alignment, and fusion and improving implicit reasoning on tasks like VQA [DBLP:conf/cvpr/GoyalKSBP17] and visual grounding [lai2024lisa]. These VLMs included dual-encoder architectures (e.g., CLIP [DBLP:conf/icml/RadfordKHRGASAM21]), single-transformer backbones (e.g., UNITER [DBLP:conf/eccv/ChenLYK0G0020], OFA [wang2022ofa]), and Vision-Encoder-LLM models (e.g., LLaVA [DBLP:conf/nips/LiuLWL23a]). While these models improved perception and feature integration, their reasoning remained largely implicit, limited in depth and generalization, and often tied to a classification-based paradigm.
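To make the dual-encoder paradigm concrete, the following is a minimal sketch of CLIP-style contrastive alignment between image and text features. The tiny linear "encoders", feature dimensions, and temperature initialization are illustrative assumptions, not the configuration of any cited model.

```python
# Minimal sketch of dual-encoder contrastive alignment (CLIP-style).
# Encoder stubs and dimensions are placeholders, not a specific model's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for a vision backbone and a text backbone.
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature, initialized near log(1/0.07) as in contrastive setups.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        # Project both modalities into a shared space and L2-normalize.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise cosine similarities scaled by temperature.
        return self.logit_scale.exp() * img @ txt.t()

def contrastive_loss(logits):
    # Matched image/text pairs lie on the diagonal of the logit matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    model = DualEncoder()
    images = torch.randn(8, 512)  # placeholder image features
    texts = torch.randn(8, 512)   # placeholder text features
    print(contrastive_loss(model(images, texts)).item())
```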
- Language-Centric Short Reasoning (System-1 Reasoning): The rise of Multimodal LLMs (MLLMs) shifted the paradigm to end-to-end, language-centric frameworks. This stage is characterized by the emergence of Multimodal Chain-of-Thought (MCoT), which transforms implicit reasoning into explicit, intermediate steps generated by the LLM component.
- Prompt-based MCoT: These methods use carefully crafted prompts (few-shot or zero-shot) to elicit step-by-step reasoning. Examples include "see-think-confirm" for visual reasoning [chen2023see], temporal reasoning prompts for videos [himakunthala2023let], and task-specific prompts for domains like autonomous driving [luo2024pkrd] or GUI navigation [tang2023cotdet]. These offer interpretability with minimal training; a minimal prompt sketch appears after this stage's items below.
- Structural Reasoning: This involves supervised training to learn explicit procedural structures for multimodal reasoning. Approaches include learning rationales (e.g., Multimodal-CoT [zhang2023multimodal]), adapting defined textual procedures (e.g., breaking tasks into perception and decision stages [gao2024cantor], or multi-phase processes [luan2024textcot]), and incorporating modality-specific structures (e.g., region-based grounding [liu2024chain], text-guided semantic enrichment [chen2023shikra], or embodied reasoning chains [zawalski2024robotic]). These methods aim for more standardized and reliable reasoning.
- Externally Augmented Reasoning: This paradigm enhances reasoning by integrating external algorithms, tools, or expert modules. This includes search algorithm-enhanced MCoT (e.g., using DFS/BFS [gomez2023mmtot] or graph-based search [yao2023HoT]) to explore reasoning paths, leveraging external textual tools (e.g., program generation [gupta2023visual], dynamic tool orchestration [ke2024hydra], or generating intermediate images [meng2023chain]), incorporating Retrieval-Augmented Generation (RAG) (e.g., for fact-checking [khaliq2024ragar], information retrieval [pan2024chain], or using knowledge graphs [mondal2024kam]), and employing specialized multimodal tools or experts (e.g., visual encoders [yao2023beyond] or scene graph generators [mitra2024compositional]). These methods enable more flexible and scalable reasoning, especially for tasks requiring external knowledge or precise grounding. However, reasoning at this stage often remains short and reactive, characteristic of System-1 processing, struggling with long-horizon planning and abstract compositionality.
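The sketch below illustrates the prompt-based MCoT idea from this stage, with an optional retrieval hook in the spirit of the externally augmented methods above. The `call_mllm` and `retrieve_facts` callables are hypothetical stand-ins for an MLLM API and a retriever, and the see-think-confirm template is a loose paraphrase rather than the prompt used by any cited work.

```python
# Illustrative sketch of prompt-based MCoT with an optional RAG-style hook.
# `call_mllm` and `retrieve_facts` are hypothetical stand-ins, not real library calls.
from typing import Callable, List, Optional

MCOT_TEMPLATE = (
    "You are answering a question about the attached image.\n"
    "Follow a see-think-confirm style:\n"
    "1. See: describe the relevant visual evidence.\n"
    "2. Think: reason step by step over that evidence.\n"
    "3. Confirm: state the final answer on a line starting with 'Answer:'.\n\n"
    "Question: {question}\n"
)

def build_mcot_prompt(question: str, retrieved: Optional[List[str]] = None) -> str:
    """Zero-shot MCoT prompt; optionally prepend retrieved facts (RAG-style)."""
    context = ""
    if retrieved:
        context = "Background facts:\n" + "\n".join(f"- {f}" for f in retrieved) + "\n\n"
    return context + MCOT_TEMPLATE.format(question=question)

def answer_with_mcot(question: str, image_path: str,
                     call_mllm: Callable[[str, str], str],
                     retrieve_facts: Optional[Callable[[str], List[str]]] = None) -> str:
    facts = retrieve_facts(question) if retrieve_facts else None
    prompt = build_mcot_prompt(question, facts)
    raw = call_mllm(prompt, image_path)  # model returns a see/think/confirm trace
    # Extract the confirmed answer from the generated reasoning trace.
    for line in raw.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return raw.strip()
```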
- Language-Centric Long Reasoning (System-2 Thinking and Planning): To address the limitations of short, reactive reasoning, research has moved towards more deliberate, compositional, System-2-inspired processes.
- Cross-Modal Reasoning: This focuses on dynamic integration and reasoning across multiple modalities. Methods utilize external tools and algorithms (e.g., program execution [choudhury2023zero], dynamic tool orchestration [gao2023assistgpt], interleaving visual/textual steps based on algorithms [sun2024visual]), or enhance model-intrinsic capabilities to generate/infer multimodal information (e.g., fine-tuning on multimodal CoT data [DBLP:conf/aaai/WangHHXLLS24], refining visual-textual representations [li2025imagine]).
- Multimodal-O1: Inspired by models like OpenAI's o1, these models enhance reasoning through CoT fine-tuning and test-time scaling strategies. They often adopt multi-stage reasoning structures (e.g., Summary, Caption, Thinking, Answer) and use planning algorithms such as Beam Search or Monte Carlo Tree Search (MCTS) (e.g., Marco-o1 [DBLP:journals/corr/abs-2411-14405], llamaberry [zhang2024llamaberry]) to explore better reasoning paths.
- Multimodal-R1: This emerging paradigm leverages reinforcement learning, particularly DPO [yu2024rlhf] and GRPO [DBLP:journals/corr/abs-2501-12948], to improve the reasoning capability of MLLMs. By training models with preference data or multimodal feedback, these methods enhance reasoning depth, coherence, and domain adaptability (e.g., for mathematical problems [chen2025r1v], visual grounding/detection [shen2025vlmr1], or video/audio tasks [feng2025video, zhao2025r1omni]).
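As a concrete illustration of the GRPO-style training signal mentioned above, the sketch below computes group-relative advantages from per-response rewards and applies a clipped policy-gradient surrogate. It is a minimal sketch: the KL regularization to a reference policy and sequence-level log-probability aggregation used in practice are omitted, and rewards and shapes are illustrative.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style RL:
# rewards are normalized within each group of sampled responses per prompt.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective with per-response advantages (KL term omitted)."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize -> minimize negative

if __name__ == "__main__":
    # Two prompts, four sampled responses each; rewards could be answer correctness.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])
    adv = group_relative_advantages(rewards)
    logp_old = torch.randn(2, 4)
    logp_new = logp_old + 0.05 * torch.randn(2, 4)
    print(clipped_surrogate(logp_new, logp_old, adv))
```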
- Towards Native Large Multimodal Reasoning Model (Prospect): The paper envisions a future paradigm shift towards N-LMRMs, which are natively designed for unified multimodal understanding, generation, and agentic reasoning across any modality. Current LMRMs, despite their progress, remain limited by language-centric retrofitting, a predominant focus on vision and language, and underdeveloped interactive and long-horizon reasoning in dynamic environments.
- Experimental Findings: Evaluation on omni-modal benchmarks (e.g., OmniMMI [wang2025omnimmi]) and agent benchmarks (e.g., OSWorld [DBLP:conf/nips/XieZCLZCHCSLLXZ24], VisualWebArena [koh2024visualwebarena]) reveals significant limitations in effectively processing diverse modalities and performing complex, interactive tasks, even for models like GPT-4o and Gemini-1.5-Pro. Case studies with OpenAI o3 and o4-mini [openai2025o3o4] show improved multimodal CoT and tool use but highlight issues like interference between modalities, struggles with file/multimedia handling, and fabricated reasoning.
- Capabilities of N-LMRMs: N-LMRMs are envisioned with Multimodal Agentic Reasoning (proactive, goal-driven interaction, hierarchical planning, real-time adaptation, embodied learning) and Omni-Modal Understanding and Generative Reasoning (unified representation for heterogeneous data fusion, contextual multimodal generation, modality-agnostic inference). This involves exploring agentic models (e.g., R1-Searcher [song2025r1], Magma [yang2025magma]) and omni-modal models (e.g., M2-omni [guo2025m2], MiniCPM-o [team2025minicpm]), though current models only explore subsets of these capabilities.
- Technical Prospects: Building N-LMRMs requires addressing fundamental challenges: designing unified architectures for seamless cross-modal fusion (potentially using MoE [chen2024octavius]), enabling interleaved multimodal long CoT for dynamic test-time scaling across modalities, and developing systems that learn and evolve from real-world experiences through continuous interaction and potentially online reinforcement learning [qin2025ui]. High-quality data synthesis pipelines are also crucial for training, extending beyond current unimodal/cross-modal focuses to multimodal interaction chains and multi-tool coordination.
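To illustrate the MoE-based fusion idea raised in the technical prospects, below is a minimal top-k token-routing mixture-of-experts layer. The expert count, dimensions, and the dense loop-over-experts dispatch are simplifying assumptions for readability, not the design of any cited system.

```python
# Minimal sketch of a top-k token-routing MoE layer, as sometimes proposed
# for cross-modal fusion. Configuration values are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, k=2, hidden=512):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens):  # tokens: (batch, seq, dim), any modality's tokens
        gate_logits = self.router(tokens)                 # (B, S, num_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Dense loop over experts for clarity; real systems dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                                  # where expert e was selected
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True) # routing weight for expert e
                out = out + w * expert(tokens)
        return out

if __name__ == "__main__":
    layer = TopKMoE()
    fused = layer(torch.randn(2, 10, 256))  # e.g., concatenated image+text tokens
    print(fused.shape)
```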
The survey also includes a detailed categorization of multimodal datasets and benchmarks, organized by the capabilities they evaluate: Understanding (Visual-centric, Audio-centric), Generation (Cross-modal, Joint Multimodal), Reasoning (General Visual, Domain-Specific), and Planning (GUI, Embodied/Simulated). It also summarizes common evaluation methods, including Exact/Fuzzy Match, Option Matching, LLM/MLLM Scoring, and Agentic Evaluation.
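For the simpler evaluation modes listed above (exact match, fuzzy match, option matching), a few illustrative helpers are sketched below; the normalization rules and the 0.8 fuzzy threshold are assumptions, not the survey's official protocol.

```python
# Illustrative helpers for exact match, fuzzy match, and option matching.
# Normalization and thresholds are assumptions for demonstration only.
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

def option_match(pred: str, options: dict) -> str:
    """Map a free-form prediction to a multiple-choice letter (e.g. 'A'-'D')."""
    m = re.search(r"\b([A-D])\b", pred.upper())
    if m and m.group(1) in options:
        return m.group(1)
    for letter, text in options.items():  # fall back to matching the option text
        if fuzzy_match(pred, text):
            return letter
    return ""

if __name__ == "__main__":
    print(exact_match("Eiffel Tower", "eiffel tower."))                       # True
    print(option_match("I think the answer is (B).", {"A": "a cat", "B": "a dog"}))  # B
```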
In conclusion, the paper charts the progress of LMRMs from modular perception to language-centric reasoning, highlighting the increasing importance of explicit, structured, and externally augmented reasoning. It identifies key remaining challenges, particularly in visual-centric long reasoning and interactive multimodal reasoning. The survey proposes Native LMRMs as a future direction, emphasizing unified omni-modal perception, interactive generative reasoning, and agentic behavior learned from world experiences, pointing towards foundational research needed to achieve truly adaptive and comprehensive AI systems.