Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning (2401.06805v2)

Published 10 Jan 2024 in cs.CL and cs.AI

Abstract: Strong Artificial Intelligence (Strong AI) or AGI with abstract reasoning ability is the goal of next-generation AI. Recent advancements in LLMs, along with the emerging field of Multimodal LLMs (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.

Authors (10)
  1. Yiqi Wang (39 papers)
  2. Wentao Chen (39 papers)
  3. Xiaotian Han (46 papers)
  4. Xudong Lin (37 papers)
  5. Haiteng Zhao (13 papers)
  6. Yongfei Liu (25 papers)
  7. Bohan Zhai (13 papers)
  8. Jianbo Yuan (33 papers)
  9. Quanzeng You (41 papers)
  10. Hongxia Yang (130 papers)
Citations (38)