EmbodiedBench: A Benchmark for Evaluating Multi-modal LLMs as Vision-Driven Embodied Agents
The paper introduces EmbodiedBench, a comprehensive benchmark designed to assess the capabilities of Multi-modal LLMs (MLLMs) as vision-driven embodied agents. While the role of LLMs in language-centric agent tasks has been studied extensively, the paper addresses the comparatively under-explored use of MLLMs in tasks that require understanding both language and vision.
EmbodiedBench evaluates MLLMs across 1,128 diverse tasks distributed among four distinct environments: EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation. These environments test the agents across a spectrum of scenarios requiring both high-level semantic understanding and low-level action execution. The paper proposes a structured evaluation framework to assess distinct capabilities in embodied agents, including commonsense reasoning, spatial awareness, and long-term planning.
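To make this structure concrete, the following sketch illustrates one way such a capability-oriented evaluation loop could be organized. The environment names come from the paper, but every interface shown here (load_tasks, make_env, agent.act, env.step, the task fields) is a hypothetical placeholder rather than EmbodiedBench's actual API; treat it as a minimal sketch under those assumptions.

    # Hypothetical sketch of a capability-oriented evaluation loop.
    # Environment names follow the paper; all function and attribute names
    # below are illustrative placeholders, not the benchmark's real interface.
    from collections import defaultdict

    ENVIRONMENTS = ["EB-ALFRED", "EB-Habitat", "EB-Navigation", "EB-Manipulation"]

    def evaluate(agent, load_tasks, make_env, max_steps=30):
        """Run the agent on every task and aggregate success per (environment, capability)."""
        successes = defaultdict(list)
        for env_name in ENVIRONMENTS:
            for task in load_tasks(env_name):      # each task carries an instruction and a capability tag
                env = make_env(env_name, task)
                obs = env.reset()
                success = False
                for _ in range(max_steps):
                    action = agent.act(task.instruction, obs)  # MLLM maps (text, image) to an action
                    obs, success, done = env.step(action)
                    if done:
                        break
                successes[(env_name, task.capability)].append(float(success))
        # Mean success rate per environment and capability subset.
        return {key: sum(vals) / len(vals) for key, vals in successes.items()}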
The authors conduct extensive experiments on 13 leading proprietary and open-source MLLMs. The results indicate that while current MLLMs perform well on high-level semantic tasks, they struggle significantly with low-level manipulation: GPT-4o, for example, achieves an average score of only 28.9% on low-level tasks, leaving substantial room for improvement. Vision input also proves critical for low-level tasks, with performance dropping by 40% to 70% when visual input is removed on tasks that demand precise perception and spatial reasoning.
The implications of these findings are twofold. Practically, they underscore the need for more refined MLLM architectures that can handle both high-level reasoning and low-level control. Theoretically, they invite further research into improving MLLMs' spatial reasoning and manipulation capabilities. Future work could focus on better integrating spatial information into MLLMs to improve performance on tasks that require complex visual and spatial understanding.
A key contribution is the introduction of capability-oriented evaluation, which enables fine-grained assessment of multi-modal agents. This is crucial for developing models tailored to specific embodied AI applications. The benchmark is expected to inspire future research on enhancing the adaptability of MLLMs in real-world environments, facilitating interactions that involve both linguistic and visual inputs.
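As a usage illustration of such fine-grained assessment, per-capability scores of the form produced by the sketch above could be compared across models to expose exactly where each one falls short. The helper below is again a hypothetical sketch: scores_by_model is assumed to map a model name to the dictionary returned by the earlier evaluate function, and no numbers from the paper are embedded here.

    # Hypothetical usage: compare per-capability success rates across models.
    # scores_by_model maps a model name to {(environment, capability): mean success rate}.
    def capability_report(scores_by_model, capability):
        """Print each model's mean success rate on one capability subset."""
        for model, scores in scores_by_model.items():
            subset = [v for (env_name, cap), v in scores.items() if cap == capability]
            mean = sum(subset) / len(subset) if subset else float("nan")
            print(f"{model:25s} {capability:15s} {mean:.1%}")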
In conclusion, EmbodiedBench provides a significant step toward understanding and improving the performance of MLLMs as embodied agents. By highlighting the current models' limitations, particularly in low-level manipulation, the research sets a clear agenda for future developments in this evolving field.