CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (2407.01511v2)

Published 1 Jul 2024 in cs.AI

Abstract: The development of autonomous agents increasingly relies on Multimodal LLMs (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.

Overview of CRAB: Cross-environment Agent Benchmark for Multimodal LLM Agents

The paper "CRAB: Cross-environment Agent Benchmark for Multimodal LLM Agents" introduces a sophisticated framework for evaluating autonomous agents, known as CRAB. This benchmark is designed to overcome the limitations of existing benchmarks by supporting cross-environment tasks and adopting an advanced graph-based evaluation method.

Core Contributions

The paper presents several key contributions to the field of autonomous agents and multimodal LLMs (MLMs):

  1. Cross-environment Tasks: CRAB is the first framework to facilitate agent benchmarking across multiple environments, allowing agents to interact with desktop computers and mobile devices through a standardized interface. This approach mirrors complex real-world applications more closely than traditional single-environment tasks.
  2. Graph Evaluator: A novel graph-based evaluation method is employed, decomposing tasks into multiple sub-goals represented as nodes within a Directed Acyclic Graph (DAG). This method offers fine-grained evaluation metrics, accommodating diverse valid pathways to task completion and addressing limitations of goal-based and trajectory-based metrics (a minimal sketch of this idea follows the list).
  3. Task and Evaluator Construction: The framework introduces a scalable task generation method using sub-task composition within a graph structure. This approach enables effective task construction and the creation of corresponding graph evaluators, enhancing the benchmark's flexibility and extensibility.
  4. Comprehensive Benchmarking: CRAB is implemented with a Python interface across virtual environments, featuring 120 tasks in computer desktop and mobile phone settings. It supports a variety of MLMs and different agent system configurations.
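
To make the graph-evaluator idea concrete, here is a minimal, illustrative sketch in plain Python. The names (`SubGoal`, `GraphEvaluator`) and the example task are invented for illustration and are not CRAB's actual API; the sketch only shows how sub-goals arranged in a DAG yield a fine-grained completion ratio rather than a binary success signal.

```python
# Minimal, illustrative sketch of a graph evaluator (invented names, not CRAB's API):
# each node is a sub-goal with a checker over the current environment state, and a
# node is only checked once all of its predecessor sub-goals are completed.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubGoal:
    name: str
    check: Callable[[dict], bool]              # True if the sub-goal holds in the current state
    predecessors: List[str] = field(default_factory=list)


class GraphEvaluator:
    def __init__(self, sub_goals: List[SubGoal]):
        self.sub_goals: Dict[str, SubGoal] = {g.name: g for g in sub_goals}
        self.completed: set = set()

    def update(self, state: dict) -> None:
        """Called after every agent action: check sub-goals whose predecessors are done."""
        for goal in self.sub_goals.values():
            if goal.name in self.completed:
                continue
            if all(p in self.completed for p in goal.predecessors) and goal.check(state):
                self.completed.add(goal.name)

    @property
    def completion_ratio(self) -> float:
        return len(self.completed) / len(self.sub_goals)


# Example cross-environment task: copy a string on the desktop, then paste it into a
# note on the phone. The second sub-goal depends on the first.
evaluator = GraphEvaluator([
    SubGoal("text_copied", lambda s: s.get("desktop_clipboard") == "hello"),
    SubGoal("text_on_phone", lambda s: s.get("phone_note") == "hello",
            predecessors=["text_copied"]),
])
evaluator.update({"desktop_clipboard": "hello", "phone_note": ""})
print(evaluator.completion_ratio)  # 0.5 -- partial credit, unlike a binary success flag
```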

Experimental Evaluation

The paper evaluates multiple advanced MLMs, including GPT-4o and GPT-4 Turbo, using various agent configurations (single-agent, multi-agent by functionality, and multi-agent by environment). The evaluation highlights several findings:

  • Performance Metrics: GPT-4o, configured as a single agent, achieved the highest completion ratio of 38.01%. The paper introduces new metrics, Completion Ratio (CR) and Execution Efficiency (EE), which capture partial task progress and action economy rather than only a binary success rate (a rough formalization follows this list).
  • Agent Structures: Single-agent configurations generally outperformed multi-agent systems, likely due to communication challenges inherent in multi-agent setups. However, the added complexity of multi-environment tasks underscores the importance of designing robust communication protocols within agent systems (the configurations are sketched after this list).
  • Implications for Future AI: The results suggest the need for ongoing advancements in autonomous agents capable of handling cross-platform interactions. Further development in improving communication efficacy and reducing information loss in multi-agent frameworks is recommended.
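
As a rough, hedged formalization of the two metrics (the exact definitions are in the paper; here CR is taken as the fraction of completed sub-goal nodes in the graph evaluator, and EE as CR normalized by the number of executed actions):

```python
# Hedged formalization of the reported metrics; assumes CR = completed sub-goal nodes
# over total nodes and EE = CR per executed action (see the paper for exact definitions).

def completion_ratio(completed_nodes: int, total_nodes: int) -> float:
    """Fraction of sub-goals in the graph evaluator the agent has satisfied."""
    return completed_nodes / total_nodes


def execution_efficiency(cr: float, num_actions: int) -> float:
    """Progress per executed action; higher means fewer wasted steps."""
    return cr / num_actions


cr = completion_ratio(5, 8)        # 0.625
ee = execution_efficiency(cr, 20)  # 0.03125 after 20 actions
```

The agent-system configurations referenced above can likewise be sketched. The functions below are invented for illustration (not CRAB's agent code) and show one plausible arrangement: a single agent receives every environment's observation directly, whereas in a multi-agent-by-environment setup the main agent communicates with per-environment sub-agents only through messages, which is one place information can be lost.

```python
# Illustrative sketch of two agent-system configurations (invented names, not CRAB's code).
from typing import Callable, Dict

Model = Callable[[str], str]   # prompt -> model response (an action or a message)


def single_agent_step(model: Model, instruction: str,
                      observations: Dict[str, str]) -> str:
    # One model sees the instruction plus every environment's observation at once.
    prompt = instruction + "\n" + "\n".join(f"[{env}] {obs}" for env, obs in observations.items())
    return model(prompt)


def multi_agent_by_environment_step(main_model: Model, sub_models: Dict[str, Model],
                                    instruction: str,
                                    observations: Dict[str, str]) -> Dict[str, str]:
    # The main agent sees only the instruction and relays a message to each sub-agent;
    # anything not captured in that message is invisible to the sub-agents.
    message = main_model(instruction)
    return {env: sub_models[env](message + "\n" + observations[env]) for env in observations}
```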

Practical and Theoretical Implications

CRAB's introduction sets a new standard for evaluating MLM-based autonomous agents by:

  • Enhancing Real-world Applicability: By supporting cross-environment interactions, CRAB benchmarks more realistic tasks reflecting how humans use technology in everyday scenarios.
  • Improving Evaluation Precision: The graph-based evaluator advances the precision of autonomous agent assessments, crucial for developing reliable systems.
  • Guiding Future Research: The findings provide valuable insights for future research directions, emphasizing the need for sophisticated communication strategies in multi-agent systems.

Conclusion

The CRAB benchmark represents a significant advancement in autonomous agent research, providing a robust framework for evaluating MLM agents across complex, multi-platform environments. Its innovative graph-based evaluation method and expansive task set promise to enhance our understanding and development of these systems, paving the way for future innovations in AI technology.

Authors (16)
  1. Tianqi Xu (11 papers)
  2. Linyao Chen (5 papers)
  3. Dai-Jie Wu (4 papers)
  4. Yanjun Chen (22 papers)
  5. Zecheng Zhang (45 papers)
  6. Xiang Yao (4 papers)
  7. Zhiqiang Xie (15 papers)
  8. Yongchao Chen (18 papers)
  9. Shilong Liu (60 papers)
  10. Bochen Qian (1 paper)
  11. Philip Torr (172 papers)
  12. Bernard Ghanem (255 papers)
  13. Guohao Li (43 papers)
  14. Anjie Yang (1 paper)
  15. Zhaoxuan Jin (2 papers)
  16. Jianbo Deng (3 papers)
Citations (11)