MMInA: Benchmarking Multihop Multimodal Internet Agents (2404.09992v1)

Published 15 Apr 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) LLMs and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://mmina.cliangyu.com

Evaluating Embodied Agents in Multihop, Multimodal Web Environments with MMInA

Introduction

Whether autonomous agents can navigate and complete tasks in a multihop, multimodal web environment is a central question for research on Internet agents. The paper introduces MMInA, a benchmark for evaluating such agents on compositional Internet tasks across evolving real-world websites. The benchmark focuses on an agent's ability to interpret and act on multimodal inputs (text and images) across multiple hops, requiring reasoning, planning, and execution over a series of interconnected web-based tasks.

Benchmark Design

MMInA goes beyond existing benchmarks by providing a realistic and challenging environment for agents, highlighted by three distinguishing features:

  • Evolving Real-World Websites: MMInA is grounded in the evolving landscape of real-world websites, increasing the benchmark's relevance and difficulty by presenting agents with up-to-date challenges reflective of actual user tasks.
  • Multihop Browsing: The benchmark comprises tasks that demand navigation and action across multiple websites, thus evaluating the agents' long-range reasoning and planning capabilities.
  • Holistic Evaluation: A novel evaluation protocol assesses agent performance at each hop of a multihop task, providing insight into step-wise effectiveness and efficiency in task execution; a minimal scoring sketch follows this list.
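The following is a minimal sketch of how such hop-wise scoring could be computed, assuming each task is recorded as an ordered list of per-hop success flags. The function names and data layout are illustrative assumptions, not MMInA's actual evaluation code.

```python
# Hop-wise ("holistic") scoring sketch: each task is an ordered list of booleans
# marking whether the agent completed that hop. Names are illustrative only.
from typing import List


def hop_success_rate(task_hop_results: List[List[bool]]) -> float:
    """Fraction of individual hops completed across all tasks."""
    total_hops = sum(len(hops) for hops in task_hop_results)
    passed_hops = sum(sum(hops) for hops in task_hop_results)
    return passed_hops / total_hops if total_hops else 0.0


def task_success_rate(task_hop_results: List[List[bool]]) -> float:
    """Fraction of tasks in which every hop succeeded (end-to-end success)."""
    completed = sum(all(hops) for hops in task_hop_results)
    return completed / len(task_hop_results) if task_hop_results else 0.0


if __name__ == "__main__":
    # Three toy tasks: the first fails on its second hop, the others succeed.
    results = [[True, False, False], [True, True], [True, True, True]]
    print(hop_success_rate(results))   # 0.75
    print(task_success_rate(results))  # ~0.67
```

Separating the two rates makes the failure mode described in the paper visible: an agent that stalls on early hops can still score partial hop-level credit while its end-to-end task success stays low.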

Multimodal Web Content and Multihop Design

MMInA explicitly addresses the challenges of multimodal web navigation by including tasks that require processing both textual and visual web content. The benchmark comprises 1,050 tasks across 14 diverse websites; on average, a task requires 2.85 hops and 12.9 actions to complete. This multihop, multimodal setup pushes the boundaries of current agent capabilities by assessing performance in complex scenarios that closely resemble real-world web use. A hypothetical sketch of such a task's structure is shown below.
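Below is a hypothetical illustration of how a compositional, multihop task might be represented. The class and field names are assumptions for exposition and do not reflect the benchmark's actual data schema.

```python
# Hypothetical representation of an MMInA-style multihop task; the schema
# below is an assumption for illustration, not the benchmark's actual format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    website: str           # e.g. a shopping or travel site the agent must visit
    goal: str              # what must be extracted or done on this website
    completed: bool = False


@dataclass
class MultihopTask:
    instruction: str       # the human-written natural-language task
    hops: List[Hop] = field(default_factory=list)

    @property
    def num_hops(self) -> int:
        return len(self.hops)


task = MultihopTask(
    instruction="Find a red kayak in the shop, then book a lakeside hotel near the pickup city.",
    hops=[
        Hop(website="shopping", goal="locate a red kayak and note the pickup city"),
        Hop(website="travel", goal="book a lakeside hotel near that city"),
    ],
)
print(task.num_hops)  # 2
```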

Experimental Insights

The extensive experiments with state-of-the-art agents underscore the current limitations in solving multihop web tasks, revealing a substantial performance gap relative to humans. Notably, standalone models such as GPT-4V achieved promising yet insufficient success rates, indicating the need for models better able to navigate the nuanced, multifaceted web environment MMInA presents.

Memory-augmented Agents

The research introduces a memory-augmentation approach that replays past action trajectories as a reflective mechanism, enabling agents to learn from previous interactions and adjust their strategies. This method significantly improves both the single-hop and multihop browsing capabilities of the evaluated agents; a sketch of the idea follows.
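The sketch below illustrates the general idea under stated assumptions: a bounded memory of past trajectories is rendered into the prompt before each action so the agent can reflect on earlier steps. The prompt format and helper names are illustrative and not the authors' implementation.

```python
# Minimal, hypothetical sketch of memory augmentation via trajectory replay:
# past (sub)task trajectories are prepended to the prompt so the agent can
# reflect before choosing its next action. Names and prompt format are
# assumptions for illustration, not the paper's actual code.
from typing import Callable, List


class TrajectoryMemory:
    def __init__(self, max_items: int = 5) -> None:
        self.trajectories: List[str] = []
        self.max_items = max_items

    def add(self, trajectory: str) -> None:
        self.trajectories.append(trajectory)
        # Keep only the most recent trajectories to bound prompt length.
        self.trajectories = self.trajectories[-self.max_items:]

    def render(self) -> str:
        if not self.trajectories:
            return ""
        joined = "\n".join(f"- {t}" for t in self.trajectories)
        return f"Past action trajectories (for reflection):\n{joined}\n"


def next_action(llm: Callable[[str], str], memory: TrajectoryMemory,
                observation: str, goal: str) -> str:
    """Prepend replayed trajectories to the current goal and observation."""
    prompt = f"{memory.render()}Goal: {goal}\nObservation: {observation}\nAction:"
    return llm(prompt)


if __name__ == "__main__":
    memory = TrajectoryMemory()
    memory.add("hop 1: searched 'red kayak' -> clicked result -> noted pickup city")

    def fake_llm(prompt: str) -> str:
        return "click [book hotel]"  # stand-in for a real model call

    print(next_action(fake_llm, memory,
                      observation="travel site homepage",
                      goal="book a lakeside hotel"))
```

Bounding the replayed memory (here, the five most recent trajectories) keeps the prompt within context limits while still surfacing the early-hop decisions that, per the paper, are where agents most often fail.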

Implications and Future Directions

The findings from MMInA benchmarking reveal crucial insights into the challenges faced by current web agents in executing realistic, compositional tasks on the Internet. The demonstrated need for enhanced multimodal reasoning and memory mechanisms indicates significant opportunities for future research. The introduction of memory-augmented agents suggests a promising direction towards developing more proficient models capable of addressing the complexities inherent in real-world web navigation and task execution.

In conclusion, MMInA provides a robust framework for evaluating and developing autonomous web agents in a realistic, challenging, and ever-changing environment. Future agents that combine stronger multimodal understanding, memory augmentation, and adaptive reasoning stand to substantially advance autonomous Internet navigation and task completion.

Authors (4)
  1. Ziniu Zhang
  2. Shulin Tian
  3. Liangyu Chen
  4. Ziwei Liu