MMInA: Benchmarking Multihop Multimodal Internet Agents
Abstract: Autonomous embodied agents live on an Internet of multimedia websites. Can they hop across multimodal websites to complete complex user tasks? Existing benchmarks do not assess them in a realistic, evolving environment that requires acting across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, each requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from, or actions on, multiple websites to solve, assessing long-range reasoning on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) LLMs and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which lowers task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for the agent to reflect on. Our method significantly improves both single-hop and multihop web browsing performance. Our code and data are available at github.com/shulin16/MMInA.
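The abstract's memory augmentation idea, replaying past action trajectories so the agent can reflect on earlier hops before acting on later ones, can be sketched in a few lines. The following Python is a minimal illustration under our own assumptions, not the authors' implementation; the class and function names (`HopRecord`, `TrajectoryMemory`, `build_prompt`) are hypothetical.

```python
# Minimal sketch of trajectory-replay memory augmentation for a multihop web agent.
# All names here are illustrative assumptions, not the MMInA codebase's API.
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    """One completed hop: the sub-goal, the actions taken, and the outcome."""
    subgoal: str
    actions: list[str]
    outcome: str  # e.g., information extracted from that website


@dataclass
class TrajectoryMemory:
    """Stores past hops and renders them back into the prompt so the agent
    conditions on earlier progress when choosing its next action."""
    records: list[HopRecord] = field(default_factory=list)

    def add(self, record: HopRecord) -> None:
        self.records.append(record)

    def render(self) -> str:
        if not self.records:
            return "No previous hops."
        lines = []
        for i, r in enumerate(self.records, 1):
            lines.append(f"Hop {i}: goal = {r.subgoal}")
            lines.append(f"  actions: {'; '.join(r.actions)}")
            lines.append(f"  outcome: {r.outcome}")
        return "\n".join(lines)


def build_prompt(task: str, memory: TrajectoryMemory, observation: str) -> str:
    """Prepend the replayed trajectory to the current page observation so a
    (multimodal) LLM can reflect on early hops instead of losing them."""
    return (
        f"Task: {task}\n"
        f"Past trajectory (reflect before acting):\n{memory.render()}\n"
        f"Current page observation:\n{observation}\n"
        "Next action:"
    )


# Usage: after finishing a hop, record it; the next hop's prompt replays it.
memory = TrajectoryMemory()
memory.add(HopRecord(
    subgoal="find the cheapest flight to Tokyo",
    actions=["search 'SFO to NRT'", "sort results by price"],
    outcome="Flight X departing June 3, $512",
))
prompt = build_prompt(
    task="Book a trip to Tokyo and a hotel near Shinjuku",
    memory=memory,
    observation="<hotel-site page snapshot>",
)
```

The design point is only that earlier hops are serialized back into context, which is one plausible reading of why replay mitigates the early-hop failures the abstract reports.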