MMInA: Benchmarking Multihop Multimodal Internet Agents (2404.09992v1)

Published 15 Apr 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) LLMs and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://mmina.cliangyu.com

Evaluating Embodied Agents in Multihop, Multimodal Web Environments with MMInA

Introduction

Whether autonomous agents can navigate and complete tasks in a multihop, multimodal web environment is a central question for research on Internet agents. The paper introduces MMInA, a benchmark for evaluating such agents on compositional Internet tasks across evolving real-world websites. The benchmark focuses on an agent's ability to interpret and act on multimodal inputs (text and images) across multiple hops, requiring reasoning, planning, and execution over a series of interconnected web-based tasks.

Benchmark Design

MMInA goes beyond existing benchmarks by providing a realistic and challenging environment for agents, highlighted by three distinguishing features:

  • Evolving Real-World Websites: MMInA is grounded in the evolving landscape of real-world websites, increasing the benchmark's relevance and difficulty by presenting agents with up-to-date challenges reflective of actual user tasks.
  • Multihop Browsing: The benchmark comprises tasks that demand navigation and action across multiple websites, thus evaluating the agents' long-range reasoning and planning capabilities.
  • Holistic Evaluation: A novel evaluation protocol assesses agent performance at each hop of a multihop task, providing insight into step-wise effectiveness and efficiency in task execution; a minimal scoring sketch follows this list.
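The following is a minimal sketch of how such hop-wise scoring could be computed, assuming each task is recorded as an ordered list of per-hop success flags. The function names and data layout are illustrative assumptions, not MMInA's actual evaluation code.

```python
# Hop-wise ("holistic") scoring sketch: each task is an ordered list of booleans
# marking whether the agent completed that hop. Names are illustrative only.
from typing import List


def hop_success_rate(task_hop_results: List[List[bool]]) -> float:
    """Fraction of individual hops completed across all tasks."""
    total_hops = sum(len(hops) for hops in task_hop_results)
    passed_hops = sum(sum(hops) for hops in task_hop_results)
    return passed_hops / total_hops if total_hops else 0.0


def task_success_rate(task_hop_results: List[List[bool]]) -> float:
    """Fraction of tasks in which every hop succeeded (end-to-end success)."""
    completed = sum(all(hops) for hops in task_hop_results)
    return completed / len(task_hop_results) if task_hop_results else 0.0


if __name__ == "__main__":
    # Three toy tasks: the first fails on its second hop, the others succeed.
    results = [[True, False, False], [True, True], [True, True, True]]
    print(hop_success_rate(results))   # 0.75
    print(task_success_rate(results))  # ~0.67
```

Separating the two rates makes the failure mode described in the paper visible: an agent that stalls on early hops can still score partial hop-level credit while its end-to-end task success stays low.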

Multimodal Web Content and Multihop Design

MMInA explicitly addresses the challenges of multimodal web navigation by including tasks that require processing both textual and visual web content. The benchmark comprises 1,050 tasks across 14 diverse websites; on average, a task requires 2.85 hops and 12.9 actions to complete. This multihop, multimodal setup pushes the boundaries of current agent capabilities by assessing performance in complex scenarios that closely resemble real-world web use. A hypothetical sketch of such a task's structure is shown below.
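Below is a hypothetical illustration of how a compositional, multihop task might be represented. The class and field names are assumptions for exposition and do not reflect the benchmark's actual data schema.

```python
# Hypothetical representation of an MMInA-style multihop task; the schema
# below is an assumption for illustration, not the benchmark's actual format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    website: str           # e.g. a shopping or travel site the agent must visit
    goal: str              # what must be extracted or done on this website
    completed: bool = False


@dataclass
class MultihopTask:
    instruction: str       # the human-written natural-language task
    hops: List[Hop] = field(default_factory=list)

    @property
    def num_hops(self) -> int:
        return len(self.hops)


task = MultihopTask(
    instruction="Find a red kayak in the shop, then book a lakeside hotel near the pickup city.",
    hops=[
        Hop(website="shopping", goal="locate a red kayak and note the pickup city"),
        Hop(website="travel", goal="book a lakeside hotel near that city"),
    ],
)
print(task.num_hops)  # 2
```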

Experimental Insights

The extensive experiments with state-of-the-art agents underscore the current limitations in solving multihop web tasks, revealing a substantial performance gap relative to humans. Notably, standalone models such as GPT-4V achieved promising yet insufficient success rates, indicating the need for models better able to navigate the nuanced, multifaceted web environment MMInA presents.

Memory-augmented Agents

The research introduces a memory-augmentation approach that replays past action trajectories as a reflective mechanism, enabling agents to learn from previous interactions and adjust their strategies. This method significantly improves both the single-hop and multihop browsing capabilities of the evaluated agents; a sketch of the idea follows.
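The sketch below illustrates the general idea under stated assumptions: a bounded memory of past trajectories is rendered into the prompt before each action so the agent can reflect on earlier steps. The prompt format and helper names are illustrative and not the authors' implementation.

```python
# Minimal, hypothetical sketch of memory augmentation via trajectory replay:
# past (sub)task trajectories are prepended to the prompt so the agent can
# reflect before choosing its next action. Names and prompt format are
# assumptions for illustration, not the paper's actual code.
from typing import Callable, List


class TrajectoryMemory:
    def __init__(self, max_items: int = 5) -> None:
        self.trajectories: List[str] = []
        self.max_items = max_items

    def add(self, trajectory: str) -> None:
        self.trajectories.append(trajectory)
        # Keep only the most recent trajectories to bound prompt length.
        self.trajectories = self.trajectories[-self.max_items:]

    def render(self) -> str:
        if not self.trajectories:
            return ""
        joined = "\n".join(f"- {t}" for t in self.trajectories)
        return f"Past action trajectories (for reflection):\n{joined}\n"


def next_action(llm: Callable[[str], str], memory: TrajectoryMemory,
                observation: str, goal: str) -> str:
    """Prepend replayed trajectories to the current goal and observation."""
    prompt = f"{memory.render()}Goal: {goal}\nObservation: {observation}\nAction:"
    return llm(prompt)


if __name__ == "__main__":
    memory = TrajectoryMemory()
    memory.add("hop 1: searched 'red kayak' -> clicked result -> noted pickup city")

    def fake_llm(prompt: str) -> str:
        return "click [book hotel]"  # stand-in for a real model call

    print(next_action(fake_llm, memory,
                      observation="travel site homepage",
                      goal="book a lakeside hotel"))
```

Bounding the replayed memory (here, the five most recent trajectories) keeps the prompt within context limits while still surfacing the early-hop decisions that, per the paper, are where agents most often fail.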

Implications and Future Directions

The findings from MMInA benchmarking reveal crucial insights into the challenges faced by current web agents in executing realistic, compositional tasks on the Internet. The demonstrated need for enhanced multimodal reasoning and memory mechanisms indicates significant opportunities for future research. The introduction of memory-augmented agents suggests a promising direction towards developing more proficient models capable of addressing the complexities inherent in real-world web navigation and task execution.

In conclusion, MMInA provides a robust framework for evaluating and developing autonomous web agents in a realistic, challenging, and ever-changing environment. Future agents that combine stronger multimodal understanding, memory augmentation, and adaptive reasoning stand to substantially advance autonomous Internet navigation and task completion.

Authors (4)
  1. Ziniu Zhang
  2. Shulin Tian
  3. Liangyu Chen
  4. Ziwei Liu