MMInA: Benchmarking Multihop Multimodal Internet Agents
Abstract: Autonomous embodied agents live on an Internet of multimedia websites. Can they hop across multimodal websites to complete complex user tasks? Existing benchmarks do not assess them in a realistic, evolving environment that requires acting across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, each requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from, or actions on, multiple websites to solve, assessing long-range reasoning on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) LLMs and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops of tasks with more hops, which lowers task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for the agent to reflect on. Our method significantly improves both single-hop and multihop web browsing performance. Our code and data are available at github.com/shulin16/MMInA.
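The abstract's memory augmentation idea, replaying past action trajectories so the agent can reflect on earlier hops before acting on later ones, can be sketched in a few lines. The following Python is a minimal illustration under our own assumptions, not the authors' implementation; the class and function names (`HopRecord`, `TrajectoryMemory`, `build_prompt`) are hypothetical.

```python
# Minimal sketch of trajectory-replay memory augmentation for a multihop web agent.
# All names here are illustrative assumptions, not the MMInA codebase's API.
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    """One completed hop: the sub-goal, the actions taken, and the outcome."""
    subgoal: str
    actions: list[str]
    outcome: str  # e.g., information extracted from that website


@dataclass
class TrajectoryMemory:
    """Stores past hops and renders them back into the prompt so the agent
    conditions on earlier progress when choosing its next action."""
    records: list[HopRecord] = field(default_factory=list)

    def add(self, record: HopRecord) -> None:
        self.records.append(record)

    def render(self) -> str:
        if not self.records:
            return "No previous hops."
        lines = []
        for i, r in enumerate(self.records, 1):
            lines.append(f"Hop {i}: goal = {r.subgoal}")
            lines.append(f"  actions: {'; '.join(r.actions)}")
            lines.append(f"  outcome: {r.outcome}")
        return "\n".join(lines)


def build_prompt(task: str, memory: TrajectoryMemory, observation: str) -> str:
    """Prepend the replayed trajectory to the current page observation so a
    (multimodal) LLM can reflect on early hops instead of losing them."""
    return (
        f"Task: {task}\n"
        f"Past trajectory (reflect before acting):\n{memory.render()}\n"
        f"Current page observation:\n{observation}\n"
        "Next action:"
    )


# Usage: after finishing a hop, record it; the next hop's prompt replays it.
memory = TrajectoryMemory()
memory.add(HopRecord(
    subgoal="find the cheapest flight to Tokyo",
    actions=["search 'SFO to NRT'", "sort results by price"],
    outcome="Flight X departing June 3, $512",
))
prompt = build_prompt(
    task="Book a trip to Tokyo and a hotel near Shinjuku",
    memory=memory,
    observation="<hotel-site page snapshot>",
)
```

The design point is only that earlier hops are serialized back into context, which is one plausible reading of why replay mitigates the early-hop failures the abstract reports.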