VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks (2410.19100v3)
Abstract: Videos are often used to learn or extract the information needed to complete tasks in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, focusing instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
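For concreteness, the success rates reported above can be read as per-category averages over binary task outcomes across the benchmark's two task types. The sketch below is a minimal illustration only, not the released VideoWA evaluation code; the record schema (the `category` and `success` fields) and the `success_rates` helper are assumptions made for this example.

```python
from collections import defaultdict

def success_rates(results):
    """Aggregate binary task outcomes into per-category success rates.

    `results` is a list of dicts such as
    {"category": "factual_retention", "success": True}; these field names
    are illustrative, not the benchmark's actual schema.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        wins[r["category"]] += int(r["success"])
    return {c: wins[c] / totals[c] for c in totals}

# Toy example with the two VideoWA task categories (hypothetical outcomes).
example = [
    {"category": "skill_retention", "success": True},
    {"category": "skill_retention", "success": False},
    {"category": "factual_retention", "success": False},
    {"category": "factual_retention", "success": True},
]
print(success_rates(example))  # {'skill_retention': 0.5, 'factual_retention': 0.5}
```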
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 Model Card, 2024.
- Windows agent arena: Evaluating multi-modal os agents at scale, 2024. URL https://arxiv.org/abs/2409.08264.
- The impact of element ordering on lm agent performance, 2024. URL https://arxiv.org/abs/2409.12089.
- Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024.
- Vila2: Vila augmented vila. arXiv preprint arXiv:2407.17453, 2024.
- Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. arXiv preprint arXiv:2403.08978, 2024.
- Multimodal web navigation with instruction-finetuned foundation models. In International Conference on Learning Representations (ICLR), 2024.
- Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 2024 Annual Meeting of the Association for Computational Linguistics (ACL), 2024a.
- Tree search for language model agents, 2024b. URL https://arxiv.org/abs/2407.01476.
- Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. arXiv preprint arXiv:2404.03648, 2024.
- Tvqa: Localized, compositional video question answering, 2019. URL https://arxiv.org/abs/1809.01696.
- Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. URL https://arxiv.org/abs/2311.17005.
- Vila: On pre-training for visual language models. arXiv preprint, 2023.
- Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023. URL https://arxiv.org/abs/2308.09126.
- OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024.
- Autonomous evaluation and refinement of digital agents, 2024. URL https://arxiv.org/abs/2404.06474.
- Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309, 2024.
- Perception test: A diagnostic benchmark for multimodal video models, 2023. URL https://arxiv.org/abs/2305.13786.
- Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356.
- Androidworld: A dynamic benchmarking environment for autonomous agents, 2024.
- Ical: Continual learning of multimodal agents by transforming trajectories into actionable insights, 2024. URL https://arxiv.org/abs/2406.14596.
- Step: Stacked llm policies for web actions. arXiv preprint arXiv:2310.03720v2, 2024.
- Movieqa: Understanding stories in movies through question-answering, 2016. URL https://arxiv.org/abs/1512.02902.
- Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.
- Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429.
- Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024. URL https://arxiv.org/abs/2407.15754.
- Next-qa: Next phase of question-answering to explaining temporal actions, 2021. URL https://arxiv.org/abs/2105.08276.
- Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.
- Longvila: Scaling long-context visual language models for long videos, 2024.
- Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.
- Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
- X-vila: Cross-modality alignment for large language model. CoRR, abs/2405.19335, 2024.
- Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024a.
- Mmina: Benchmarking multihop multimodal internet agents. arXiv preprint arXiv:2404.09992, 2024b.
- Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024.