VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks (2410.19100v3)

Published 24 Oct 2024 in cs.CV and cs.AI

Abstract: Videos are often used to learn or extract the information needed to complete tasks, in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, focusing instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
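
The two-part taxonomy above (skill retention vs. factual retention) and the per-category success rates reported in the abstract can be made concrete with a small sketch. The following Python is a hypothetical illustration, not the benchmark's actual API: the VideoTask schema, its field names, and the success_rates helper are assumptions made for exposition only.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    # The two areas of focus in the paper's task taxonomy.
    SKILL_RETENTION = "skill_retention"      # use a human demonstration to act efficiently
    FACTUAL_RETENTION = "factual_retention"  # retrieve instruction-relevant facts from video

@dataclass
class VideoTask:
    # Hypothetical record for one VideoWA-style task (illustrative only).
    task_id: str
    task_type: TaskType
    video_path: str    # tutorial video the agent may consult
    instruction: str   # the web task the agent must complete
    success: bool = False

def success_rates(tasks: list[VideoTask]) -> dict[TaskType, float]:
    # Per-category success rate, in the spirit of the paper's headline
    # numbers (e.g., 13.3% on factual retention tasks for the best model).
    totals: dict[TaskType, int] = defaultdict(int)
    wins: dict[TaskType, int] = defaultdict(int)
    for t in tasks:
        totals[t.task_type] += 1
        wins[t.task_type] += int(t.success)
    return {k: wins[k] / totals[k] for k in totals}

if __name__ == "__main__":
    demo = [
        VideoTask("fr-001", TaskType.FACTUAL_RETENTION, "videos/tut1.mp4",
                  "Apply the coupon code shown in the tutorial.", True),
        VideoTask("fr-002", TaskType.FACTUAL_RETENTION, "videos/tut2.mp4",
                  "Enter the shipping address mentioned in the video.", False),
        VideoTask("sr-001", TaskType.SKILL_RETENTION, "videos/tut3.mp4",
                  "Replicate the demonstrated checkout flow on a new item.", True),
    ]
    for task_type, rate in success_rates(demo).items():
        print(f"{task_type.value}: {rate:.1%}")
```

Tagging each task record with its category makes the skill-retention vs. factual-retention split of the results straightforward to recompute from raw per-task outcomes.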
