VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (2404.05955v1)
Abstract: Multimodal LLMs (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks and thus cannot measure fine-grained abilities such as OCR, understanding, and grounding. In this paper, we introduce VisualWebBench, a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks. VisualWebBench consists of seven tasks and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs, Gemini Pro, the Claude-3 series, and GPT-4V(ision) on VisualWebBench, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe VisualWebBench will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.
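The benchmark's multiple-choice tasks can be scored with a simple accuracy loop over the human-curated instances. The sketch below is a minimal, hypothetical illustration: the instance schema and the `query_mllm` stub are assumptions made for exposition, not the released VisualWebBench data format or evaluation code.

```python
# Minimal sketch (hypothetical): scoring a multiple-choice web-understanding task.
# The Instance schema and query_mllm stub are illustrative assumptions, not the
# released VisualWebBench format or API.
from dataclasses import dataclass


@dataclass
class Instance:
    screenshot_path: str   # path to the web page screenshot
    question: str          # task prompt (e.g., an element OCR or grounding query)
    choices: list[str]     # candidate answers for a multiple-choice task
    answer_index: int      # index of the gold answer


def query_mllm(screenshot_path: str, question: str, choices: list[str]) -> int:
    """Placeholder for an MLLM call; returns the index of the predicted choice."""
    # A real implementation would send the screenshot and prompt to a model
    # (open-source or API-based) and parse its answer into a choice index.
    return 0


def accuracy(instances: list[Instance]) -> float:
    """Fraction of instances where the predicted choice matches the gold answer."""
    if not instances:
        return 0.0
    correct = sum(
        query_mllm(x.screenshot_path, x.question, x.choices) == x.answer_index
        for x in instances
    )
    return correct / len(instances)


if __name__ == "__main__":
    demo = [Instance("page.png", "Which element is the search box?", ["A", "B", "C", "D"], 2)]
    print(f"accuracy: {accuracy(demo):.3f}")
```

A real harness would replace `query_mllm` with calls to the model under evaluation and add task-specific metrics where exact-match accuracy does not apply (e.g., ROUGE for captioning-style tasks).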
Authors: Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue