
WebClick Evaluation Dataset

Updated 30 June 2025
  • WebClick Evaluation Dataset is a benchmark offering real web page screenshots and textual instructions to test precise UI element localization.
  • It integrates multiple data sources, including agentic and human interaction logs, to reliably measure click accuracy in diverse web environments.
  • The dataset’s structured format and open release accelerate advances in web navigation, vision-language integration, and agentic system design.

The WebClick Evaluation Dataset refers to a class of datasets and benchmarks centered on the interaction between users (or agents) and web interfaces, with a focus on click behavior, web UI localization, and log-based relevance signals. These datasets have become central to advances in information retrieval, web navigation by agents, and user interface understanding, offering large-scale empirical foundations for academic research and system development.

1. Definitions and Scope

The WebClick Evaluation Dataset, as exemplified most recently by the “WebClick” benchmark (2506.02865), denotes a structured, open-access collection designed to evaluate the ability of models—especially vision-language agents—to localize, interpret, and act upon elements within web interfaces, using real or realistic web page screenshots paired with action-oriented instructions. This contrasts with prior click log datasets focused primarily on query-document relevance signals (e.g., ORCAS (2006.05324), TripClick (2103.07901), MS MARCO Web Search (2405.07526), CWRCzech (2405.20994)), in that WebClick centers specifically on interface-level interaction localization as a primitive for agentic autonomy.

2. Dataset Structure and Collection

2.1 Data Format

Each instance in the WebClick Evaluation Dataset comprises:

  • A screenshot of a real web page, captured either during human or agentic task execution.
  • A textual instruction describing the UI element to be localized or the action to be performed (e.g., “Click the Schedule button,” “Open June in the calendar”).
  • An annotated bounding box indicating the ground-truth region corresponding to the target UI element for interaction.

This structure generalizes and extends the “screenspot” paradigm, which is foundational in GUI localization.
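A minimal sketch of such an instance as a typed record follows; the field names and the bounding-box convention are illustrative assumptions, and the released schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class WebClickInstance:
    """One benchmark item: a screenshot, an instruction, and a gold bounding box."""
    screenshot_path: str                     # real web page screenshot (image file)
    instruction: str                         # e.g. "Click the Schedule button"
    bbox: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in pixels
    subset: str                              # assumed label: "agent", "human", or "calendar"
```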

2.2 Sources and Domains

The dataset amalgamates data from:

  • Agentic traces: Web pages interacted with and screenshots collected by automated agents performing multi-step web tasks (as in WebVoyager).
  • Human navigation logs: Real user or annotator sessions performing web actions.
  • Calendar-specific subsets: Isolated screenshots and annotations of complex, dynamic calendar UI elements, which are known to be particularly challenging for localization models.

The most recent WebClick release includes 1,639 screenshots spanning more than 100 distinct websites. Emphasis is placed on capturing a wide range of web genres, layouts, and interaction patterns, with special attention to tasks where general-purpose vision-LLMs exhibit high error rates (e.g., nested menus, dynamic content, non-standard controls) (2506.02865).

3. Evaluation Metrics and Benchmarks

3.1 Primary Metric: Click Accuracy

The principal measure is click accuracy, defined as the proportion of model-predicted (x, y) click coordinates that fall inside the annotated bounding box, given both the screenshot and textual instruction.
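A minimal sketch of this metric, assuming axis-aligned ground-truth boxes given as (x_min, y_min, x_max, y_max) in the same pixel coordinates as the predicted click:

```python
from typing import Iterable, Tuple

Click = Tuple[float, float]              # predicted (x, y) coordinates
Box = Tuple[float, float, float, float]  # gold (x_min, y_min, x_max, y_max)

def click_accuracy(preds: Iterable[Click], boxes: Iterable[Box]) -> float:
    """Fraction of predicted clicks that land inside their gold bounding box."""
    pairs = list(zip(preds, boxes))
    hits = sum(
        1
        for (x, y), (x0, y0, x1, y1) in pairs
        if x0 <= x <= x1 and y0 <= y <= y1
    )
    return hits / len(pairs) if pairs else 0.0
```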

3.2 Subset Evaluation

Performance is reported across distinct splits:

  • Human-origin subset: Data derived purely from human web task execution.
  • Agent-collected subset: Data generated through agent interaction.
  • Calendar subset: Web-based calendar tasks, which represent a key difficulty for existing models.

Results are expressed as percentages; state-of-the-art performance exceeds 88% on the human subset but only just surpasses 65% on the calendar subset, substantiating the heightened challenge posed by the latter.

Model             Agent (%)   Calendar (%)   Human (%)   Avg (%)
Holo1-3B              83.02          65.91       88.80     73.55
Qwen2.5-VL-3B         76.26          51.70       85.07     65.51
UGround-V1-2B         84.41          50.76       78.50     67.15

This tabulation underscores both the overall difficulty and the discriminative power of specialized subsets in benchmarking models.

4. Significance for Agentic and Vision-Language Systems

The WebClick Evaluation Dataset is distinguished from traditional click log IR datasets by its utility as a fine-grained testbed for UI localization capabilities within web navigation agents. Its introduction enables several advances:

  • Targeted evaluation of state-to-action components in agent architectures, specifically the grounding of language in web UI context (2506.02865):

(\text{thought}_{t+1}, \text{notes}_{t+1}, \text{action}_{t+1}) \sim \pi\left(\text{task}, \{\text{thought}_k, \text{action}_k, \text{notes}_k, \text{screenshot}_k\}_{t-3 < k \leq t}\right)

The dataset provides gold-standard labels for the action output, i.e., predicting the correct click location (see the schematic sketch after this list).

  • Diagnosis of hard failure cases—calendars, dynamic menus—where generalist VLMs underperform, guiding architecture and data augmentation improvements.
  • Reduction of feedback cycles for system development: Benchmarking the localization module enables measurable, reproducible iteration in the design of end-to-end web agents (e.g., Surfer-H powered by Holo1), where localization accuracy directly translates to overall task completion rates.
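A schematic sketch, not the authors' implementation, of how a localization module slots into such a policy: the policy proposes an action in language, and a grounding model maps a click-target description to (x, y) coordinates, which is precisely the prediction WebClick scores against the annotated box. All names below are illustrative.

```python
from typing import Callable, Dict, List, Tuple

def agent_step(
    task: str,
    history: List[Dict],
    screenshot: bytes,
    policy: Callable[[str, List[Dict], bytes], Tuple[str, str, Dict]],
    localize: Callable[[bytes, str], Tuple[float, float]],
) -> Dict:
    """One state-to-action step: sample (thought, notes, action), then ground clicks."""
    thought, notes, action = policy(task, history, screenshot)
    if action.get("type") == "click":
        # Ground the natural-language description of the target element to pixels;
        # WebClick evaluates exactly this prediction against the gold bounding box.
        action["x"], action["y"] = localize(screenshot, action["element"])
    history.append({"thought": thought, "notes": notes,
                    "action": action, "screenshot": screenshot})
    return action
```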

5. Relationship to Prior WebClick/Click Log Datasets

It is important to distinguish the WebClick Evaluation Dataset described above from traditional web click log datasets used in Information Retrieval:

  • ORCAS (2006.05324), TripClick (2103.07901), MS MARCO Web Search (2405.07526), and CWRCzech (2405.20994) provide large-scale query-document click graphs for learning-to-rank, semantic relatedness, or click modeling. These focus on textual search and use click data to infer document relevance.
  • The “WebClick” Evaluation Dataset uniquely centers on visual UI interaction: mapping multi-modal context (screenshot + text) to spatial UI targets as a primitive for agentic behavior.

The two classes of datasets complement each other: WebClick supports the development and evaluation of web navigation agents at the UI/action granularity, whereas classic click log datasets address document retrieval relevance and search modeling.

6. Implications and Accessibility

The open release of WebClick (2506.02865) under a permissive license enables broad benchmarking for the vision-language, reinforcement learning, and web automation communities. The inclusion of agentic and human traces supports robust analysis of generalization, training on synthetic or agentic traces vs. human-distribution tasks, and comparative evaluation against other UI localization benchmarks (e.g., Screenspot, GroundUI).
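A minimal loading sketch, assuming the release is hosted on the Hugging Face Hub; the dataset path and field names below are placeholders rather than details confirmed by the source.

```python
# pip install datasets
from datasets import load_dataset

# Hub path assumed; substitute the identifier given in the WebClick release.
ds = load_dataset("Hcompany/WebClick", split="test")

example = ds[0]
# Field names are assumptions about the schema.
print(example["instruction"])
print(example["bbox"])
```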

By providing challenging, targeted, and reproducible evaluation scenarios, WebClick is positioned to accelerate methodological advances in web agent research, promote transparent comparison across models and systems, and inform the design of future agent architectures for open web environments.

References

  • “Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights” (2506.02865)
  • “ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search” (2006.05324)
  • “TripClick: The Log Files of a Large Health Web Search Engine” (2103.07901)
  • “MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels” (2405.07526)
  • “CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking” (2405.20994)