
ScreenSpot-Pro: GUI Grounding Benchmark

Updated 29 December 2025
  • ScreenSpot-Pro is a benchmark for assessing grounding in professional, high-resolution desktop GUIs with minuscule, cluttered interface elements.
  • It evaluates multimodal models using 1,581 expert-annotated screenshot–instruction pairs across 23 applications in diverse professional domains.
  • Its point-in-box accuracy metric and spatial search-space reduction strategies have driven advances in complex GUI automation and interpretability.

ScreenSpot-Pro is a benchmark for assessing the grounding capabilities of multimodal large language models (MLLMs) in professional, high-resolution desktop software environments. It was introduced to expose and quantify the fundamental challenges faced by GUI grounding agents—especially in settings where target elements are visually minuscule, interfaces are cluttered, and context is highly complex. Unlike prior benchmarks focused on web or mobile use cases, ScreenSpot-Pro centers on authentic workflows from domains such as engineering, creative design, scientific analysis, and office productivity, providing a rigorous evaluation standard for agents intended for advanced desktop automation (Li et al., 4 Apr 2025).

1. Benchmark Definition and Design Principles

ScreenSpot-Pro targets the problem of grounding free-form natural language instructions to the precise locations of interactable GUI elements in high-fidelity, full-screen desktop screenshots. The benchmark consists of 1,581 unique screenshot–instruction pairs, each annotated by domain experts (with ≥5 years’ experience), covering 23 professional applications across five domains (development, creative, CAD & engineering, scientific, office productivity) and three operating systems (Windows, macOS, Linux). Annotation protocol requires delineating the exact clickable region and writing an accompanying instruction in real time; each instance receives double review for clarity and box precision (Li et al., 4 Apr 2025).

Typical screenshots exceed 1920×1080, often reaching 5120×2880 or spanning dual monitors, and are captured with OS scaling disabled to preserve element density. Key statistics include an average target size of just 0.07% of the screen area (versus 2.01% for the predecessor ScreenSpot dataset) and a text-to-icon split of 62.6%/37.4%. Approximately 35% of samples are at 2560×1440 or higher. This design stresses grounding models with small, densely packed control elements in semantically rich, multitasking environments.
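
As a concrete illustration of the scale statistic above, the relative target area can be computed directly from a bounding-box annotation; the box format and example figures below are illustrative, not drawn from the released annotations.

```python
def relative_target_area(box, screen_w, screen_h):
    """Fraction of the screenshot covered by a target box.

    box is (x1, y1, x2, y2) in pixels; a result of ~0.0007 corresponds to the
    reported 0.07% average target size for ScreenSpot-Pro.
    """
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (screen_w * screen_h)

# Example: a 48x36 px toolbar icon on a 5120x2880 screenshot covers ~0.012%.
print(relative_target_area((100, 200, 148, 236), 5120, 2880))  # 0.000117...
```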

2. Task Formulation and Evaluation Metrics

The primary task is end-to-end GUI grounding. Given a screenshot $I$ and an instruction $T$, the model is required to produce a bounding box $\hat{b} = (\hat{x}_1, \hat{y}_1, \hat{x}_2, \hat{y}_2)$ such that the center $\hat{c} = ((\hat{x}_1+\hat{x}_2)/2, (\hat{y}_1+\hat{y}_2)/2)$ falls within the ground-truth annotated region $b^*$. Accuracy is measured as the fraction of queries where this point-in-box condition holds:

$$\text{Accuracy} = \frac{\#\{\hat{c} \in b^*\}}{\text{Total queries}}$$

Unlike object detection benchmarks that emphasize Intersection over Union (IoU) thresholds and multi-object localization, ScreenSpot-Pro’s design reflects agent use cases—what matters is where to click, not how well a region matches an ambiguous visual blob. As such, all evaluation is conducted using this point-in-box metric (Li et al., 4 Apr 2025, Tang et al., 21 Jul 2025, Hsieh et al., 30 Jun 2025, Lian et al., 29 Jul 2025).

In related work, IoU and mAP are calculated for additional diagnostic purposes, but the ScreenSpot-Pro leaderboard and most papers report only grounding accuracy.
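
For concreteness, the success criterion can be checked in a few lines; the sketch below is not the official evaluation script and assumes boxes are given as (x1, y1, x2, y2) pixel tuples.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def point_in_box(point, box):
    """True if the point lies inside the box (the ScreenSpot-Pro hit criterion)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(pred_boxes, gt_boxes):
    """Fraction of queries whose predicted-box center falls in the ground-truth region."""
    hits = sum(point_in_box(center(p), g) for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```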

3. Baseline Model Performance and Failure Modes

ScreenSpot-Pro presents a substantially more difficult challenge than prior GUI grounding datasets. Under the standard, uncropped full-screen evaluation, specialized 7B-parameter models achieve modest scores:

  • OS-Atlas-7B attains 18.9%.
  • UGround-7B attains 16.5%, and AriaUI 11.3%.
  • Generalist MLLMs such as Qwen2-VL-7B and GPT-4o perform below 2%.

Among models under 4B parameters, Qwen-GUI-3B achieves 28.7%, surpassing all previous models of similar size, while typical baselines such as ShowUI-2B and UI-TARS-2B range from 7.7% to 27.7% (Hsieh et al., 30 Jun 2025).

Key failure types are:

  • Small target sizes: Accuracy degrades rapidly as box area decreases, due to both spatial-resolution bottlenecks and the overwhelming visual context.
  • Icon-only targets: Models handle non-textual elements poorly, with accuracy typically below 5% on icon-only grounding, owing to the lack of cross-modal anchoring and the absence of textual cues.
  • Contextual confusion and occlusion: Overlapping windows and external (non-relevant) documents frequently mislead both generalist and specialist models (Li et al., 4 Apr 2025, Lian et al., 29 Jul 2025).
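
One way to reproduce such breakdowns is to bucket per-sample outcomes by element type or target size; the field names below are illustrative rather than the benchmark's official metadata keys.

```python
from collections import defaultdict

def accuracy_by_group(samples, key):
    """Group per-sample results and report grounding accuracy per group.

    samples: iterable of dicts with a boolean 'hit' plus metadata fields,
    e.g. 'ui_type' in {'text', 'icon'} or a precomputed target-area bucket.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for s in samples:
        g = key(s)
        totals[g] += 1
        hits[g] += int(s["hit"])
    return {g: hits[g] / totals[g] for g in totals}

# e.g. accuracy_by_group(results, key=lambda s: s["ui_type"])
```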

4. Search Space Reduction: Heuristics and Planning-Based Approaches

A central empirical finding is that constraining the search area, even heuristically, produces significant gains. Three multi-round “planner-free” cropping strategies—applied to OS-Atlas-7B—are illustrative:

  • Iterative Zooming (predict the cell in a $2 \times 2$ patch grid, then repeat): 31.0%.
  • Iterative Narrowing (progressively smaller crops centered on prior prediction): 31.9%.
  • ReGround (crop once around the initial prediction and ground again): 40.2%.

These baseline improvements over the single-pass 18.9% accuracy establish spatial reduction as a primary lever for improving GUI grounding (Li et al., 4 Apr 2025).
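
A minimal sketch of the iterative-narrowing idea is shown below; it assumes a hypothetical `ground(image, instruction)` callable that returns an (x, y) click point in the coordinates of the crop it receives, and the shrink factor and step count are illustrative rather than the paper's settings.

```python
def iterative_narrowing(image, instruction, ground, steps=3, shrink=0.5):
    """Repeatedly re-ground inside a crop centered on the previous prediction.

    image: a PIL.Image full-screen screenshot; ground: callable returning
    an (x, y) point in the local coordinates of the crop it is given.
    Returns the final prediction mapped back to full-screen coordinates.
    """
    left, top = 0, 0
    crop = image
    for _ in range(steps):
        x, y = ground(crop, instruction)                 # local prediction
        gx, gy = left + x, top + y                       # map to global coords
        w, h = int(crop.width * shrink), int(crop.height * shrink)
        left = max(0, min(int(gx - w / 2), image.width - w))
        top = max(0, min(int(gy - h / 2), image.height - h))
        crop = image.crop((left, top, left + w, top + h))
    x, y = ground(crop, instruction)
    return left + x, top + y
```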

ScreenSeekeR builds on this through cascaded visual search, coupling a strong planner (GPT-4o) with grounding models. It leverages GPT-4o's capability to hierarchically decompose instructions into ranked UI areas (for instance, "menu bar", "properties panel", or "neighbor elements") and refines the search over recursively cropped candidate regions. Gaussian scoring is employed to vote on candidate locations, followed by non-maximum suppression and recursive refinement until the target region is localized. This planner-guided reduction reaches 48.1% accuracy, roughly a 29-point absolute gain over the single-pass baseline and about 8 points over the strongest planner-free heuristic (Li et al., 4 Apr 2025).

Interpretability is enhanced, as ScreenSeekeR outputs search trails that reflect the hierarchical reasoning typical of expert users.
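
The Gaussian scoring step can be approximated as below; the stride, sigma, and candidate weighting are placeholders for illustration, while the full ScreenSeekeR procedure (planner-ranked regions, non-maximum suppression, recursion) follows the paper.

```python
import numpy as np

def gaussian_vote_peak(candidates, screen_h, screen_w, sigma=40.0, stride=8):
    """Accumulate Gaussian votes from candidate (x, y, weight) predictions on a
    coarse score map and return the peak as a full-resolution (x, y) point."""
    h, w = screen_h // stride, screen_w // stride
    ys, xs = np.mgrid[0:h, 0:w] * stride
    score = np.zeros((h, w), dtype=np.float64)
    for x, y, weight in candidates:
        score += weight * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    py, px = np.unravel_index(np.argmax(score), score.shape)
    return px * stride, py * stride
```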

5. Advances in Reward Modeling and Training Protocols

GUI grounding training was traditionally driven by sparse, binary in-box rewards. GUI-G² reformulates this as a dense, continuous optimization problem by modeling each target as a 2D Gaussian:

  • Gaussian Point Reward: $r_p(x) = \exp(-\|x-\mu\|^2 / 2\sigma^2)$ encourages center-focused predictions.
  • Gaussian Coverage Reward: Uses the Bhattacharyya coefficient to reward overlap between the predicted and ground-truth Gaussian distributions.
  • Adaptive Variance: Scales $\sigma_x, \sigma_y$ with the box size, supporting both small icons and large panels.

In reinforcement learning, the rewards are combined as a weighted sum ($R_\text{total} = \nu R_\text{point} + \gamma R_\text{coverage}$) and optimized with Group Relative Policy Optimization (GRPO). GUI-G² achieves 47.5% accuracy on ScreenSpot-Pro, up from the UI-TARS-72B baseline (38.1%), and empirically shows improved robustness and generalization, especially in novel layouts (Tang et al., 21 Jul 2025).
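
A sketch of the two reward terms under the notation above is given below; treating each box as an axis-aligned Gaussian with sigma proportional to its side length is an assumption for illustration, and the exact scaling and normalization follow the GUI-G² paper.

```python
import math

def gaussian_point_reward(pred_center, target_center, sigma):
    """r_p(x) = exp(-||x - mu||^2 / (2 sigma^2)), peaked at the target center."""
    dx = pred_center[0] - target_center[0]
    dy = pred_center[1] - target_center[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))

def bhattacharyya_coeff_1d(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient between two 1-D Gaussians."""
    var = (s1 * s1 + s2 * s2) / 2
    return math.sqrt(s1 * s2 / var) * math.exp(-((mu1 - mu2) ** 2) / (8 * var))

def gaussian_coverage_reward(pred_box, gt_box, k=0.25):
    """Treat each box as an axis-aligned Gaussian (sigma = k * side length)
    and reward the product of per-axis Bhattacharyya coefficients."""
    def to_gauss(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2, k * (x2 - x1), k * (y2 - y1))
    mx1, my1, sx1, sy1 = to_gauss(pred_box)
    mx2, my2, sx2, sy2 = to_gauss(gt_box)
    return (bhattacharyya_coeff_1d(mx1, sx1, mx2, sx2) *
            bhattacharyya_coeff_1d(my1, sy1, my2, sy2))
```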

UI-AGILE introduces further refinements:

  • Continuous Grounding Reward: Distance-decaying with the Chebyshev norm to focus learning on the element center (see the sketch after this list).
  • “Simple Thinking” Reward: Regularizes reasoning chain length to minimize unnecessary chain-of-thought while avoiding degeneration in non-tap actions.
  • Cropping-Based Resampling: Dynamically rescales difficult samples to focus learning, salvaging 38% of zero-reward cases in the early epochs and yielding approximately 12% absolute gain.
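
A minimal sketch of the continuous grounding reward from the first bullet; the exponential decay constant is a placeholder rather than UI-AGILE's exact formulation.

```python
import math

def chebyshev_reward(pred_point, gt_box, decay=0.05):
    """Reward that decays with the Chebyshev (L-infinity) distance from the
    predicted point to the center of the ground-truth element."""
    x, y = pred_point
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    d = max(abs(x - cx), abs(y - cy))   # Chebyshev distance in pixels
    return math.exp(-decay * d)         # 1.0 at the exact element center
```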

At inference time, Decomposed Grounding with Selection splits high-res images into overlapping sub-images, independently grounds each, then selects via a general VLM back-end QA filter. The full pipeline achieves 48.7% accuracy—representing a 23% relative improvement over the best prior 7B baseline (JEDI-7B at 39.5%) (Lian et al., 29 Jul 2025).
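
The decomposition step can be approximated as follows; tile size and overlap are assumptions, and candidate selection would be delegated to a general VLM as described above.

```python
def overlapping_tiles(width, height, tile=1344, overlap=224):
    """Yield (left, top, right, bottom) crops covering a high-resolution
    screenshot with overlap, so elements on tile borders appear whole in
    at least one crop."""
    step = tile - overlap
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            yield (left, top, min(left + tile, width), min(top + tile, height))

# Each tile is grounded independently; a general VLM then answers which
# candidate best matches the instruction (the selection filter).
```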

6. Latest Architectures and State-of-the-Art Performance

The most recent advances use active perception and self-evolving preference learning. LASER adopts a multi-step perception paradigm:

  • Single-/Multi-step Perception: Alternates between “Crop” and “Click” operations, adapting the number of steps to task complexity.
  • Monte Carlo Quality Estimation: Aggregates multiple rollouts to estimate the probability of correct guidance.
  • IoU-Based Diversity: Rewards trajectories that explore instruction-relevant but diverse regions.

Preference pairs are filtered by both high-confidence accuracy and sufficient diversity ($\mathcal{R}_\text{acc} > \delta$, $\mathcal{R}_\text{div} < \tau$), then used for Direct Preference Optimization (DPO). When LASER fine-tuning is applied to GTA1-7B, the model attains 55.7% average accuracy on ScreenSpot-Pro, establishing a new state of the art among 7B models (Wang et al., 4 Sep 2025).
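
The preference-pair filter implied by these thresholds can be sketched as below; the record fields and threshold values are placeholders rather than LASER's released configuration.

```python
def filter_preference_pairs(pairs, delta=0.8, tau=0.5):
    """Keep (chosen, rejected) trajectory pairs whose Monte Carlo accuracy
    estimate exceeds delta and whose IoU-based region similarity stays below
    tau, i.e. the pair is both reliable and sufficiently diverse."""
    kept = []
    for chosen, rejected, acc_estimate, region_iou in pairs:
        if acc_estimate > delta and region_iou < tau:
            kept.append((chosen, rejected))
    return kept
```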

7. Impact, Limitations, and Future Directions

ScreenSpot-Pro has established itself as the definitive stress-test for GUI grounding in professional domains. By exposing the shortcomings of existing approaches—particularly the inability to handle small, nontextual, and contextually entangled targets—ScreenSpot-Pro has reshaped both the evaluation and methodological landscape. Model advances, such as those integrating spatial reduction (ScreenSeekeR), dense reward shaping (GUI-G²), inference decomposition (UI-AGILE), and active, multi-step perception (LASER), have collectively increased the screen grounding ceiling from sub-20% to above 55% (Li et al., 4 Apr 2025, Tang et al., 21 Jul 2025, Lian et al., 29 Jul 2025, Wang et al., 4 Sep 2025).

Key limitations remain in icon understanding and in generalization to entirely novel professional GUIs. Ongoing research focuses on:

  • End-to-end planning, grounding, and closed-loop execution.
  • Training vision-LLMs on high-resolution, desktop-native GUI data to obviate the need for external planners.
  • Synthetic and domain-specific pretraining for improved recognition of complex and iconographic elements.

ScreenSpot-Pro’s annotated corpus and leaderboard continue to serve as a critical catalyst for developing practical, robust GUI agents suited for the demands of advanced desktop environments.
