ScreenSpot-V2: GUI Grounding Benchmark

Updated 1 October 2025
  • ScreenSpot-V2 is a benchmark for GUI grounding that provides unambiguous mappings of natural language instructions to pixel-level on-screen targets.
  • Research built on the benchmark employs continuous reward modeling and coverage metrics to refine vision-language model performance across mobile, desktop, and web platforms.
  • Empirical results and leaderboards demonstrate state-of-the-art strategies, including Gaussian reward techniques and inference-time scaling for enhanced GUI interaction.

ScreenSpot-V2 refers to a prominent benchmark for evaluating Graphical User Interface (GUI) grounding—the task of mapping natural language instructions or textual cues to precise on-screen targets, typically as pixel coordinates or bounding regions. ScreenSpot-V2 evolved to address annotation ambiguities found in the earlier ScreenSpot benchmark, covering both mobile and desktop environments with rigorous, unambiguous test data for high-fidelity assessment. State-of-the-art work on ScreenSpot-V2 provides critical milestones for the development of robust vision-language models (VLMs), GUI agents, and grounding methodologies. The following sections provide a comprehensive technical overview of ScreenSpot-V2: its benchmark characteristics, leading model approaches, core methodological innovations, empirical results, and broader research significance.

1. Benchmark Composition and Evaluation Protocol

ScreenSpot-V2 consists of a diverse set of pre-cropped GUI images—sampled from mobile, desktop, and web-based platforms—accompanied by natural language instructions and corresponding ground-truth coordinates or bounding boxes indicating the target GUI elements. The dataset is curated to minimize annotation ambiguity, providing clear mappings between instructions and GUI targets. Evaluation typically adopts accuracy measures defined by whether the model-predicted coordinates fall within a specified tolerance of the annotated target region.
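
To make the protocol concrete, the snippet below shows the standard hit-based scoring for point predictions: a prediction counts as correct when the predicted click falls inside the annotated target box. The field name (`bbox`) and the exact tolerance rule are assumptions for this sketch, not the benchmark's official evaluation code.

```python
def point_hit(pred_xy, gt_box):
    """A point prediction is correct if it falls inside the annotated
    target bounding box, given as (x1, y1, x2, y2) in pixels."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2


def screenspot_accuracy(predictions, annotations):
    """Average hit rate over all test cases (illustrative field name 'bbox')."""
    hits = sum(point_hit(pred, ann["bbox"]) for pred, ann in zip(predictions, annotations))
    return hits / len(annotations)
```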

The test suite is widely used to assess both “point-based” (predicting the central click coordinate) and “region-based” (predicting a bounding box) grounding strategies. Model performance is often compared in terms of average accuracy over all test cases, and versioned leaderboards track incremental improvements as new techniques emerge (Hsieh et al., 30 Jun 2025, Tang et al., 21 Jul 2025, Lian et al., 29 Jul 2025, Du et al., 7 Aug 2025, Wang et al., 4 Sep 2025).

2. Methodological Innovations: Reward Modeling and Training Paradigms

ScreenSpot-V2 has become a catalyst for several methodological advances in GUI grounding. A key development is the shift from sparse, binary reward formulations—where only exact target hits are rewarded—to dense, continuous reward functions that provide nuanced spatial feedback during model training (Tang et al., 21 Jul 2025, Lian et al., 29 Jul 2025).

  • Continuous Gaussian Reward Modeling: GUI-G² models the target region as a 2D Gaussian distribution, yielding a reward function

R_\mathrm{point} = \exp\left\{ -\frac{1}{2}\left( \frac{(c_x^p - c_x^\mathrm{gt})^2}{\sigma_x^2} + \frac{(c_y^p - c_y^\mathrm{gt})^2}{\sigma_y^2} \right) \right\}

with adaptive variances \sigma_x, \sigma_y proportional to the element size. This modeling provides smooth, informative gradients and aligns with observed human click distributions.

  • Coverage Rewards and Bhattacharyya Coefficients: To reward spatial alignment (not just center precision), methods compute the overlap between predicted and target Gaussian regions using closed-form statistics (a sketch of both the point and coverage rewards follows this list).
  • Self-Evolving Preference Optimization: LASER leverages Monte Carlo rollouts and IoU-based region diversity to select instruction-relevant, high-quality focus regions. Direct Preference Optimization (DPO) is applied to preference pairs without requiring human-annotated labels (Wang et al., 4 Sep 2025).
  • Inference-Time Scaling and Selection: UI-AGILE and related frameworks employ image decomposition during inference, splitting large screenshots into subregions to manage high resolution and visual noise. Candidate outputs are adjudicated by VLM-based scorers (Lian et al., 29 Jul 2025).
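
A minimal sketch of these two reward shapes is given below, assuming axis-aligned boxes and a scale factor `alpha` that ties the Gaussian standard deviations to element width and height; the box-to-Gaussian conversion and the specific constants are illustrative choices, not the GUI-G² reference implementation.

```python
import math

def gaussian_point_reward(pred_xy, gt_box, alpha=0.5):
    """Continuous point reward: treat the target box as a 2D Gaussian and
    score the predicted click by its unnormalized density at that point.
    alpha ties sigma to element size (assumed value for illustration)."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # ground-truth centre
    sx = max(alpha * (x2 - x1), 1e-6)           # adaptive sigma_x
    sy = max(alpha * (y2 - y1), 1e-6)           # adaptive sigma_y
    px, py = pred_xy
    z = ((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2
    return math.exp(-0.5 * z)


def coverage_reward(pred_box, gt_box, alpha=0.5):
    """Coverage reward: Bhattacharyya coefficient between the axis-aligned
    Gaussians induced by the predicted and ground-truth boxes."""
    def to_gaussian(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0,
                max(alpha * (x2 - x1), 1e-6), max(alpha * (y2 - y1), 1e-6))

    def bc_1d(mu1, s1, mu2, s2):
        # Closed-form Bhattacharyya coefficient of two 1-D Gaussians.
        dist = 0.25 * (mu1 - mu2) ** 2 / (s1 ** 2 + s2 ** 2) \
               + 0.5 * math.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2))
        return math.exp(-dist)

    cxp, cyp, sxp, syp = to_gaussian(pred_box)
    cxg, cyg, sxg, syg = to_gaussian(gt_box)
    # Axis-aligned Gaussians factorize, so the 2-D coefficient is a product.
    return bc_1d(cxp, sxp, cxg, sxg) * bc_1d(cyp, syp, cyg, syg)
```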

3. Leading Models and Fine-Tuning Strategies

Several model families establish accuracy milestones on ScreenSpot-V2, each employing distinct strategies to address sample efficiency, adaptation, and grounding precision.

  • ZonUI-3B: A 3B-parameter, LoRA-adapted VLM, trained via a two-stage strategy (cross-platform pretraining, followed by resolution-targeted fine-tuning). The corpus is constructed from diverse sources, with redundancy reduction through random sampling. Balanced training across platforms and resolutions increases overall and desktop-specific accuracy (Hsieh et al., 30 Jun 2025).
  • GUI-G²: Introduces continuous Gaussian reward modeling for reinforcement learning in GUI agents. An adaptive variance mechanism ensures rewards scale with variable element sizes, yielding stable convergence and minimizing discontinuities during optimization. Experimentally, a +5.9% accuracy improvement over sparse-reward baselines is reported (Tang et al., 21 Jul 2025).
  • UI-AGILE: Integrates three training innovations—continuous grounding rewards (Chebyshev-centered), “Simple Thinking” rewards to optimize reasoning length, and cropping-based resampling for curriculum-inspired learning. During inference, decomposed grounding with VLM-based selection allows precise localization in high-resolution screens (Lian et al., 29 Jul 2025).
  • GUI-RC / GUI-RCPO: Provide test-time performance gains with no additional data or retraining. GUI-RC builds a spatial consensus map via multi-sample voting, while GUI-RCPO leverages the consensus as a self-supervised reward for policy optimization. GUI-RCPO enables iterative self-improvement and is shown to lift ScreenSpot-V2 performance by roughly 5% (Du et al., 7 Aug 2025); a simplified voting sketch follows this list.
  • LASER: Endows VLMs with active, multi-step cropping and click prediction. Preference optimization uses Monte Carlo and IoU quality signals to create high-value focus regions. LASER demonstrates that a self-evolving, reasoning-adaptive approach can outperform larger models on both ScreenSpot-V2 and the full-screen ScreenSpot-Pro (Wang et al., 4 Sep 2025).
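
The region-consistency idea behind GUI-RC can be illustrated with a short sketch. The version below is simplified to point predictions on a pixel-vote grid with a fixed neighbourhood radius; the published method aggregates over regions and uses its own sampling and scoring rules.

```python
import numpy as np

def consensus_click(samples, screen_hw, radius=7):
    """Simplified spatial-consensus voting: each sampled prediction casts
    votes in a small neighbourhood, and the cell with the most votes is
    returned as the final click. `radius` is an assumed hyperparameter."""
    h, w = screen_hw
    votes = np.zeros((h, w), dtype=np.int32)
    for x, y in samples:                      # samples: iterable of (x, y) predictions
        x, y = int(round(x)), int(round(y))
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        votes[y0:y1, x0:x1] += 1
    iy, ix = np.unravel_index(int(votes.argmax()), votes.shape)
    return int(ix), int(iy)                   # consensus (x, y) coordinate
```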

4. Empirical Results and Benchmark Impact

The following table summarizes characteristic ScreenSpot-V2 accuracy results (as reported in the source literature):

| Model/Method    | Params | Reward/Training         | Accuracy (%) |
|-----------------|--------|-------------------------|--------------|
| ZonUI-3B        | 3B     | SFT, LoRA, 2-stage FT   | 86.4         |
| GUI-G²-7B       | 7B     | Gaussian RL             | 93.3         |
| UI-AGILE-7B     | 7B     | Cont. reward + resample | 92.1         |
| GUI-RC-Qwen3B   | 3B     | Test-time RC scaling    | 83.6         |
| GUI-RCPO-Qwen3B | 3B     | Test-time RL            | 85.1         |
| LASER-GTA1-7B   | 7B     | Active DPO self-opt.    | SoTA         |

All values are derived directly from the cited studies (Hsieh et al., 30 Jun 2025, Tang et al., 21 Jul 2025, Lian et al., 29 Jul 2025, Du et al., 7 Aug 2025, Wang et al., 4 Sep 2025). GUI-G² and UI-AGILE represent the current state-of-the-art, with LASER’s multi-step, adaptive reasoning achieving best-in-class results among 7B-scale models.

Ablation experiments consistently highlight that reward shaping (continuous, Gaussian, coverage), balanced and multi-resolution data sampling, and inference-time decomposition or consensus-based scaling are decisive factors for closing the gap between compact and large models.

5. Architectural Patterns and Implementation Practices

ScreenSpot-V2 benchmarks have driven the adoption of several model and training architectural choices:

  • Two-stage and Progressive Fine-Tuning: Models first acquire general GUI priors via cross-platform data before further specialization on target domains (e.g., high-resolution desktop) (Hsieh et al., 30 Jun 2025).
  • Efficient Adaptation Layers: LoRA adaptation allows compact VLMs to specialize to the GUI domain without full-model updates (a minimal configuration sketch follows this list).
  • Reinforcement Learning with Dense Perception Rewards: VLM policies are trained or refined by signals that reflect the actual continuous nature of click targets (Tang et al., 21 Jul 2025).
  • Test-Time Optimization: Region consistency voting (GUI-RC) and policy optimization (GUI-RCPO) adapt model responses at inference without extra annotation or retraining, enhancing robustness (Du et al., 7 Aug 2025).
  • Multi-step Active Reasoning: Rather than single-shot predictions, methods like LASER enable models to iteratively crop and focus, mirroring human interaction patterns in complex interfaces (Wang et al., 4 Sep 2025).
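
As a minimal sketch of the adaptation-layer pattern, the snippet below attaches LoRA adapters with the Hugging Face peft library; the backbone is a small stand-in model, and the rank, scaling, and target modules are illustrative assumptions rather than the settings reported by ZonUI-3B or the other systems above.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in backbone so the sketch runs anywhere; in practice the
# compact VLM being adapted would be loaded with its appropriate class.
base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=16,                                # low-rank dimension (assumed)
    lora_alpha=32,                       # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["c_attn"],           # attention projection of the stand-in model
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)   # only the LoRA adapters remain trainable
model.print_trainable_parameters()
```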

6. Research Directions and Significance

ScreenSpot-V2 serves as a crucible for the evaluation and development of machine perception in GUI contexts, particularly in:

  • Spatial Reasoning: By requiring precise alignment of instructions to on-screen elements, the benchmark reveals strengths and weaknesses in multimodal and vision-language reasoning.
  • Training Efficiency: The observed improvements from redundancy-reduced data sampling and self-supervised test-time optimization encourage exploration of data- and compute-efficient strategies.
  • Generalization and Robustness: The benchmark’s diversity across platforms, resolutions, and interface complexities tests cross-domain transferability and practical relevance.
  • Future Methodology: Techniques such as progressive active perception, reward shaping, and inference-time consensus aggregation are now central research themes for advancing GUI agent capabilities beyond ScreenSpot-V2.

ScreenSpot-V2 remains an influential benchmark for vision-language and interface reasoning research, underpinning design choices and performance targets for state-of-the-art GUI agents, with broader implications for automated software interaction, accessibility, and multimodal agent design.
