
GUIZoom-Bench: A Zoom Impact Benchmark

Updated 12 December 2025
  • GUIZoom-Bench is a systematic benchmark that measures the effect of zoom on GUI grounding, employing fixed protocols and metrics to assess model performance.
  • It defines key metrics such as per-iteration correctness and categorizes samples into behavioral types to diagnose when zoom aids or hinders localization.
  • The benchmark’s rigorous methodology and detailed analysis offer actionable insights for developing adaptive zoom-aware training and inference strategies.

GUIZoom-Bench is a systematic benchmark designed to quantify and characterize the impact of zooming as a prior in graphical user interface (GUI) grounding tasks. Developed from the need to move beyond ad-hoc use of zoom-in and progressive cropping in GUI grounding, GUIZoom-Bench provides rigorous protocols, metrics, and behavioral categorizations. Its primary purpose is to enable researchers to diagnose, interpret, and compare model adaptability to zoom at both per-sample and population levels, thereby informing the design of advanced zoom-aware training and test-time inference strategies (Jiang et al., 5 Dec 2025).

1. Motivation and Conceptual Foundations

The potential of zoom in GUI grounding has long been recognized but under-explored: before GUIZoom-Bench, no framework enabled systematic measurement of when zoom benefits or harms model performance. Standard benchmarks could not answer fundamental questions: when does zoom facilitate localization, when does it degrade it by removing necessary context or introducing distractors, and how much performance can be gained through aggressive application of zoom?

GUIZoom-Bench is intended to:

  • Quantify per-sample “first correctness depth” (the number of zoom steps until the model first localizes the target element correctly).
  • Measure “stability under further zoom” (whether correct localization persists or if the model becomes misled at deeper zooms).
  • Partition samples into interpretable categories that reveal “difficulty × reliability” trade-offs.
  • Serve as a diagnostic tool for evaluating zoom-aware training methods and test-time scaling techniques.

2. Benchmark Construction and Dataset Specification

GUIZoom-Bench is reorganized from the ScreenSpot-Pro test set, itself comprising approximately 3,000 high-resolution professional desktop screenshots annotated with fine-grained GUI element bounding boxes and natural language queries. The fixed zooming schedule and cropping protocol eliminate confounds, ensuring comparability across models:

  • Pre-zoom configuration: A 2×2 grid (K=4) over the full screenshot initializes the pre-zoom phase.
  • Zoom depth: Each sample undergoes up to T=4 zoom iterations.
  • Shrink ratio: At each zoom, the viewport contracts by ρ=0.5, halving both width and height.
  • Minimum crop size: No crop dimension falls below m=768 pixels.
  • Boundary handling: A “shift” strategy translates out-of-bounds crops back inside the image rather than resizing.
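
The crop-update rule implied by these settings can be written down compactly. The following Python sketch is illustrative only: the function name `zoom_step` and the choice to center each crop on the model's previous click are assumptions, not specified verbatim above.

```python
def zoom_step(viewport, center, img_w, img_h, rho=0.5, min_px=768):
    """One zoom iteration: shrink the viewport by `rho` around `center`
    (assumed here to be the model's previous click, in normalized [0, 1]
    coordinates), enforce the minimum crop size, and shift out-of-bounds
    crops back inside the image rather than resizing them."""
    x1, y1, x2, y2 = viewport
    cx, cy = center

    # Halve width/height (rho = 0.5), but keep each dimension >= min_px pixels.
    new_w = max((x2 - x1) * rho, min_px / img_w)
    new_h = max((y2 - y1) * rho, min_px / img_h)

    # Center the new viewport on the previous prediction.
    nx1, ny1 = cx - new_w / 2, cy - new_h / 2
    nx2, ny2 = nx1 + new_w, ny1 + new_h

    # "Shift" boundary handling: translate the crop back into [0, 1].
    if nx1 < 0:
        nx2, nx1 = nx2 - nx1, 0.0
    if ny1 < 0:
        ny2, ny1 = ny2 - ny1, 0.0
    if nx2 > 1:
        nx1, nx2 = nx1 - (nx2 - 1), 1.0
    if ny2 > 1:
        ny1, ny2 = ny1 - (ny2 - 1), 1.0

    return (nx1, ny1, nx2, ny2)
```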

Sample categories emerge from execution of a strong zoom-in pipeline (UI-Venus-72B with ZoomClick) across the dataset with the prescribed 4-step schedule. For each sample, a correctness vector $s_1 \ldots s_4$ is recorded, where $s_t = 1$ if the click at depth $t$ lands inside the ground-truth bounding box, and $s_t = 0$ otherwise.

The axis of difficulty is defined by the iteration of first correctness ($t = 1$ is easy, $t > 1$ is hard), while reliability is determined by sustained correctness in subsequent zooms (remaining $1$ is normal; any later $0$ indicates the model was misled). The crossing of these axes yields five behavioral categories (category statistics with counts and percentages follow):

Category        Count   Percentage
easy_normal       688      23%
easy_mislead      512      17%
hard_normal       914      30%
hard_mislead      704      23%
hard_est          182       6%

3. Task Definitions and Experimental Protocols

GUIZoom-Bench retains the canonical GUI grounding task: given a screenshot $I$ and a language query $q$, a grounding model $G$ predicts a normalized point $\hat{p} = (\hat{x}, \hat{y}) \in [0,1]^2$ in each zoomed viewport. The output is mapped into absolute pixel coordinates within the original screenshot via the viewport coordinates $V_t = (v_x^1, v_y^1, v_x^2, v_y^2)$:

$$p_{\mathrm{px}}^{(t)} = \bigl(\mathrm{round}\bigl(W \cdot (v_x^1 + (v_x^2 - v_x^1)\,\hat{x})\bigr),\ \mathrm{round}\bigl(H \cdot (v_y^1 + (v_y^2 - v_y^1)\,\hat{y})\bigr)\bigr)$$
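
In code, this mapping is a one-liner per coordinate; a minimal sketch (the function name is illustrative):

```python
def to_pixel_coords(p_hat, viewport, W, H):
    """Map a normalized prediction (x_hat, y_hat) in [0, 1]^2, made inside
    viewport V_t = (vx1, vy1, vx2, vy2), to absolute pixel coordinates in
    the original W x H screenshot."""
    x_hat, y_hat = p_hat
    vx1, vy1, vx2, vy2 = viewport
    x_px = round(W * (vx1 + (vx2 - vx1) * x_hat))
    y_px = round(H * (vy1 + (vy2 - vy1) * y_hat))
    return x_px, y_px
```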

The standardized test-time protocol is as follows:

  1. Start with the full-image viewport $V_0 = (0, 0, 1, 1)$.
  2. Pre-zoom: run $G$ on the full image and its four non-overlapping quadrants. Select either the global or a quadrant prediction according to whether any quadrant’s click is within $\tau = 50$ pixels of the full-image click (a sketch of this selection rule follows these steps).
  3. Iteratively crop the viewport with shrink ratio $0.5$, ensuring no crop dimension falls below $768$ pixels; shift crops back within bounds as necessary.
  4. Repeat for $T = 4$ steps, recording the correctness vector $s_1 \ldots s_4$.
  5. Assign each sample to one of the five behavioral categories based on the observed correctness sequence.
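
The selection rule in step 2 can be made concrete as follows. This is a hedged sketch: the protocol above states only the $\tau = 50$ pixel agreement test, so the choice to prefer the agreeing quadrant's click (rather than the global one) and the function name `select_prezoom` are assumptions.

```python
import math

def select_prezoom(global_click, quadrant_clicks, tau=50.0):
    """Pre-zoom selection (illustrative). If any of the four quadrant-level
    clicks agrees with the full-image click to within tau pixels, this
    sketch prefers that quadrant's click (an assumed tie-breaking
    direction); otherwise it falls back to the global click. All clicks
    are absolute pixel coordinates in the original screenshot."""
    gx, gy = global_click
    for qx, qy in quadrant_clicks:
        if math.hypot(qx - gx, qy - gy) <= tau:
            return (qx, qy)  # quadrant agrees with the global prediction
    return (gx, gy)          # no quadrant agrees: keep the global prediction
```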

4. Behavioral Analysis and Category Taxonomy

GUIZoom-Bench partitions the entire benchmark into five interpretable categories according to the sequence of correctness outcomes across depths:

  • easy_normal: The model is correct at the first iteration and remains correct throughout all subsequent zooms ($s_1 = 1$ and $s_t = 1$ for $t = 2, 3, 4$).
  • easy_mislead: The model is correct at the start, but becomes incorrect at one or more deeper zooms ($s_1 = 1$, some later $s_t = 0$).
  • hard_normal: The first correct localization occurs at depth $t > 1$ and remains correct thereafter ($s_1 = 0$, first $1$ at $t > 1$, no subsequent $0$s).
  • hard_mislead: The first correct localization occurs at $t > 1$ but is subsequently lost at deeper zooms ($s_1 = 0$, first $1$ at $t > 1$, followed by a later $0$).
  • hard_est: The model never localizes correctly up to the maximum zoom depth ($s_1 = \cdots = s_4 = 0$).
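
Under this taxonomy, assigning a category to a recorded correctness vector is mechanical; a minimal sketch (the function name is illustrative):

```python
def assign_category(s):
    """Map a correctness vector s = [s_1, s_2, s_3, s_4] (entries 0/1) to
    one of the five GUIZoom-Bench behavioral categories."""
    if 1 not in s:
        return "hard_est"                # never correct at any depth
    first = s.index(1)                   # 0-based index of the first correct depth
    misled = 0 in s[first + 1:]          # correctness lost at a deeper zoom
    difficulty = "easy" if first == 0 else "hard"
    reliability = "mislead" if misled else "normal"
    return f"{difficulty}_{reliability}"
```

For example, `assign_category([0, 1, 1, 1])` returns `hard_normal`, while `[1, 0, 0, 0]` maps to `easy_mislead`.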

This behavioral taxonomy exposes difficulty-reliability trade-offs and pinpoints both headroom and pathological regimes for model improvement.

5. Evaluation Metrics

The benchmark deploys three main metric families:

  • Per-iteration correctness: $s_t$, with $s_t = 1$ indicating inclusion of the predicted click at depth $t$ within the ground-truth bounding box.
  • Success Rate (grounding accuracy):

$$\mathrm{SuccessRate}@t = \frac{\sum_{i=1}^{N} s_{i,t}}{N}$$

Success rates are measured per depth ($t = 1, 2, 3, 4$) and overall (mean across depths or the single-shot baseline at $t = 1$).
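
Given the recorded correctness vectors, SuccessRate@t is a simple column mean; a small sketch assuming `S` holds one 4-entry vector per sample:

```python
def success_rates(S):
    """S: list of N correctness vectors [s_1, ..., s_4] with 0/1 entries.
    Returns SuccessRate@t for t = 1..4 and their mean."""
    N = len(S)
    per_depth = [sum(s[t] for s in S) / N for t in range(4)]
    return per_depth, sum(per_depth) / len(per_depth)
```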

  • Intersection-over-Union (IoU):

$$\mathrm{IoU}(B_{\mathrm{pred}}, B_{\mathrm{gt}}) = \frac{\mathrm{area}(B_{\mathrm{pred}} \cap B_{\mathrm{gt}})}{\mathrm{area}(B_{\mathrm{pred}} \cup B_{\mathrm{gt}})}$$

IoU is not required by GUIZoom-Bench but is available for further analysis, e.g., under an $\mathrm{IoU} \geq 0.5$ criterion, for compatibility with existing benchmarks.
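
For readers who want the optional IoU analysis, a standard axis-aligned IoU (not specific to GUIZoom-Bench) can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```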

Behavioral breakdowns report per-category success rates at each depth, illustrating characteristic transitions, e.g., large accuracy boosts between depths 1 and 2 for hard_normal samples, or drop-offs after depth 1 in easy_mislead cases.

6. Baseline Results and Interpretive Insights

Benchmark results enable direct, depth-by-depth, and category-specific comparisons among models such as Qwen3-VL-32B, UI-Venus-7B, and UI-Venus-72B. Sample success rates (depths 1–4) for each category and model are summarized:

Model           easy_normal    easy_mislead   hard_normal    hard_mislead   hard_est
Qwen3-VL-32B    56/92/92/92    24/36/45/48    30/70/67/67    22/46/46/54    10/27/26/25
UI-Venus-7B     73/84/86/86    31/38/40/43    28/47/55/58    19/35/35/30     9/15/16/16
UI-Venus-72B    82/92/93/92    34/51/70/85    65/85/76/72    27/45/46/54    13/27/26/25

Deep zoom notably benefits hard_normal samples (e.g., UI-Venus-72B: 65% → 85% between depths 1 and 2), confirming that zooming facilitates accurate localization of small or visually crowded targets. Specialized UI models (UI-Venus) often plateau after zoom step 2–3 and can be misled on easy samples by loss of contextual cues. General-purpose vision-language models (e.g., Qwen3-VL-32B) tend to improve until depth 3 and degrade only modestly at maximal zoom, reflecting stronger priors for global context. The hard_est category, where no correct localization is achieved at any zoom, remains below 20% of samples across all models, defining the current upper limit for zoom-only test-time improvement.

7. Implications and Future Directions

GUIZoom-Bench highlights key areas for research advancement:

  • The systematic exposure of when and why zooming aids, impairs, or fails to impact GUI grounding signals crucial directions for zoom-aware training protocols, such as multi-resolution supervision and dynamic resizing, to reduce the hard_est regime.
  • The prevalence of mislead events at higher zoom levels suggests a need for learned stopping rules or adaptive shrink ratios instead of static zoom schedules.
  • Persistent failures even with correct visual evidence (notably for relational queries using terms such as “first” or “last”) suggest that strengthening linguistic reasoning capabilities would further enhance GUI agent performance.

GUIZoom-Bench establishes zooming not as a black-box heuristic but as a quantifiable, interpretable dimension for GUI grounding model evaluation and optimization, enabling principled progress in both model architecture and training strategy research (Jiang et al., 5 Dec 2025).
