GUIZoom-Bench: A Zoom Impact Benchmark
- GUIZoom-Bench is a systematic benchmark that measures the effect of zoom on GUI grounding, employing fixed protocols and metrics to assess model performance.
- It defines key metrics such as per-iteration correctness and categorizes samples into behavioral types to diagnose when zoom aids or hinders localization.
- The benchmark’s rigorous methodology and detailed analysis offer actionable insights for developing adaptive zoom-aware training and inference strategies.
GUIZoom-Bench is a systematic benchmark designed to quantify and characterize the impact of zooming as a prior in graphical user interface (GUI) grounding tasks. Developed from the need to move beyond ad-hoc use of zoom-in and progressive cropping in GUI grounding, GUIZoom-Bench provides rigorous protocols, metrics, and behavioral categorizations. Its primary purpose is to enable researchers to diagnose, interpret, and compare model adaptability to zoom at both per-sample and population levels, thereby informing the design of advanced zoom-aware training and test-time inference strategies (Jiang et al., 5 Dec 2025).
1. Motivation and Conceptual Foundations
The under-explored potential of zoom in GUI grounding has been recognized, yet until the advent of GUIZoom-Bench there was no framework for systematically measuring when zoom benefits or harms model performance. Standard benchmarks could not answer fundamental questions: when zoom facilitates localization, when it degrades localization by removing necessary context or introducing distractors, and to what extent performance can be maximized through aggressive zoom application.
GUIZoom-Bench is intended to:
- Quantify per-sample “first correctness depth” (the number of zoom steps until the model first localizes the target element correctly).
- Measure “stability under further zoom” (whether correct localization persists or if the model becomes misled at deeper zooms).
- Partition samples into interpretable categories that reveal “difficulty × reliability” trade-offs.
- Serve as a diagnostic tool for evaluating zoom-aware training methods and test-time scaling techniques.
2. Benchmark Construction and Dataset Specification
GUIZoom-Bench is reorganized from the ScreenSpot-Pro test set, itself comprising approximately 3,000 high-resolution professional desktop screenshots annotated with fine-grained GUI element bounding boxes and natural language queries. The fixed zooming schedule and cropping protocol eliminate confounds, ensuring comparability across models (sketched in code after the list):
- Pre-zoom configuration: A 2×2 grid (K=4) over the full screenshot initializes the pre-zoom phase.
- Zoom depth: Each sample undergoes up to T=4 zoom iterations.
- Shrink ratio: At each zoom, the viewport contracts by ρ=0.5, halving both width and height.
- Minimum crop size: No crop dimension falls below m=768 pixels.
- Boundary handling: A “shift” strategy translates out-of-bounds crops back inside the image rather than resizing.
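Taken together, these settings fully determine the crop geometry at every step. The following is a minimal sketch, assuming a simple `Viewport` record and a `zoom_step` helper (names and signatures are illustrative, not from the paper), of how the shrink ratio, minimum crop size, and shift-based boundary handling interact:

```python
from dataclasses import dataclass

@dataclass
class Viewport:
    x0: float  # left edge, in original-screenshot pixels
    y0: float  # top edge, in original-screenshot pixels
    w: float   # crop width
    h: float   # crop height

def zoom_step(vp: Viewport, cx: float, cy: float,
              img_w: float, img_h: float,
              rho: float = 0.5, min_size: float = 768.0) -> Viewport:
    """One zoom iteration: shrink the viewport around the click (cx, cy),
    given in absolute pixel coordinates of the original screenshot."""
    # Contract width and height by rho, but never below the minimum crop size
    # (or below the image size itself, whichever is smaller).
    new_w = max(vp.w * rho, min(min_size, img_w))
    new_h = max(vp.h * rho, min(min_size, img_h))
    # Center the new crop on the click.
    x0 = cx - new_w / 2.0
    y0 = cy - new_h / 2.0
    # "Shift" boundary handling: translate out-of-bounds crops back inside
    # the image instead of resizing them.
    x0 = min(max(x0, 0.0), img_w - new_w)
    y0 = min(max(y0, 0.0), img_h - new_h)
    return Viewport(x0, y0, new_w, new_h)
```

Repeating `zoom_step` four times reproduces the fixed T=4 schedule used throughout the benchmark.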
Sample categories emerge from execution of a strong zoom-in pipeline (UI-Venus-72B with ZoomClick) across the dataset with the prescribed 4-step schedule. For each sample, a correctness vector $c = (c_1, \dots, c_4)$ is recorded, where $c_t = 1$ if the click at depth $t$ lands inside the ground-truth bounding box, and $0$ otherwise.
The axis of difficulty is defined by the iteration of first correctness (first $1$ at $t = 1$ is easy, at $t \geq 2$ is hard), while reliability is determined by sustained correctness in subsequent zooms (remaining $1$ thereafter is normal; any later $0$ indicates the model was misled). Crossing these two axes, together with the never-correct case, yields five behavioral categories (counts and percentages follow):
| Category | Count | Percentage |
|---|---|---|
| easy_normal | 688 | 23% |
| easy_mislead | 512 | 17% |
| hard_normal | 914 | 30% |
| hard_mislead | 704 | 23% |
| hard_est | 182 | 6% |
3. Task Definitions and Experimental Protocols
GUIZoom-Bench retains the canonical GUI grounding task: given a screenshot and a language query, a grounding model predicts a normalized point $(\hat{x}, \hat{y}) \in [0,1]^2$ in each zoomed viewport. The output is mapped into absolute pixel coordinates within the original screenshot via the viewport coordinates $(x_0, y_0, w, h)$: $(X, Y) = (x_0 + \hat{x}\,w,\ y_0 + \hat{y}\,h)$.
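A concrete form of this mapping, as a short sketch with assumed names (not the paper's code):

```python
def to_absolute(px: float, py: float,
                x0: float, y0: float, w: float, h: float) -> tuple[float, float]:
    """Map a normalized click (px, py) in [0, 1]^2, predicted on a zoomed
    viewport with top-left corner (x0, y0) and size (w, h), back to absolute
    pixel coordinates in the original screenshot."""
    return x0 + px * w, y0 + py * h
```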
The standardized test-time protocol is as follows (a runnable sketch appears after the list):
- Start with the full-image viewport, i.e., the entire screenshot.
- Pre-zoom: run the model on the full image and its four non-overlapping quadrants. Select either the global or a quadrant prediction according to whether any quadrant's click falls within a fixed pixel tolerance of the full-image click.
- Iteratively crop the viewport with shrink ratio $0.5$, ensuring no crop dimension falls below $768$ pixels; shift crops back within bounds as necessary.
- Repeat for up to $T = 4$ zoom steps, recording the correctness vector $c = (c_1, \dots, c_4)$.
- Assign each sample to one of the five behavioral categories based on observed correctness sequence.
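The following sketch puts the protocol together for one sample. It assumes a NumPy-array screenshot, a `model(view, query) -> (px, py)` callable returning normalized clicks, and a placeholder pixel tolerance for the pre-zoom agreement check; the exact selection rule and tolerance value are assumptions, not taken from the benchmark specification.

```python
import numpy as np

def run_protocol(model, image: np.ndarray, query: str, gt_box,
                 T: int = 4, rho: float = 0.5,
                 min_size: float = 768.0, prezoom_tol: float = 50.0):
    """Run the pre-zoom + iterative-zoom protocol for one sample and return
    the correctness vector c = (c_1, ..., c_T)."""
    img_h, img_w = image.shape[:2]

    def predict(x0, y0, w, h):
        # Crop the viewport, query the model, and map the normalized click
        # back to absolute screenshot coordinates.
        view = image[int(y0):int(y0 + h), int(x0):int(x0 + w)]
        px, py = model(view, query)
        return x0 + px * w, y0 + py * h

    # Pre-zoom: full image plus the four non-overlapping quadrants (2x2 grid).
    gx, gy = predict(0.0, 0.0, img_w, img_h)
    cx, cy = gx, gy
    qw, qh = img_w / 2.0, img_h / 2.0
    for qx0 in (0.0, qw):
        for qy0 in (0.0, qh):
            ax, ay = predict(qx0, qy0, qw, qh)
            # Placeholder agreement rule: adopt a quadrant click only if it
            # lies close to the full-image click (tolerance is assumed).
            if abs(ax - gx) <= prezoom_tol and abs(ay - gy) <= prezoom_tol:
                cx, cy = ax, ay

    # Iterative zoom: shrink by rho, respect the minimum crop size, and shift
    # out-of-bounds crops back inside the image.
    w, h = float(img_w), float(img_h)
    correctness = []
    for _ in range(T):
        w = max(w * rho, min(min_size, img_w))
        h = max(h * rho, min(min_size, img_h))
        x0 = min(max(cx - w / 2.0, 0.0), img_w - w)
        y0 = min(max(cy - h / 2.0, 0.0), img_h - h)
        cx, cy = predict(x0, y0, w, h)
        x1, y1, x2, y2 = gt_box  # ground-truth box in absolute pixels
        correctness.append(int(x1 <= cx <= x2 and y1 <= cy <= y2))
    return correctness
```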
4. Behavioral Analysis and Category Taxonomy
GUIZoom-Bench partitions the entire benchmark into five interpretable categories according to the sequence of correctness outcomes across depths (see the assignment sketch after the list):
- easy_normal: The model is correct at the first iteration and remains correct throughout all subsequent zooms ($c_1 = 1$ and $c_t = 1$ for all $t$).
- easy_mislead: The model is correct at the start, but becomes incorrect at one or more deeper zooms ($c_1 = 1$, some later $c_t = 0$).
- hard_normal: The first correct localization occurs at depth $t \geq 2$ and remains correct thereafter ($c_1 = 0$, first $1$ at $t \geq 2$, no subsequent $0$s).
- hard_mislead: The first correct localization occurs at depth $t \geq 2$ but is subsequently lost at deeper zooms ($c_1 = 0$, first $1$ at $t \geq 2$, followed by a later $0$).
- hard_est: The model never localizes correctly up to the maximum zoom depth ($c_t = 0$ for all $t$).
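A minimal assignment sketch, assuming the correctness vectors defined in Section 2 (the function name is illustrative):

```python
def categorize(c: list[int]) -> str:
    """Map a correctness vector c = (c_1, ..., c_T) to one of the five
    behavioral categories."""
    if all(v == 0 for v in c):
        return "hard_est"                        # never correct at any depth
    first = c.index(1)                           # 0-based index of first hit
    stable = all(v == 1 for v in c[first:])      # no miss after the first hit
    if first == 0:                               # correct already at depth 1
        return "easy_normal" if stable else "easy_mislead"
    return "hard_normal" if stable else "hard_mislead"

# Example: first correct at depth 2 and stable afterwards -> hard_normal.
assert categorize([0, 1, 1, 1]) == "hard_normal"
```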
This behavioral taxonomy exposes difficulty-reliability trade-offs and pinpoints both headroom and pathological regimes for model improvement.
5. Evaluation Metrics
The benchmark deploys three main metric families (a brief code sketch follows the list):
- Per-iteration correctness: $c_t \in \{0, 1\}$, with $c_t = 1$ indicating that the predicted click at depth $t$ falls inside the ground-truth bounding box.
- Success Rate (grounding accuracy): $\mathrm{SR}_t = \frac{1}{N} \sum_{i=1}^{N} c_t^{(i)}$, the fraction of the $N$ benchmark samples whose click at depth $t$ is correct. Success rates are measured per depth ($t = 1, \dots, 4$) and overall (mean across depths, or the single-shot baseline).
- Intersection-over-Union (IoU): $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ for a predicted box $A$ and ground-truth box $B$. IoU is not required by GUIZoom-Bench but is available for further analysis, e.g., under a thresholded IoU criterion, for compatibility with existing benchmarks.
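A brief sketch of these metrics on top of per-sample correctness vectors, with an IoU helper for the optional box-overlap criterion (function names and the (x1, y1, x2, y2) box format are assumptions):

```python
def success_rate_at_depth(correctness_vectors: list[list[int]], t: int) -> float:
    """Fraction of samples whose click at depth t (1-indexed) lies inside the
    ground-truth bounding box."""
    return sum(c[t - 1] for c in correctness_vectors) / len(correctness_vectors)

def iou(a: tuple[float, float, float, float],
        b: tuple[float, float, float, float]) -> float:
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```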
Behavioral breakdowns report per-category success rates at each depth, illustrating characteristic transitions, e.g., large accuracy boosts between depths 1 and 2 for hard_normal samples, or drop-offs after depth 1 in easy_mislead cases.
6. Baseline Results and Interpretive Insights
Benchmark results enable direct, depth-by-depth, and category-specific comparisons among models such as Qwen3-VL-32B, UI-Venus-7B, and UI-Venus-72B. Per-category success rates (in %, formatted as depth 1/2/3/4) are summarized for each model:
| Model | easy_normal | easy_mislead | hard_normal | hard_mislead | hard_est |
|---|---|---|---|---|---|
| Qwen3-VL-32B | 56/92/92/92 | 24/36/45/48 | 30/70/67/67 | 22/46/46/54 | 10/27/26/25 |
| UI-Venus-7B | 73/84/86/86 | 31/38/40/43 | 28/47/55/58 | 19/35/35/30 | 9/15/16/16 |
| UI-Venus-72B | 82/92/93/92 | 34/51/70/85 | 65/85/76/72 | 27/45/46/54 | 13/27/26/25 |
Deep zoom notably benefits hard_normal samples (e.g., UI-Venus-72B: 65% → 85% between depths 1 and 2), confirming that zooming facilitates accurate localization of small or visually crowded targets. Specialized UI models (UI-Venus) often plateau after zoom step 2–3 and can be misled on easy samples by loss of contextual cues. General-purpose vision-LLMs (e.g., Qwen3-VL-32B) tend to improve until depth 3 and degrade only modestly at maximal zoom, reflecting stronger priors for global context. The hard_est regime, in which no correct localization is achieved at any zoom, remains below 20% of samples across all models, defining the current upper limit for zoom-only test-time improvement.
7. Implications and Future Directions
GUIZoom-Bench highlights key areas for research advancement:
- The systematic exposure of when and why zooming aids, impairs, or fails to impact GUI grounding signals crucial directions for zoom-aware training protocols, such as multi-resolution supervision and dynamic resizing, to reduce the hard_est regime.
- The prevalence of mislead events at higher zoom levels suggests a need for learned stopping rules or adaptive shrink ratios instead of static zoom schedules.
- Persistent failures even with correct visual evidence (notably for relational queries using terms such as “first” or “last”) imply the benefit of augmenting linguistic reasoning capabilities to further enhance GUI agent performance.
GUIZoom-Bench establishes zooming not as a black-box heuristic but as a quantifiable, interpretable dimension for GUI grounding model evaluation and optimization, enabling principled progress in both model architecture and training strategy research (Jiang et al., 5 Dec 2025).