GBT: GUI Grounding Evaluation Toolkit
- GBT is a modular toolkit for reproducible evaluation of GUI grounding pipelines, decoupling ROI deduction from fine-grained element localization.
- It provides plug-and-play agent interfaces and systematic hyperparameter sweeps, with built-in support for key benchmarks such as ScreenSpot-Pro and OSWorld-G.
- The toolkit integrates standardized data processing, comprehensive metric computation, and clear reporting to facilitate rigorous comparisons and extensibility.
The Grounding Benchmark Toolkit (GBT) is a modular, extensible suite for reproducible, end-to-end evaluation of GUI element grounding systems. Designed to fill the need for standardized, open, and rigorous benchmarking in natural language–to–GUI mapping, GBT operationalizes a two-stage pipeline—Region-of-Interest (ROI) deduction followed by fine-grained element grounding—integrating flexible agent interfaces, systematic hyperparameter sweeps, and comprehensive metric computation. With out-of-the-box support for the ScreenSpot-Pro and OSWorld-G benchmarks, as well as utility code for experimentation and extension, GBT enables systematic development, comparison, and integration of advanced grounding strategies and VLM architectures (Kwak et al., 17 Nov 2025).
1. High-Level Design and Architecture
GBT is architected to provide standardized tooling and modular interfaces for the evaluation of GUI grounding pipelines. The toolkit is strictly an evaluation harness—no training modules are present; all benchmarks are fixed test sets, with agents acting as black-box APIs or models.
Primary goals:
- Provide canonical data handling, preprocessing, metrics, and reporting for GUI grounding.
- Support rigorous evaluation across data, models, and hyperparameter configurations.
- Decouple architectural stages: Stage 1 (ROI deduction), Stage 2 (element localization), and auxiliary agents (refusal, rewriting, scaling).
- Systematize experimentation over ROI sizes/strategies, upscaling factors, and instruction-rewriting schemas.
Core Components:
| Module | Purpose | Key Features |
|---|---|---|
| DataLoader | Dataset ingestion, partitioning | Handles bounding-box and polygonal annotations |
| Preprocessor | ROI cropping, bidirectional zoom, upscaling | Algorithm 1 implementation |
| Agent Interface | Standard method signatures for agent predictions | Plug-and-play APIs |
| MetricCalculator | Localization/task metrics implementation | Accuracy, IoU, etc. |
| ExperimentRunner | Pipeline orchestration, grid-search, sweeps | Staged workflows |
| ReportGenerator | Export/visualize results, plot curves | CSV/JSON, plotting tools |
Interaction Flow:
- Dataset load (DataLoader)
- (Optional) infeasibility filtering (RefusalAgent, OSWorld-G only)
- ROI prediction (ROIAgent)
- (Optional) ROI upscaling (ScaleAgent)
- (Optional) instruction rewriting (RewriteAgent)
- Fine-grained point prediction (GroundingAgent)
- Metric computation (MetricCalculator)
- Aggregation and export (ReportGenerator) (Kwak et al., 17 Nov 2025).
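The staged flow above can be expressed as a thin orchestration loop. The following sketch is illustrative only: the method names (`predict`, `predict_roi`, `predict_point`) follow the interfaces described in this article, not a verified GBT API, and the coordinate bookkeeping is an assumption.

```python
# Illustrative orchestration of the GBT interaction flow (hypothetical API).
def run_pipeline(sample, roi_agent, grounding_agent,
                 refuser=None, scaler=None, rewriter=None, scale_factor=1):
    instruction, image = sample["instruction"], sample["image"]

    # (Optional) infeasibility filtering (OSWorld-G only); assumes a
    # boolean refusal decision.
    if refuser is not None and refuser.predict(instruction, image):
        return {"refused": True}

    # Stage 1: ROI deduction.
    crop, bb = roi_agent.predict_roi(instruction, image)

    # (Optional) ROI upscaling and instruction rewriting.
    if scaler is not None:
        crop = scaler.predict(crop)  # scale_factor must match the scaler
    if rewriter is not None:
        instruction = rewriter.predict(instruction, crop)

    # Stage 2: point prediction in crop coordinates, mapped back to
    # full-image coordinates (undoing the upscaling factor).
    x, y = grounding_agent.predict_point(instruction, crop)
    return {"refused": False,
            "point": (bb[0] + x / scale_factor, bb[1] + y / scale_factor)}
```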
2. Supported Datasets, Preprocessing, and Agent Protocol
GBT provides standardized access and preprocessing for two core GUI grounding datasets:
- ScreenSpot-Pro (SSP):
  - 4K-resolution images of professional UIs
  - 1,581 samples with axis-aligned bounding box annotations
- OSWorld-G (OSG):
  - 1080p OS interface screenshots
  - 564 instructions (510 feasible, 54 infeasible) with polygonal annotations
Dataset and Preprocessing Steps:
- Polygons are converted to minimal enclosing rectangles for IoU computation (a minimal sketch follows this list).
- All images are normalized to a canonical resolution or, optionally, to model-specific input conventions.
- Optional precomputation of Stage 1 ROI crops to accelerate repeated Stage 2 evaluations (Kwak et al., 17 Nov 2025).
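The polygon-to-rectangle conversion reduces each polygon to the extremes of its vertex coordinates. A minimal sketch (the function name is ours, not from the toolkit):

```python
# Minimal sketch: convert a polygonal annotation to its minimal
# axis-aligned enclosing rectangle for IoU computation.
def polygon_to_bbox(polygon):
    """polygon: list of (x, y) vertices → (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))
```

For example, `polygon_to_bbox([(10, 5), (40, 8), (35, 30)])` returns `(10, 5, 40, 30)`.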
Agent Interface:
Agents are defined via Python base classes exposing .predict(instruction, image) → (x, y) or a refusal decision. Plug-in adapters exist for a wide range of VLMs and LLMs, such as GPT-4o, Gemini 2.5 Pro, UI-TARS, Qwen-VL, and others. ROIAgents, GroundingAgents, ScaleAgents, RewriteAgents, and RefusalAgents can be composed into multi-stage, arbitrary-length pipelines (Kwak et al., 17 Nov 2025).
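A minimal sketch of this protocol, assuming the base-class shape described above; the `VLMAdapter` class and its `client.locate` call are hypothetical placeholders for a model-specific adapter:

```python
from abc import ABC, abstractmethod

class GroundingAgent(ABC):
    """Illustrative agent base class; GBT's actual class names may differ."""

    @abstractmethod
    def predict(self, instruction: str, image) -> tuple[float, float]:
        """Return a predicted click point (x, y) in image coordinates."""

class VLMAdapter(GroundingAgent):
    """Hypothetical plug-in adapter wrapping an arbitrary VLM client."""

    def __init__(self, client):
        self.client = client  # e.g., a GPT-4o or Qwen-VL wrapper

    def predict(self, instruction, image):
        # Model-specific prompt construction and output parsing elided.
        return self.client.locate(instruction, image)
```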
3. Metrics and Result Computation
GBT implements a suite of canonical metrics, enabling transparent and nuanced evaluation over all pipeline stages. All metrics are computed using ground-truth references and GBT’s explicit rules for correctness and overlap resolution.
Primary Metrics:
| Metric | Formula |
|---|---|
| Intersection-over-Union (IoU) | $\mathrm{IoU} = \frac{\lvert B_{\mathrm{pred}} \cap B_{\mathrm{gt}} \rvert}{\lvert B_{\mathrm{pred}} \cup B_{\mathrm{gt}} \rvert}$ |
| Localization Accuracy (Top-1) | $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ (x_i, y_i) \in B_i^{\mathrm{gt}} \right]$ |
| Containment Rate (Stage 1) | $\mathrm{CR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ B_i^{\mathrm{gt}} \subseteq R_i \right]$ |
| Composite Score | End-to-end score combining Stage 1 containment with Stage 2 localization accuracy |
| Refusal Accuracy (OSG only) | $\mathrm{RA} = \frac{\#\,\text{infeasible correctly refused}}{\#\,\text{infeasible}}$ |
| False Positive Rate (FPR) | $\mathrm{FPR} = \frac{\#\,\text{feasible incorrectly refused}}{\#\,\text{feasible}}$ |
Ambiguity and overlap are handled by matching the click point to the instruction’s ground-truth label. Full per-sample per-stage latency is logged for API and inference timing (Kwak et al., 17 Nov 2025).
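For concreteness, reference implementations of the two core geometric metrics might look as follows. This is a sketch under standard definitions; GBT's internal metric API is not shown in the source.

```python
# Sketch of the canonical geometric metrics over axis-aligned boxes
# given as (x_min, y_min, x_max, y_max).
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def top1_accuracy(points, gt_boxes):
    """Fraction of predicted click points falling inside their GT box."""
    hits = sum(
        bx[0] <= x <= bx[2] and bx[1] <= y <= bx[3]
        for (x, y), bx in zip(points, gt_boxes)
    )
    return hits / len(points)
```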
4. Experimental Protocol and Usage
All experiments in GBT are strictly evaluation-only, not involving further training or data splitting.
- Datasets: Fixed test sets with no train/validation splits.
- Models: Black-box (cloud APIs or local) agents.
- Hyperparameters: ROI zoom deltas $\delta_{\mathrm{in}}, \delta_{\mathrm{out}}$, maximum zoom steps $E_{\max}$, minimum ROI size $S_{\min}$, upscaling factor (default bicubic), rewrite strategies (e.g., “Context-Aware” prompts); a sweep sketch follows this list.
- Ambiguity handling: Overlapping elements resolved by referent label.
- Latency: Full API call timing per stage is recorded (Kwak et al., 17 Nov 2025).
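Sweeps over these hyperparameters can be scripted against the runner. The sketch below mirrors the 400–1800 px ROI-size grid search reported in Section 6; `BenchmarkRunner`/`Agents` usage follows the Python API example under Sample Usage, other zoom parameters are left at assumed defaults, and the sweep helper itself is ours:

```python
# Illustrative grid search over minimum ROI size (hypothetical wiring).
from gbt import BenchmarkRunner, Agents  # assumed import path

best = None
for roi_px in range(400, 1801, 200):
    runner = BenchmarkRunner(
        dataset="ScreenSpot-Pro",
        roi_agent=Agents.BidirectionalZoom(S_min=roi_px**2),
        grounder=Agents.UITARS72B(local=True),
    )
    acc = runner.run().accuracy
    if best is None or acc > best[1]:
        best = (roi_px, acc)
print("Best ROI size:", best)
```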
Sample Usage
Command-Line:
```bash
python -m gbt.run_evaluation \
    --dataset ScreenSpot-Pro \
    --roi_zoom bidirectional \
    --roi_size 1000 \
    --scale 3 \
    --rewrite context_aware \
    --grounder ui-tars-72b \
    --output results/ssp_ui72b.json
```
Python API:
```python
from gbt import BenchmarkRunner, Agents

runner = BenchmarkRunner(
    dataset="OSWorld-G",
    roi_agent=Agents.BidirectionalZoom(
        delta_in=0.1, delta_out=0.05, E_max=5,
        S_min=1000**2, N_stable=3, eps=50,
    ),
    scaler=Agents.BicubicScaler(factor=3),
    rewriter=Agents.ContextAwareRewriter(),
    grounder=Agents.UITARS72B(local=True),
    refuser=Agents.AdvancedRefuser(),
)
results = runner.run()
print("Accuracy:", results.accuracy)
results.to_csv("osg_report.csv")
```
5. Extensibility and Integration
GBT is designed for maximal extensibility via configuration files (YAML/JSON) and clean Pythonic interfaces for agents. New models, agents, or strategies can be integrated by implementing standard .predict methods. Multi-stage pipelines are supported, with arbitrary stage depth.
- Integration with MEGA-GUI: MEGA-GUI’s multi-agent pipeline instantiates custom ROIAgents (BidirectionalZoom), ScaleAgents (ConservativeScale), and RewriteAgents (ContextAwareRewrite), using GBT as the backbone evaluation suite.
- Example agent interfaces:
```python
class ROIAgent:
    def predict_roi(self, instruction, image):
        ...  # → (image_crop, bb)

class GroundingAgent:
    def predict_point(self, instruction, image_crop):
        ...  # → (x, y)
```
- Any researcher can swap in new VLMs (e.g., adapters for Llama, Gemini, or custom models), control all experiment parameters, and extend to new benchmarks by providing dataset handlers and annotation schemas (Kwak et al., 17 Nov 2025).
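As an example of the last point, a new benchmark can be wired in with a small handler that yields samples in the canonical schema. Everything below (class name, JSON field names) is an illustrative assumption, mirroring the ScreenSpot-Pro/OSWorld-G handlers described above:

```python
import json

class MyBenchmarkLoader:
    """Hypothetical dataset handler for a new benchmark."""

    def __init__(self, annotation_file):
        with open(annotation_file) as f:
            self.samples = json.load(f)

    def __iter__(self):
        # Yield samples in the canonical schema the pipeline expects.
        for s in self.samples:
            yield {
                "instruction": s["instruction"],
                "image": s["image_path"],
                "bbox": s["bbox"],          # axis-aligned ground truth
                "feasible": s.get("feasible", True),
            }
```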
6. Quantitative Results and Ablation Findings
GBT enables fully reproducible experiments and thorough ablation. The MEGA-GUI reference pipeline (Gemini 2.5 Pro for bidirectional zoom, scaling, GPT-4o rewriter, UI-TARS-72B grounder) achieves:
- ScreenSpot-Pro: 73.18% Top-1 accuracy
- OSWorld-G: 68.63% Top-1 accuracy
Ablation studies demonstrate:
- Optimal ROI size of 1000 px, found via grid search (400–1800 px).
- Rewrite prompt variants yielding +6 percentage-point gain with structured output.
- Upscaling from 1× to 3× provides +2.4 percentage-point accuracy gain.
- Pruning rates modulate step count vs. pass rate.
- Running the full two-stage pipeline consistently outperforms monolithic approaches (Kwak et al., 17 Nov 2025).
7. Impact and Future Directions
GBT establishes a new standard for GUI grounding evaluations, emphasizing modularity, reproducibility, and clarity of metric definitions. The decoupled pipeline and agent-based architecture allow systematic study of where grounding models fail, how auxiliary modules (rewriters, scalers) contribute, and the role of context in ambiguous instructions. By enabling plug-and-play integration, GBT is positioned to accelerate research in multimodal grounding, VLM benchmarking, and downstream applications such as autonomous agents, accessibility, and GUI automation.
A plausible implication is that similar toolkit architectures—comprising canonical data pipelines, explicit agent interfaces, and fine-grained staged evaluation—will be essential across related language grounding domains, especially as the multi-agent modular paradigm supersedes monolithic VLM evaluation (Kwak et al., 17 Nov 2025).