GBT: GUI Grounding Evaluation Toolkit
- GBT is a modular toolkit for reproducible evaluation of GUI grounding pipelines, decoupling ROI deduction from fine-grained element localization.
- It provides plug-and-play agent interfaces and systematic hyperparameter sweeps, with built-in support for key benchmarks such as ScreenSpot-Pro and OSWorld-G.
- The toolkit integrates standardized data processing, comprehensive metric computation, and clear reporting to facilitate rigorous comparisons and extensibility.
The Grounding Benchmark Toolkit (GBT) is a modular, extensible suite for reproducible, end-to-end evaluation of GUI element grounding systems. Designed to fill the need for standardized, open, and rigorous benchmarking in natural language–to–GUI mapping, GBT operationalizes a two-stage pipeline—Region-of-Interest (ROI) deduction followed by fine-grained element grounding—integrating flexible agent interfaces, systematic hyperparameter sweeps, and comprehensive metric computation. With out-of-the-box support for the ScreenSpot-Pro and OSWorld-G benchmarks, as well as utility code for experimentation and extension, GBT enables systematic development, comparison, and integration of advanced grounding strategies and VLM architectures (Kwak et al., 17 Nov 2025).
1. High-Level Design and Architecture
GBT is architected to provide standardized tooling and modular interfaces for the evaluation of GUI grounding pipelines. The toolkit is strictly an evaluation harness—no training modules are present; all benchmarks are fixed test sets, with agents acting as black-box APIs or models.
Primary goals:
- Provide canonical data handling, preprocessing, metrics, and reporting for GUI grounding.
- Support rigorous evaluation across data, models, and hyperparameter configurations.
- Decouple architectural stages: Stage 1 (ROI deduction), Stage 2 (element localization), and auxiliary agents (refusal, rewriting, scaling).
- Systematize experimentation over ROI sizes/strategies, upscaling factors, and instruction-rewriting schemas.
Core Components:
| Module | Purpose | Key Features |
|---|---|---|
| DataLoader | Dataset ingestion, partitioning | Handles bounding-box and polygonal annotations |
| Preprocessor | ROI cropping, bidirectional zoom, upscaling | Algorithm 1 implementation |
| Agent Interface | Standard method signatures for agent predictions | Plug-and-play APIs |
| MetricCalculator | Localization/task metrics implementation | Accuracy, IoU, etc. |
| ExperimentRunner | Pipeline orchestration, grid-search, sweeps | Staged workflows |
| ReportGenerator | Export/visualize results, plot curves | CSV/JSON, plotting tools |
Interaction Flow:
- Dataset load (DataLoader)
- (Optional) infeasibility filtering (RefusalAgent, OSWorld-G only)
- ROI prediction (ROIAgent)
- (Optional) ROI upscaling (ScaleAgent)
- (Optional) instruction rewriting (RewriteAgent)
- Fine-grained point prediction (GroundingAgent)
- Metric computation (MetricCalculator)
- Aggregation and export (ReportGenerator) (Kwak et al., 17 Nov 2025).
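The staged flow above can be expressed as a thin orchestration loop. The following sketch is illustrative only: the method names (`predict`, `predict_roi`, `predict_point`) follow the interfaces described in this article, not a verified GBT API, and the coordinate bookkeeping is an assumption.

```python
# Illustrative orchestration of the GBT interaction flow (hypothetical API).
def run_pipeline(sample, roi_agent, grounding_agent,
                 refuser=None, scaler=None, rewriter=None, scale_factor=1):
    instruction, image = sample["instruction"], sample["image"]

    # (Optional) infeasibility filtering (OSWorld-G only); assumes a
    # boolean refusal decision.
    if refuser is not None and refuser.predict(instruction, image):
        return {"refused": True}

    # Stage 1: ROI deduction.
    crop, bb = roi_agent.predict_roi(instruction, image)

    # (Optional) ROI upscaling and instruction rewriting.
    if scaler is not None:
        crop = scaler.predict(crop)  # scale_factor must match the scaler
    if rewriter is not None:
        instruction = rewriter.predict(instruction, crop)

    # Stage 2: point prediction in crop coordinates, mapped back to
    # full-image coordinates (undoing the upscaling factor).
    x, y = grounding_agent.predict_point(instruction, crop)
    return {"refused": False,
            "point": (bb[0] + x / scale_factor, bb[1] + y / scale_factor)}
```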
2. Supported Datasets, Preprocessing, and Agent Protocol
GBT provides standardized access and preprocessing for two core GUI grounding datasets:
- ScreenSpot-Pro (SSP):
  - 4K-resolution images of professional UIs
  - 1,581 samples with axis-aligned bounding box annotations
- OSWorld-G (OSG):
  - 1080p OS interface screenshots
  - 564 instructions (510 feasible, 54 infeasible) with polygonal annotations
Dataset and Preprocessing Steps:
- Polygons are converted to minimal enclosing rectangles for IoU computation (a minimal sketch follows this list).
- All images are normalized to a canonical resolution or, optionally, to model-specific input conventions.
- Optional precomputation of Stage 1 ROI crops to accelerate repeated Stage 2 evaluations (Kwak et al., 17 Nov 2025).
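The polygon-to-rectangle conversion reduces each polygon to the extremes of its vertex coordinates. A minimal sketch (the function name is ours, not from the toolkit):

```python
# Minimal sketch: convert a polygonal annotation to its minimal
# axis-aligned enclosing rectangle for IoU computation.
def polygon_to_bbox(polygon):
    """polygon: list of (x, y) vertices → (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))
```

For example, `polygon_to_bbox([(10, 5), (40, 8), (35, 30)])` returns `(10, 5, 40, 30)`.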
Agent Interface:
Agents are defined via Python base classes exposing .predict(instruction, image) → (x, y) or a refusal decision. Plug-in adapters exist for a wide range of VLMs and LLMs, such as GPT-4o, Gemini 2.5 Pro, UI-TARS, Qwen-VL, and others. ROIAgents, GroundingAgents, ScaleAgents, RewriteAgents, and RefusalAgents can be composed into multi-stage, arbitrary-length pipelines (Kwak et al., 17 Nov 2025).
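A minimal sketch of this protocol, assuming the base-class shape described above; the `VLMAdapter` class and its `client.locate` call are hypothetical placeholders for a model-specific adapter:

```python
from abc import ABC, abstractmethod

class GroundingAgent(ABC):
    """Illustrative agent base class; GBT's actual class names may differ."""

    @abstractmethod
    def predict(self, instruction: str, image) -> tuple[float, float]:
        """Return a predicted click point (x, y) in image coordinates."""

class VLMAdapter(GroundingAgent):
    """Hypothetical plug-in adapter wrapping an arbitrary VLM client."""

    def __init__(self, client):
        self.client = client  # e.g., a GPT-4o or Qwen-VL wrapper

    def predict(self, instruction, image):
        # Model-specific prompt construction and output parsing elided.
        return self.client.locate(instruction, image)
```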
3. Metrics and Result Computation
GBT implements a suite of canonical metrics, enabling transparent and nuanced evaluation over all pipeline stages. All metrics are computed using ground-truth references and GBT’s explicit rules for correctness and overlap resolution.
Primary Metrics:
| Metric | Formula |
|---|---|
| Intersection-over-Union (IoU) | $\mathrm{IoU} = \frac{\lvert B_{\mathrm{pred}} \cap B_{\mathrm{gt}} \rvert}{\lvert B_{\mathrm{pred}} \cup B_{\mathrm{gt}} \rvert}$ |
| Localization Accuracy (Top-1) | $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ (x_i, y_i) \in B_i^{\mathrm{gt}} \right]$ |
| Containment Rate (Stage 1) | $\mathrm{CR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ B_i^{\mathrm{gt}} \subseteq R_i \right]$ |
| Composite Score | End-to-end score combining Stage 1 containment with Stage 2 localization accuracy |
| Refusal Accuracy (OSG only) | $\mathrm{RA} = \frac{\#\,\text{infeasible correctly refused}}{\#\,\text{infeasible}}$ |
| False Positive Rate (FPR) | $\mathrm{FPR} = \frac{\#\,\text{feasible incorrectly refused}}{\#\,\text{feasible}}$ |
Ambiguity and overlap are handled by matching the click point to the instruction’s ground-truth label. Full per-sample per-stage latency is logged for API and inference timing (Kwak et al., 17 Nov 2025).
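For concreteness, reference implementations of the two core geometric metrics might look as follows. This is a sketch under standard definitions; GBT's internal metric API is not shown in the source.

```python
# Sketch of the canonical geometric metrics over axis-aligned boxes
# given as (x_min, y_min, x_max, y_max).
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def top1_accuracy(points, gt_boxes):
    """Fraction of predicted click points falling inside their GT box."""
    hits = sum(
        bx[0] <= x <= bx[2] and bx[1] <= y <= bx[3]
        for (x, y), bx in zip(points, gt_boxes)
    )
    return hits / len(points)
```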
4. Experimental Protocol and Usage
All experiments in GBT are strictly evaluation-only, not involving further training or data splitting.
- Datasets: Fixed test sets with no train/validation splits.
- Models: Black-box (cloud APIs or local) agents.
- Hyperparameters: ROI zoom deltas $\delta_{\mathrm{in}}, \delta_{\mathrm{out}}$, maximum zoom steps $E_{\max}$, minimum ROI size $S_{\min}$, upscaling factor (default bicubic), rewrite strategies (e.g., “Context-Aware” prompts); a sweep sketch follows this list.
- Ambiguity handling: Overlapping elements resolved by referent label.
- Latency: Full API call timing per stage is recorded (Kwak et al., 17 Nov 2025).
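Sweeps over these hyperparameters can be scripted against the runner. The sketch below mirrors the 400–1800 px ROI-size grid search reported in Section 6; `BenchmarkRunner`/`Agents` usage follows the Python API example under Sample Usage, other zoom parameters are left at assumed defaults, and the sweep helper itself is ours:

```python
# Illustrative grid search over minimum ROI size (hypothetical wiring).
from gbt import BenchmarkRunner, Agents  # assumed import path

best = None
for roi_px in range(400, 1801, 200):
    runner = BenchmarkRunner(
        dataset="ScreenSpot-Pro",
        roi_agent=Agents.BidirectionalZoom(S_min=roi_px**2),
        grounder=Agents.UITARS72B(local=True),
    )
    acc = runner.run().accuracy
    if best is None or acc > best[1]:
        best = (roi_px, acc)
print("Best ROI size:", best)
```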
Sample Usage
Command-Line:
```bash
python -m gbt.run_evaluation \
    --dataset ScreenSpot-Pro \
    --roi_zoom bidirectional \
    --roi_size 1000 \
    --scale 3 \
    --rewrite context_aware \
    --grounder ui-tars-72b \
    --output results/ssp_ui72b.json
```
Python API:
```python
from gbt import BenchmarkRunner, Agents

runner = BenchmarkRunner(
    dataset="OSWorld-G",
    roi_agent=Agents.BidirectionalZoom(
        delta_in=0.1, delta_out=0.05, E_max=5,
        S_min=1000**2, N_stable=3, eps=50,
    ),
    scaler=Agents.BicubicScaler(factor=3),
    rewriter=Agents.ContextAwareRewriter(),
    grounder=Agents.UITARS72B(local=True),
    refuser=Agents.AdvancedRefuser(),
)
results = runner.run()
print("Accuracy:", results.accuracy)
results.to_csv("osg_report.csv")
```
5. Extensibility and Integration
GBT is designed for maximal extensibility via configuration files (YAML/JSON) and clean Pythonic interfaces for agents. New models, agents, or strategies can be integrated by implementing standard .predict methods. Multi-stage pipelines are supported, with arbitrary stage depth.
- Integration with MEGA-GUI: MEGA-GUI’s multi-agent pipeline instantiates custom ROIAgents (BidirectionalZoom), ScaleAgents (ConservativeScale), and RewriteAgents (ContextAwareRewrite), using GBT as the backbone evaluation suite.
- Example agent interfaces:
```python
class ROIAgent:
    def predict_roi(self, instruction, image):
        ...  # → (image_crop, bb)

class GroundingAgent:
    def predict_point(self, instruction, image_crop):
        ...  # → (x, y)
```
- Any researcher can swap in new VLMs (e.g., adapters for Llama, Gemini, or custom models), control all experiment parameters, and extend to new benchmarks by providing dataset handlers and annotation schemas (Kwak et al., 17 Nov 2025).
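As an example of the last point, a new benchmark can be wired in with a small handler that yields samples in the canonical schema. Everything below (class name, JSON field names) is an illustrative assumption, mirroring the ScreenSpot-Pro/OSWorld-G handlers described above:

```python
import json

class MyBenchmarkLoader:
    """Hypothetical dataset handler for a new benchmark."""

    def __init__(self, annotation_file):
        with open(annotation_file) as f:
            self.samples = json.load(f)

    def __iter__(self):
        # Yield samples in the canonical schema the pipeline expects.
        for s in self.samples:
            yield {
                "instruction": s["instruction"],
                "image": s["image_path"],
                "bbox": s["bbox"],          # axis-aligned ground truth
                "feasible": s.get("feasible", True),
            }
```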
6. Quantitative Results and Ablation Findings
GBT enables fully reproducible experiments and thorough ablation. The MEGA-GUI reference pipeline (Gemini 2.5 Pro for bidirectional zoom, scaling, GPT-4o rewriter, UI-TARS-72B grounder) achieves:
- ScreenSpot-Pro: 73.18% Top-1 accuracy
- OSWorld-G: 68.63% Top-1 accuracy
Ablation studies demonstrate:
- Optimal ROI size of 1000 px, found via grid search (400–1800 px).
- Rewrite prompt variants yielding +6 percentage-point gain with structured output.
- Upscaling from 1× to 3× provides +2.4 percentage-point accuracy gain.
- Pruning rates modulate step count vs. pass rate.
- Running the full two-stage pipeline consistently outperforms monolithic approaches (Kwak et al., 17 Nov 2025).
7. Impact and Future Directions
GBT establishes a new standard for GUI grounding evaluations, emphasizing modularity, reproducibility, and clarity of metric definitions. The decoupled pipeline and agent-based architecture allow systematic study of where grounding models fail, how auxiliary modules (rewriters, scalers) contribute, and the role of context in ambiguous instructions. By enabling plug-and-play integration, GBT is positioned to accelerate research in multimodal grounding, VLM benchmarking, and downstream applications such as autonomous agents, accessibility, and GUI automation.
A plausible implication is that similar toolkit architectures—comprising canonical data pipelines, explicit agent interfaces, and fine-grained staged evaluation—will be essential across related language grounding domains, especially as the multi-agent modular paradigm supersedes monolithic VLM evaluation (Kwak et al., 17 Nov 2025).