Grounding Benchmark Toolkit (GBT)
- The Grounding Benchmark Toolkit (GBT) is a modular, open-source evaluation suite that standardizes empirical research on natural language grounding in GUIs.
- It unifies data loading, multi-stage preprocessing, agent interfaces, metrics, and configurable workflows so that results are reproducible and directly comparable.
- Empirical evaluations with its default MEGA-GUI pipeline report strong Top-1 accuracy on grounding benchmarks and quantify pipeline trade-offs through controlled ablations.
The Grounding Benchmark Toolkit (GBT) is an open-source, modular evaluation suite designed to standardize, reproduce, and extend empirical research on vision-language and instruction-following grounding tasks. Released alongside the MEGA-GUI framework, GBT provides a reference pipeline and API for rigorous benchmarking of natural language grounding in graphical user interfaces but is architected for extensibility across domains and models. GBT is characterized by principled data handling, multi-stage modular agent design, formally defined metrics, configurable workflows, and detailed reporting to facilitate reproducibility, ablation, and fair comparison in grounding research (Kwak et al., 17 Nov 2025).
1. Purpose, Scope, and Architecture
GBT was developed to address the absence of standardized, reproducible toolchains in GUI element grounding research, where prior systems often used closed, one-off pipelines with limited extensibility and unclear evaluation criteria. Its core goals are:
- Standardization: Unified loaders, preprocessing, metrics, and report generation for GUI grounding datasets, enabling direct side-by-side comparison of models and strategies.
- Reproducibility: Fixed test splits, no training or tuning on evaluation sets, explicit agent interfaces, and logging of all configuration and latency.
- Modularity: Decoupled multi-stage pipelines (e.g., ROI deduction, scaling, rewriting, fine grounding, refusal), with pluggable model or agent backends and hyperparameter grid-search.
- Extensibility: Easily supports new models, agents, datasets, and experimental protocols via clear Python/YAML interfaces and config-driven agent registration.
The core software modules are:
| Module | Functionality | Examples |
|---|---|---|
| DataLoader | Loads and splits GUI images, instructions, and annotations | SSP (1,581), OSG (564) |
| Preprocessor | Crops/zooms ROIs, rescales ROI, and normalizes images | Bicubic upscaling, polygon→rectangle |
| Agent Interface | Abstract .predict API for all agent types | ROIAgent, GroundingAgent, RefusalAgent |
| MetricCalculator | Computes all grounding and pipeline metrics | IoU, Accuracy, Containment, FPR, F1 |
| ExperimentRunner | Orchestrates pipeline, config sweeps, logging | Stage 0 (refusal) … Stage 2 (grounding) |
| ReportGenerator | Aggregates metrics, plots trade-offs, outputs JSON/CSV | Containment vs. ROI, accuracy curves |
Each agent operates as a black-box function from instruction and image to an artifact (e.g., ROI crop, point, refusal). Pipelines can be arbitrarily deep, supporting complex cascades for advanced multi-agent or multi-strategy research.
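A minimal sketch of this black-box contract and an arbitrarily deep cascade, assuming the abstract `.predict` API named in the table above (the `Protocol` and the `cascade` helper are illustrative, not GBT's actual classes):

```python
from typing import Any, Protocol

class Agent(Protocol):
    """Black-box contract shared by all agents: (instruction, image) -> artifact."""
    def predict(self, instruction: str, image: Any) -> Any: ...

def cascade(agents: list[Agent], instruction: str, image: Any) -> Any:
    """Arbitrarily deep pipeline: each stage's artifact (ROI crop,
    rescaled crop, point, refusal) feeds the next stage."""
    artifact: Any = image
    for agent in agents:
        artifact = agent.predict(instruction, artifact)
    return artifact
```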
2. Supported Datasets, Preprocessing, and Workflow
The initial release of GBT focuses on two benchmarks:
- ScreenSpot-Pro (SSP): 4K-resolution professional GUIs with 1,581 natural language instructions, annotated with axis-aligned bounding boxes.
- OSWorld-G (OSG): 1080p OS-style GUI screenshots with 564 instructions (510 feasible + 54 infeasible), annotated with polygons.
Preprocessing routines convert polygonal ground-truth to minimal enclosing rectangles for IoU computations, normalize pixel values, and optionally precompute Stage 1 ROI crops to speed repeated runs. For overlapping or ambiguous regions, GBT consistently resolves prediction correctness via ground-truth label checks rather than spatial heuristics.
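A minimal sketch of the polygon-to-rectangle conversion described above (the function name and the (left, top, right, bottom) coordinate convention are assumptions):

```python
def polygon_to_rect(polygon):
    """Minimal axis-aligned rectangle enclosing a polygon,
    given as [(x0, y0), (x1, y1), ...] vertex pairs."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))  # (left, top, right, bottom)
```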
The workflow comprises the following steps (a code sketch follows the list):
- Load the dataset split (no train/val, fixed test sets).
- (Optional) Run RefusalAgent to filter infeasible instructions.
- Call ROIAgent to crop the initial ROI (e.g., 1000 px square).
- (Optional) Apply ScaleAgent for bicubic upscaling (default 3×).
- (Optional) Rewrite instruction via RewriteAgent for ambiguity.
- Run GroundingAgent to output prediction (x, y) within ROI.
- Compute and log metrics.
- Aggregate and export full results.
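A minimal sketch of this staged loop for a single sample, assuming hypothetical glue objects (`sample`, `agents`, `metrics`); the per-agent method names match the interfaces shown in Section 4:

```python
def evaluate_sample(sample, agents, metrics):
    """One pass through the staged GBT workflow for a single
    (instruction, image) pair; the glue objects are assumed."""
    instr, image = sample.instruction, sample.image

    # Stage 0 (optional, OSG only): filter infeasible instructions.
    if agents.refuser is not None and agents.refuser.predict(instr, image):
        return metrics.log_refusal(sample)

    # Stage 1: crop the initial ROI (e.g., a 1000 px square).
    crop, roi_box = agents.roi.predict_roi(instr, image)

    # Optional stages: bicubic upscaling (default 3x) and rewriting.
    if agents.scaler is not None:
        crop = agents.scaler.predict(instr, crop)
    if agents.rewriter is not None:
        instr = agents.rewriter.predict(instr, crop)

    # Stage 2: ground the (possibly rewritten) instruction to a point.
    x, y = agents.grounder.predict_point(instr, crop)
    return metrics.log_prediction(sample, roi_box, (x, y))
```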
3. Evaluation Metrics and Experimental Protocol
Formal metric definitions ensure GBT's results are interpretable and consistent:
- Intersection-over-Union (IoU):
$$\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|},$$
where $B_p$ is the predicted bounding box and $B_g$ the ground-truth region.
- Localization Accuracy (Top-1): the fraction of the $N$ test samples whose predicted point lands inside the ground-truth box,
$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[(x_i, y_i) \in B_g^{(i)}\right].$$
- Containment Rate (Stage 1 only): the fraction of samples whose cropped ROI $R^{(i)}$ fully contains the ground-truth region,
$$\mathrm{Cont} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[B_g^{(i)} \subseteq R^{(i)}\right].$$
- Composite Score (for an ROI of size $S$): balances containment rate against ROI compactness, so that smaller ROIs score higher at equal containment; this is the containment-vs-ROI-size trade-off that ReportGenerator plots.
- Refusal Accuracy and False Positive Rate (Stage 0, OSG only): with correctly refused infeasible instructions counted as true positives,
$$\mathrm{RefAcc} = \frac{TP + TN}{N}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$
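A minimal sketch of the IoU and Top-1 accuracy computations, assuming (left, top, right, bottom) box tuples (helper names are illustrative, not GBT's MetricCalculator API):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def top1_accuracy(points, gt_boxes):
    """Fraction of predicted points landing inside their ground-truth box."""
    hits = sum(
        x0 <= x <= x1 and y0 <= y <= y1
        for (x, y), (x0, y0, x1, y1) in zip(points, gt_boxes)
    )
    return hits / len(points)
```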
No training is performed in GBT; it is exclusively an evaluation harness. All agents are used as black-box APIs or local models, with their configuration (e.g., ROI sizes, scaling factors, rewriting prompts) controlled in a central config. All per-stage API call latencies are logged for inference cost estimation.
When a sampled click lands within multiple overlapping GUI elements, ties are broken using the instruction's labeled target. Reported ablations include ROI size sweeps, pruning-rate variation, structured prompt variants (a documented +6 pp accuracy gain), scaling-factor grids (+2.40 pp at 3×), and per-agent failure characterization.
4. Extensibility and Integration
GBT’s extensibility is realized through both software interface design and flexible pipeline configuration. YAML or JSON config files define agent/module selection, hyperparameters, and experimental protocol. Agent APIs are standardized:
```python
class ROIAgent:
    def predict_roi(self, instruction, image):
        ...  # returns (image_crop, bounding_box)

class GroundingAgent:
    def predict_point(self, instruction, image_crop):
        ...  # returns (x, y)
```
This design allows researchers to drop in new VLM adapters or reasoning modules (e.g., GPT-4o, Gemini 2.5 Pro, Qwen-VL, UI-TARS, or model ensembles) with minimal integration burden. The underlying architecture supports pipelines of arbitrary length—enabling workflows with multiple cascading modules for grounded instruction following, refusal handling, context rewriting, and context-aware scaling.
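For example, a new grounder drops in by implementing the GroundingAgent signature; the adapter below wraps a hypothetical VLM client (the client API, prompt, and reply parsing are assumptions, not part of GBT):

```python
class MyVLMGrounder(GroundingAgent):
    """Hypothetical adapter exposing an arbitrary VLM through
    GBT's GroundingAgent interface."""

    def __init__(self, client):
        self.client = client  # any VLM client with a query() call

    def predict_point(self, instruction, image_crop):
        # Prompt format and reply parsing are illustrative only.
        reply = self.client.query(
            image=image_crop, prompt=f"Return the click point for: {instruction}"
        )
        x, y = (float(v) for v in reply.strip("() ").split(","))
        return (x, y)
```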
GBT directly orchestrates the MEGA-GUI pipeline, instantiating MEGA-GUI’s agents (BidirectionalZoom, ConservativeScale, ContextAwareRewrite) as modules. Researchers can augment or replace these with new strategies or models to test algorithmic innovations or application to other datasets and tasks.
5. Usage Scenarios and Code Examples
GBT provides both a CLI and Python API for ease of deployment:
Command-Line Example:
```bash
python -m gbt.run_evaluation \
    --dataset ScreenSpot-Pro \
    --roi_zoom bidirectional \
    --roi_size 1000 \
    --scale 3 \
    --rewrite context_aware \
    --grounder ui-tars-72b \
    --output results/ssp_ui72b.json
```
Python API Example:
```python
from gbt import BenchmarkRunner, Agents

runner = BenchmarkRunner(
    dataset="OSWorld-G",
    roi_agent=Agents.BidirectionalZoom(
        delta_in=0.1, delta_out=0.05, E_max=5,
        S_min=1000**2, N_stable=3, eps=50,
    ),
    scaler=Agents.BicubicScaler(factor=3),
    rewriter=Agents.ContextAwareRewriter(),
    grounder=Agents.UITARS72B(local=True),
    refuser=Agents.AdvancedRefuser(),
)

results = runner.run()
print("Accuracy:", results.accuracy)
results.to_csv("osg_report.csv")
```
6. Quantitative Results and Empirical Findings
On the default MEGA-GUI pipeline (Gemini 2.5 Pro for ROI zoom, 3× scaling, GPT-4o rewriter, UI-TARS-72B grounder), GBT yields:
- ScreenSpot-Pro: 73.18% Top-1 accuracy (point-in-box)
- OSWorld-G: 68.63% Top-1 accuracy
Ablations reported in GBT include:
- ROI Size: Grid search (400–1800 px) identifies 1000 px as optimal.
- Pruning Rate: 10–30% sweep, trading off steps vs. pass rate.
- Rewrite Prompts: Structured outputs confer +6 percentage points in accuracy.
- Scaling: 1–4×, with 3× upscaling providing +2.40 percentage points.
All results use the explicit accuracy and containment formulas above. By exposing pipeline trade-offs and enabling controlled sweeps, GBT supports nuanced error analysis and robust benchmarking under varying pipeline settings.
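Such sweeps can be scripted directly against the Python API. A minimal sketch of the ROI-size grid search, reusing the BenchmarkRunner from Section 5 (the sweep loop and the mapping of ROI size to S_min are assumptions):

```python
from gbt import BenchmarkRunner, Agents

# ROI-size ablation: sweep the Stage 1 crop size over the 400-1800 px grid.
for roi_px in range(400, 2000, 200):
    runner = BenchmarkRunner(
        dataset="ScreenSpot-Pro",
        roi_agent=Agents.BidirectionalZoom(S_min=roi_px**2),
        scaler=Agents.BicubicScaler(factor=3),
        grounder=Agents.UITARS72B(local=True),
    )
    results = runner.run()
    print(f"roi_size={roi_px}px  accuracy={results.accuracy:.2%}")
```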
7. Significance and Broader Impact
GBT exemplifies a shift in grounding research toward reproducible, modular, and extensible empirical science. By standardizing experiment structure, evaluation, and reporting, it enables transparent cross-paper comparisons and rapid integration of new models and strategies. Its rigorous metrics and compositional pipeline architecture localize strengths and weaknesses at each pipeline stage, supporting targeted architectural and methodological innovation.
A plausible implication is the adoption of GBT (and similar toolkits in other grounding domains, e.g., ChartAB or Rifts) as a de facto standard for future grounding work in both academic and industrial contexts. Researchers are thus encouraged to integrate their agents and datasets into the GBT framework to advance methodological transparency and scientific progress in multimodal, interactive, and instruction-following AI (Kwak et al., 17 Nov 2025).