Bidirectional ROI Zoom Algorithm

Updated 24 November 2025
  • Bidirectional ROI Zoom Algorithm is a modular method that iteratively refines region proposals through dual-direction scaling for more accurate GUI element localization.
  • It standardizes the ROI extraction process by integrating multi-stage preprocessing, agent interfaces, and configurable workflows to ensure reproducible results.
  • Empirical evaluations demonstrate improved Top-1 accuracy and efficiency in grounding benchmarks, validating its superiority over one-shot approaches.

The Grounding Benchmark Toolkit (GBT) is an open-source, modular evaluation suite designed to standardize, reproduce, and extend empirical research on vision-language and instruction-following grounding tasks. Released alongside the MEGA-GUI framework, GBT provides a reference pipeline and API for rigorous benchmarking of natural language grounding in graphical user interfaces but is architected for extensibility across domains and models. GBT is characterized by principled data handling, multi-stage modular agent design, formally defined metrics, configurable workflows, and detailed reporting to facilitate reproducibility, ablation, and fair comparison in grounding research (Kwak et al., 17 Nov 2025).

1. Purpose, Scope, and Architecture

GBT was developed to address the absence of standardized, reproducible toolchains in GUI element grounding research, where prior systems often used closed, one-off pipelines with limited extensibility and unclear evaluation criteria. Its core goals are:

  • Standardization: Unified loaders, preprocessing, metrics, and report generation for GUI grounding datasets, enabling direct side-by-side comparison of models and strategies.
  • Reproducibility: Fixed test splits, no training or tuning on evaluation sets, explicit agent interfaces, and logging of all configuration and latency.
  • Modularity: Decoupled multi-stage pipelines (e.g., ROI deduction, scaling, rewriting, fine-grained grounding, refusal), with pluggable model or agent backends and hyperparameter grid search.
  • Extensibility: Easily supports new models, agents, datasets, and experimental protocols via clear Python/YAML interfaces and config-driven agent registration.

The core software modules are:

| Module | Functionality | Examples |
|---|---|---|
| DataLoader | Loads and splits GUI images, instructions, and annotations | SSP (1,581), OSG (564) |
| Preprocessor | Crops/zooms ROIs, rescales them, and normalizes images | Bicubic upscaling, polygon→rectangle |
| Agent Interface | Abstract .predict API for all agent types | ROIAgent, GroundingAgent, RefusalAgent |
| MetricCalculator | Computes all grounding and pipeline metrics | IoU, Accuracy, Containment, FPR, F1 |
| ExperimentRunner | Orchestrates pipeline, config sweeps, logging | Stage 0 (refusal) … Stage 2 (grounding) |
| ReportGenerator | Aggregates metrics, plots trade-offs, outputs JSON/CSV | Containment vs. ROI, accuracy curves |

Each agent operates as a black-box function from instruction and image to an artifact (e.g., ROI crop, point, refusal). Pipelines can be arbitrarily deep, supporting complex cascades for advanced multi-agent or multi-strategy research.
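
To make the black-box contract concrete, a cascade of agents can be expressed as ordinary function composition. The Agent alias and run_pipeline helper below are a minimal sketch of this idea, not part of the published GBT API.

from typing import Any, Callable, List, Optional

# A GBT-style agent maps (instruction, artifact) to a new artifact: an ROI
# crop, an upscaled image, a rewritten instruction, a point, or a refusal.
Agent = Callable[[str, Any], Any]

def run_pipeline(instruction: str, image: Any, stages: List[Agent]) -> Optional[Any]:
    """Thread an artifact through a cascade of black-box agents; a stage
    returning None models a refusal that short-circuits the cascade."""
    artifact = image
    for stage in stages:
        artifact = stage(instruction, artifact)
        if artifact is None:
            return None
    return artifact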

2. Supported Datasets, Preprocessing, and Workflow

The initial release of GBT focuses on two benchmarks:

  • ScreenSpot-Pro (SSP): 4K-resolution professional GUIs with 1,581 natural-language instructions, annotated with axis-aligned bounding boxes.
  • OSWorld-G (OSG): 1080p OS-style GUI screenshots with 564 instructions (510 feasible + 54 infeasible), annotated with polygons.

Preprocessing routines convert polygonal ground-truth to minimal enclosing rectangles for IoU computations, normalize pixel values, and optionally precompute Stage 1 ROI crops to speed repeated runs. For overlapping or ambiguous regions, GBT consistently resolves prediction correctness via ground-truth label checks rather than spatial heuristics.
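
As a concrete illustration of the polygon-to-rectangle conversion, the helper below computes the minimal enclosing axis-aligned rectangle; the function name and (x_min, y_min, x_max, y_max) layout are assumptions of this sketch.

from typing import List, Tuple

def min_enclosing_rect(polygon: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Minimal axis-aligned rectangle (x_min, y_min, x_max, y_max)
    enclosing a polygonal ground-truth annotation, for IoU computation."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)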

The workflow comprises the following steps; a minimal driver sketch follows the list:

  1. Load the dataset split (no train/val, fixed test sets).
  2. (Optional) Run RefusalAgent to filter infeasible instructions.
  3. Call ROIAgent to crop the initial ROI (e.g., 1000 px square).
  4. (Optional) Apply ScaleAgent for bicubic upscaling (default 3×).
  5. (Optional) Rewrite instruction via RewriteAgent for ambiguity.
  6. Run GroundingAgent to output prediction (x, y) within ROI.
  7. Compute and log metrics.
  8. Aggregate and export full results.
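
A minimal driver for these steps might look like the sketch below. The predict_roi and predict_point methods match the agent interfaces shown later in this article; should_refuse, upscale, and rewrite are hypothetical method names for the optional stages.

def evaluate_sample(sample, roi_agent, grounding_agent,
                    refusal_agent=None, scale_agent=None, rewrite_agent=None):
    """Run one benchmark sample through the optional Stage 0-2 cascade;
    returns a predicted point (x, y) or None for a refusal."""
    instruction, image = sample.instruction, sample.image

    # Stage 0 (optional): filter infeasible instructions.
    if refusal_agent is not None and refusal_agent.should_refuse(instruction, image):
        return None

    # Stage 1: crop the initial ROI (e.g., a 1000 px square).
    image_crop, _bbox = roi_agent.predict_roi(instruction, image)

    # Optional: bicubic upscaling (default 3x) and ambiguity-resolving rewrite.
    if scale_agent is not None:
        image_crop = scale_agent.upscale(image_crop)
    if rewrite_agent is not None:
        instruction = rewrite_agent.rewrite(instruction, image_crop)

    # Stage 2: predict the click point (x, y) within the ROI.
    return grounding_agent.predict_point(instruction, image_crop)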

3. Evaluation Metrics and Experimental Protocol

GBT defines its metrics formally so that results are interpretable and consistent across experiments; a reference implementation sketch follows the definitions below.

  • Intersection-over-Union (IoU):

$$\text{IoU} = \frac{\lvert B_{p} \cap B_{gt} \rvert}{\lvert B_{p} \cup B_{gt} \rvert}$$

where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground-truth region.

  • Localization Accuracy (Top-1):

$$\text{Accuracy} = \frac{\#\,\{\text{predictions } p : p \in B_{gt}\}}{\#\,\text{total samples}}$$

  • Containment Rate (Stage 1 only):

$$\text{Containment} = \frac{\#\,\{\text{ROIs enclosing the ground-truth}\}}{\#\,\text{total samples}}$$

  • Composite Score (for an ROI of size $S$):

$$\text{Composite}(S) = \text{Containment}(S) \times \text{Accuracy}$$

  • Refusal Accuracy and False Positive Rate (Stage 0, OSG only):

$$\text{RefusalAcc} = \frac{\#\,\{\text{infeasible tasks correctly refused}\}}{\#\,\text{infeasible tasks}}$$

$$\text{FPR} = \frac{\#\,\{\text{feasible tasks incorrectly refused}\}}{\#\,\text{feasible tasks}}$$

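These definitions translate directly into code. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) tuples; it is a reference implementation of the formulas above, not GBT's actual MetricCalculator.

def iou(bp, bgt):
    """IoU of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix2, iy2 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    union = area(bp) + area(bgt) - inter
    return inter / union if union > 0 else 0.0

def top1_accuracy(points, gt_boxes):
    """Fraction of predicted points (x, y) falling inside their ground-truth box."""
    hits = sum(b[0] <= x <= b[2] and b[1] <= y <= b[3]
               for (x, y), b in zip(points, gt_boxes))
    return hits / len(points)

def composite(containment_at_s, accuracy):
    """Composite(S) = Containment(S) * Accuracy."""
    return containment_at_s * accuracy
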
No training is performed in GBT; it is exclusively an evaluation harness. All agents are used as black-box APIs or local models, with their configuration (e.g., ROI sizes, scaling factors, rewriting prompts) controlled in a central config. All per-stage API call latencies are logged for inference cost estimation.

When a sampled click lands within multiple overlapping GUI elements, the tie is broken using the instruction's labeled target. Experimental ablations include ROI-size sweeps, pruning-rate variation, structured prompt variants (a documented +6 pp accuracy gain), scaling-factor grids (+2.40 pp at 3×), and reporting of per-agent failure characteristics.

4. Extensibility and Integration

GBT’s extensibility is realized through both software interface design and flexible pipeline configuration. YAML or JSON config files define agent/module selection, hyperparameters, and experimental protocol. Agent APIs are standardized:

class ROIAgent:
    def predict_roi(self, instruction, image):
        """Return (image_crop, bounding_box) for the deduced ROI."""

class GroundingAgent:
    def predict_point(self, instruction, image_crop):
        """Return the predicted click point (x, y) within the crop."""

This design allows researchers to drop in new VLM adapters or reasoning modules (e.g., GPT-4o, Gemini 2.5 Pro, Qwen-VL, UI-TARS, or model ensembles) with minimal integration burden. The underlying architecture supports pipelines of arbitrary length—enabling workflows with multiple cascading modules for grounded instruction following, refusal handling, context rewriting, and context-aware scaling.
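
For example, a new model can be wrapped behind the standardized grounding interface as below; the client object and its query method are placeholders for whatever SDK the model actually exposes.

class MyVLMGroundingAgent:
    """Hypothetical adapter exposing a new VLM through the GroundingAgent API."""

    def __init__(self, client):
        self.client = client  # placeholder for the model's own SDK client

    def predict_point(self, instruction, image_crop):
        # Prompt format and response parsing are assumptions of this sketch.
        reply = self.client.query(instruction=instruction, image=image_crop)
        return reply["x"], reply["y"]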

GBT directly orchestrates the MEGA-GUI pipeline, instantiating MEGA-GUI’s agents (BidirectionalZoom, ConservativeScale, ContextAwareRewrite) as modules. Researchers can augment or replace these with new strategies or models to test algorithmic innovations or application to other datasets and tasks.

5. Usage Scenarios and Code Examples

GBT provides both a CLI and Python API for ease of deployment:

Command-Line Example:

python -m gbt.run_evaluation \
  --dataset ScreenSpot-Pro \
  --roi_zoom bidirectional \
  --roi_size 1000 \
  --scale 3 \
  --rewrite context_aware \
  --grounder ui-tars-72b \
  --output results/ssp_ui72b.json

Supported flags include --dataset (ScreenSpot-Pro / OSWorld-G), --roi_zoom (none / one_shot / bidirectional), --scale (1–4), --rewrite (none / context_aware / spatio_visual / structured), and --grounder for the agent backend.

Python API Example:

from gbt import BenchmarkRunner, Agents

runner = BenchmarkRunner(
    dataset="OSWorld-G",
    roi_agent=Agents.BidirectionalZoom(
        delta_in=0.1, delta_out=0.05, E_max=5,
        S_min=1000**2, N_stable=3, eps=50,
    ),
    scaler=Agents.BicubicScaler(factor=3),
    rewriter=Agents.ContextAwareRewriter(),
    grounder=Agents.UITARS72B(local=True),
    refuser=Agents.AdvancedRefuser(),
)
results = runner.run()
print("Accuracy:", results.accuracy)
results.to_csv("osg_report.csv")

This usage supports compositional experimentation across every stage of the pipeline.

6. Quantitative Results and Empirical Findings

With the default MEGA-GUI pipeline (Gemini 2.5 Pro for ROI zoom, 3× scaling, GPT-4o rewriter, UI-TARS-72B grounder), GBT reports:

  • ScreenSpot-Pro: 73.18% Top-1 accuracy (point-in-box)
  • OSWorld-G: 68.63% Top-1 accuracy

Ablations reported in GBT include:

  • ROI Size: Grid search (400–1800 px) identifies 1000 px as optimal.
  • Pruning Rate: 10–30% sweep, trading off steps vs. pass rate.
  • Rewrite Prompts: Structured outputs confer +6 percentage points in accuracy.
  • Scaling: 1–4×, with 3× upscaling providing +2.40 percentage points.

All results use the explicit accuracy and containment formulas above. By exposing pipeline trade-offs and enabling controlled sweeps, GBT supports nuanced error analysis and robust benchmarking under varying pipeline settings.
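
Such sweeps can be scripted directly against the CLI. The loop below reruns the ScreenSpot-Pro evaluation across the reported ROI-size grid; the 200 px step and output paths are assumptions, while the flags follow the command-line example above.

import subprocess

# Sweep the Stage 1 ROI size across the ablation grid (400-1800 px).
for roi_size in range(400, 1801, 200):
    subprocess.run([
        "python", "-m", "gbt.run_evaluation",
        "--dataset", "ScreenSpot-Pro",
        "--roi_zoom", "bidirectional",
        "--roi_size", str(roi_size),
        "--scale", "3",
        "--grounder", "ui-tars-72b",
        "--output", f"results/ssp_roi{roi_size}.json",
    ], check=True)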

7. Significance and Broader Impact

GBT exemplifies a shift in grounding research toward reproducible, modular, and extensible empirical science. By standardizing experiment structure, evaluation, and reporting, it enables transparent cross-paper comparisons and facilitates the rapid integration of new models and strategies. The toolkit's rigorous metrics and compositional pipeline architecture make it possible to identify strengths and weaknesses at each pipeline stage, supporting targeted architectural and methodological innovation.

A plausible implication is the adoption of GBT (and similar toolkits in other grounding domains, e.g., ChartAB or Rifts) as a de facto standard for future grounding work in both academic and industrial contexts. Researchers are thus encouraged to integrate their agents and datasets into the GBT framework to advance methodological transparency and scientific progress in multimodal, interactive, and instruction-following AI (Kwak et al., 17 Nov 2025).
