OSWorld-G: GUI Grounding Benchmark

Updated 10 October 2025
  • OSWorld-G is a comprehensive benchmark designed to evaluate GUI grounding using 564 annotated real-world samples covering 32 distinct UI element types.
  • It challenges models with diverse subtasks including text matching, spatial layout reasoning, element recognition, fine-grained manipulation, and refusal behavior.
  • Integration with the large-scale Jedi dataset (4 million examples) has led to significant accuracy gains and enhanced operational robustness in autonomous agents.

OSWorld-G is a comprehensive benchmark and dataset designed to rigorously evaluate and advance the state of graphical user interface (GUI) grounding for autonomous computer-use agents. By capturing the diversity and complexity of real user interactions across varied interface elements, OSWorld-G addresses longstanding deficiencies in existing benchmarks, revealing key technical challenges and fostering progress in operational multimodal agent research.

1. Definition and Scope

OSWorld-G is a curated benchmark comprising 564 finely annotated samples targeting realistic, open-domain GUI grounding tasks. Each sample is labeled with one of 32 unique UI element types and annotated for granular capabilities including text matching, element recognition, spatial layout reasoning, fine-grained manipulation, and refusal behavior (handling infeasible instructions). Grounding is evaluated by spatial containment: predicted action coordinates must fall within the gold bounding box. This framework tests an agent’s ability to bridge language understanding, visual recognition, operational knowledge, and mouse/keyboard control within authentic GUI scenarios (Xie et al., 19 May 2025).
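
The containment criterion is simple to implement. Below is a minimal sketch in Python of how such a check, and the accuracy it induces, might be computed; the coordinate and box representations are illustrative assumptions, not the benchmark's actual schema.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)

def is_hit(pred: Tuple[float, float], gold_box: BBox) -> bool:
    """A prediction counts as correct iff the predicted click
    coordinates fall inside the gold bounding box (inclusive)."""
    x, y = pred
    left, top, right, bottom = gold_box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(preds: List[Tuple[float, float]],
                       gold_boxes: List[BBox]) -> float:
    """Accuracy = fraction of predictions inside their gold boxes."""
    hits = sum(is_hit(p, b) for p, b in zip(preds, gold_boxes))
    return hits / len(preds) if preds else 0.0

# Example: one hit, one miss -> 0.5
print(grounding_accuracy([(120, 48), (300, 90)],
                         [(100, 40, 160, 60), (0, 0, 50, 50)]))
```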

Distinct from its predecessor OSWorld (Xie et al., 11 Apr 2024), which focuses on end-to-end computer task execution by multimodal agents, OSWorld-G isolates the critical grounding step—the mapping from natural language to precise GUI actions—enabling targeted analysis and improvement of this bottleneck. The benchmark also introduces refusal cases (where the commanded UI element is absent) as an explicit capability requirement.

2. Benchmark Structure and Task Typology

OSWorld-G’s annotation taxonomy systematically covers operationally meaningful GUI elements and subtasks (a schematic sample record follows the list):

  • Text Matching: Locating UI components via explicit text (e.g., “Save As” in menu bars).
  • Element Recognition: Identifying functional but non-textual elements (icons, checkboxes, sliders) based on visual and semantic cues.
  • Layout Understanding: Reasoning over spatial hierarchies (e.g., recognizing that notification bars appear at the top of the screen, or that certain controls nest within specific panels).
  • Fine-Grained Manipulation: Achieving sub-element spatial precision such as placing the cursor between characters or interacting with small UI affordances.
  • Refusal (Negative Grounding): Correctly identifying and rejecting instructions for absent or unreachable elements.
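
To make the taxonomy concrete, a single benchmark sample might be represented roughly as follows. This is a hypothetical schema for illustration only; the field names and the released annotation format may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundingSample:
    screenshot_path: str   # full-screen capture of the UI state
    instruction: str       # natural-language command to ground
    element_type: str      # one of the 32 UI element types, e.g. "checkbox"
    capability: str        # "text_matching" | "element_recognition" |
                           # "layout_understanding" | "manipulation" | "refusal"
    gold_box: Optional[Tuple[int, int, int, int]]  # (left, top, right, bottom);
                                                   # None for refusal cases, where
                                                   # the correct behavior is to reject

sample = GroundingSample(
    screenshot_path="screens/spreadsheet_001.png",
    instruction="Place the cursor between the '2' and '5' in cell B3.",
    element_type="text_field",
    capability="manipulation",
    gold_box=(412, 230, 420, 246),
)
```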

Each sample is annotated to ensure comprehensive coverage across both element categories and capability axes, enabling nuanced evaluation. The labeling protocol ensures that challenging real-world phenomena—such as occluded elements, non-standard layouts, or element function ambiguity—are included.

3. The Jedi Dataset: Scale, Synthesis, and Multi-Perspective Annotation

To support robust model development, the benchmark is paired with the Jedi dataset—the largest and most comprehensive resource of its kind, containing 4 million grounding examples. Jedi leverages multi-perspective decoupling:

  • Data Source Diversity: GUI icons are mined from open-source repositories; components are rendered from production UI libraries (Material UI, Ant Design); layouts are sampled from actual operating system screens.
  • Subtask Decomposition: Data is synthesized and annotated for distinct subtasks—icon captioning, component manipulation, layout reasoning—enabling focused learning of different grounding facets.
  • Refusal Data: Over 2.6 million refusal cases are generated by mismatching instructions with irrelevant screen contexts, expressly training models to handle infeasibility (a synthesis sketch follows this list).
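
As a rough illustration of the refusal-synthesis idea, instructions can be paired with screenshots that do not contain the referenced element. The pairing logic below is a simplified guess at the approach, with assumed dictionary fields, not the paper's actual pipeline.

```python
import random

def synthesize_refusals(samples, n_refusals, seed=0):
    """Create refusal cases by pairing an instruction with a screenshot
    from a different screen, so the referenced element is absent and
    the correct behavior is to refuse."""
    rng = random.Random(seed)
    refusals = []
    attempts = 0
    while len(refusals) < n_refusals and attempts < 100 * n_refusals:
        attempts += 1
        a, b = rng.sample(samples, 2)
        if a["screen_id"] == b["screen_id"]:
            continue  # same screen: the target element might still be present
        refusals.append({
            "screenshot": b["screenshot"],    # unrelated screen
            "instruction": a["instruction"],  # refers to an element not on it
            "label": "refuse",                # gold behavior: decline to act
        })
    return refusals
```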

Annotation quality is verified with manual tools (e.g., CVAT) and automatic LLM-based filtering, ensuring label precision. This multi-perspective decoupling (the separation of data sources and subtasks so that models can learn grounding facets compositionally) is critical for enabling models to generalize to new, previously unseen GUI configurations, an ability demanded by real-world deployment but unsupported by prior datasets.
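
The automatic filtering step might look something like the sketch below, where `llm_judge` stands in for whatever model-based validator is actually used; the function, the prompt, and the sample fields are all hypothetical.

```python
def llm_judge(prompt: str) -> str:
    """Hypothetical LLM wrapper; in practice this would call a real
    model endpoint. It is a placeholder, not an actual API."""
    raise NotImplementedError

def passes_filter(sample: dict) -> bool:
    """Keep a sample only if the judge says the instruction refers
    unambiguously to the annotated element."""
    prompt = (
        "Answer YES if the instruction unambiguously refers to the "
        "described element, otherwise answer NO.\n"
        f"Element: {sample['element_desc']}\n"      # assumed field name
        f"Instruction: {sample['instruction']}\n"   # assumed field name
    )
    return llm_judge(prompt).strip().upper().startswith("YES")

def filter_dataset(raw_samples: list) -> list:
    return [s for s in raw_samples if passes_filter(s)]
```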

4. Model Evaluation and Performance Analysis

Fine-tuning models on Jedi results in state-of-the-art performance on OSWorld-G, as well as prior screen grounding benchmarks (ScreenSpot-v2, ScreenSpot-Pro). Empirical outcomes include:

  • Accuracy: Grounding accuracy is computed as the proportion of predicted actions whose coordinates fall inside the labeled bounding box. On the downstream OSWorld suite, agentic success rates rise from 5% (with a standard grounding module) to 27% when enhanced with Jedi-trained grounding.
  • Capability Breakdown: Detailed tables report model performance gains across text matching, element recognition, layout understanding, and manipulation subtasks.
  • Refusal Proficiency: Inclusion of refusal cases enables models to reject infeasible instructions, substantially improving safe operation.
  • Qualitative Case Studies: Jedi-equipped agents demonstrate compositional reasoning, such as selecting the exact cell in a spreadsheet or disambiguating between visually similar icons—behaviors not reliably realized by earlier systems.

Ablation studies systematically quantify contributions from icons, layout, component, and refusal data streams, consistently finding that multitask training yields superior performance versus single-stream approaches.

5. Impact on Agentic Capabilities and Downstream Tasks

OSWorld-G and Jedi provide substantial improvements in agentic capacity for general foundation models tasked with executing complex computer operations. When the Jedi-trained grounding module is integrated with general-purpose planners (such as GPT-4o), the resulting system can (see the sketch after this list):

  • Reliably translate complex language instructions to precise, contextually correct GUI actions.
  • Execute multi-step workflows with high operational safety, reducing repetitive failures and errant clicks.
  • Exhibit compositional generalization to unfamiliar or novel UI configurations, a prerequisite for modern automation agents across dynamic software landscapes.
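
A minimal sketch of such a planner-plus-grounder decomposition follows. The module names and interfaces are hypothetical, meant only to show the division of labor, not the paper's actual code.

```python
def run_task(task: str, planner, grounder, env, max_steps: int = 30):
    """Planner proposes a natural-language step; the grounding model
    maps it to screen coordinates (or refuses); the environment
    executes the click and returns a fresh screenshot."""
    screenshot = env.observe()
    for _ in range(max_steps):
        step = planner.next_step(task, screenshot)   # e.g. "Click 'File' in the menu bar"
        if step is None:                             # planner declares the task done
            return True
        target = grounder.locate(step, screenshot)   # (x, y) or None (refusal)
        if target is None:
            planner.report_failure(step)             # let the planner replan
        else:
            env.click(*target)
        screenshot = env.observe()
    return False
```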

Direct improvements are evidenced by a rise in full-task completion on OSWorld from 5% to 27% when Jedi is employed as the grounding core. Case studies confirm the system’s practical ability to handle fine manipulation (e.g., precisely placing cursors) and robustly differentiate functionally similar elements by context.

6. Open-Source Resources and Reproducibility

All benchmark tasks, Jedi dataset, model checkpoints, and codebases are released at https://osworld-grounding.github.io. Supplementary resources include data synthesis scripts, rendering utilities (e.g., Playwright), LLM prompting pipelines, and annotation tools. These enable reproducibility, facilitate baselines for future research, and support seamless extension to new datasets, model architectures, or integration into full-stack computer-use agents.
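
For example, component screenshots and gold bounding boxes can be produced with Playwright along the following lines; the URL and selector are placeholders, and the released synthesis scripts may differ in detail.

```python
from playwright.sync_api import sync_playwright

# Render a component page, capture a screenshot, and record the target
# element's bounding box as a grounding label.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("http://localhost:3000/demo")  # placeholder URL for a rendered component
    page.screenshot(path="screen.png")
    box = page.locator("#save-button").bounding_box()  # placeholder selector
    if box:  # bounding_box() returns {'x','y','width','height'} or None
        gold = (box["x"], box["y"],
                box["x"] + box["width"], box["y"] + box["height"])
        print("gold box:", gold)
    browser.close()
```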

The benchmark's design explicitly encourages modular experimentation: researchers can probe specific grounding dimensions, conduct targeted ablation studies, and validate improvements across both legacy and emergent GUI environments.

7. Significance and Future Directions

OSWorld-G establishes a new standard for evaluating and developing GUI grounding in the context of open-ended, real-world computer use. Its methodological rigor (multi-perspective annotation, grounded refusal tasks, and a focus on operationally meaningful coordinates) addresses persistent gaps in legacy benchmarks. The demonstrated improvements, particularly in compositionality and agentic robustness, highlight the importance of large-scale, task-decomposed resources for practical agent deployment.

Anticipated future work includes expanding the grounding categories (e.g., dynamic elements, adaptive layouts), further automating high-quality annotation, and coupling grounding more tightly with high-level agent planning, with the goal of robust, safe, and general autonomous operation across real computer systems.
