
Cross-Platform GUI Dataset Overview

Updated 10 March 2026
  • Cross-Platform GUI Datasets are curated collections of interface examples from various OSes and devices, enhancing model robustness.
  • They are built using techniques such as automated crawling, simulated rendering, native instrumentation, and crowdsourced annotation.
  • These datasets benchmark and advance GUI grounding, navigation, and cross-app generalization through structured annotations and scene graphs.

A cross-platform GUI dataset is a curated or automatically generated collection of graphical user interface (GUI) instances and related annotations that span multiple operating systems, device form factors, application types, and resolutions. Such datasets are essential for developing and evaluating robust, generalizable GUI agents, including GUI grounding models and instruction-following automation agents. The emergence of web-scale, multi-environment datasets has enabled significant progress toward platform-invariant understanding, goal-conditioned navigation, cross-application generalization, and hierarchical reasoning in autonomous GUI interaction research.

1. Concept and Motivation

Cross-platform GUI datasets are constructed to address the inherent heterogeneity of device environments (phones, tablets, desktops, web browsers), of UI paradigms (Material, Cupertino, native vs. web), and of the evolving visual and structural characteristics of GUIs. Without comprehensive, multi-resolution corpora, models risk overfitting to specific device profiles or application layouts, resulting in poor transferability and limited applicability in real-world automation or assistive scenarios.

Key motivations include:

  • Enabling GUI grounding—localizing or identifying interface elements given natural language descriptions—across OSs and display resolutions.
  • Supporting cross-platform transferability: generalizing GUI skills (element selection, navigation) from training environments (e.g., Android) to previously unseen environments (e.g., iOS, Windows, Linux).
  • Facilitating cross-version and cross-app generalization: adapting to version updates, structural drift, or functional divergence in the interface (Lu et al., 23 May 2025).
  • Providing longitudinal benchmarks for robustness and overfitting detection in GUI-specialized vs. generalist multimodal models (Zhou et al., 18 Dec 2025).

2. Dataset Construction Methodologies

Modern cross-platform GUI datasets are created via a combination of automated web crawling, simulated device rendering, native app instrumentation, crowd-sourced annotation, and synthetic augmentation. Central approaches include:

  • Automated browser-based rendering: Large-scale datasets such as Insight-UI (Shen et al., 2024) are generated by rendering filtered Common Crawl HTML pages under multiple "device emulations" (iOS, Android, Windows, Linux) and resolutions using tools like Puppeteer. Random walk-based simulated interactions (clicks, scrolls, input) are executed, with each step yielding screenshots and associated DOM-derived metadata.
  • Native application emulation: Datasets such as those used for GUI grounding or planning (e.g., ZonUI-3B's AMEX source, Aguvis's AndroidWorld) employ frameworks like UIAutomator (Android), XCTest (iOS), and OS-specific automation APIs (pywinauto, ApplicationServices) to interact with installed apps, capturing screenshots and accessibility trees in situ (Hsieh et al., 30 Jun 2025, Xu et al., 2024).
  • Crowdsourcing and hybrid annotation: For high semantic fidelity, some datasets introduce verification or manual semantic grouping stages (e.g., GUILabeller in TransBench (Lu et al., 23 May 2025), multi-annotator pipelines in GUIDE (Chawla et al., 2024)).
  • Synthetic augmentation and VLM-augmented reasoning: Several datasets generate additional samples by synthesizing instructions, perturbations for negative sampling (refusal/overfitting probes), or planning trajectories with the help of vision-LLMs (e.g., Qwen2VL, GPT-4o) (Zhou et al., 18 Dec 2025, Gao et al., 22 Jun 2025).
  • Video-based extraction: MONDAY (Jang et al., 19 May 2025) demonstrates large-scale automated extraction of UI actions and transitions from instructional videos, leveraging OCR-based scene detection and object detection to assemble annotated action sequences in a scalable, cost-effective way.
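
The random-walk rendering approach can be sketched schematically. The snippet below walks a simplified, static UI tree rather than a live DOM; the `UINode` schema, role names, and action labels are illustrative assumptions, not any dataset's actual format. A real pipeline would dispatch each action through a browser driver (e.g., Puppeteer) and re-crawl the page between steps, capturing a screenshot and DOM metadata at each one.

```python
import random
from dataclasses import dataclass, field

@dataclass
class UINode:
    """A simplified DOM/accessibility node (hypothetical schema)."""
    node_id: str
    role: str                      # e.g. "button", "link", "textbox"
    bounds: tuple                  # (x, y, width, height) in page pixels
    clickable: bool = False
    children: list = field(default_factory=list)

def interactable(root: UINode) -> list:
    """Collect all clickable nodes reachable from the root."""
    found = [root] if root.clickable else []
    for child in root.children:
        found.extend(interactable(child))
    return found

def random_walk(root: UINode, max_steps: int = 5, seed: int = 0) -> list:
    """Pick a random interactable element at each step and log a
    (step, node_id, action) record, mimicking a simulated interaction chain."""
    rng = random.Random(seed)
    trace = []
    for step in range(max_steps):
        candidates = interactable(root)
        if not candidates:
            break
        target = rng.choice(candidates)
        action = "TYPE" if target.role == "textbox" else "CLICK"
        trace.append({"step": step, "node_id": target.node_id, "action": action})
        # A real pipeline would execute the action in the browser and
        # re-crawl the DOM here; this mock tree stays static.
    return trace
```

The `max_steps=5` default mirrors the shallow interaction chains noted for such pipelines (see Section 6).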

3. Data Structures, Annotations, and Task Definitions

Cross-platform GUI datasets leverage highly structured annotation schemas, reflecting both the variety of tasks enabled and the depth of reasoning required.

  • Core records: Typically include lossless screenshot images (PNG), bounding boxes or polygons for element localization, element categories (buttons, text fields, icons, menus, etc.), natural-language grounding instructions, and interaction/action history.
  • Scene graphs: Insight-UI and similar datasets store DOM- or accessibility-based scene graphs for each frame, with node-level properties (type, bounds, clickability, visibility, OCR text) (Shen et al., 2024).
  • Interaction sequences and planning chains: For navigation/automation datasets, episodes consist of ordered screen–action pairs, with detailed action types (CLICK, SCROLL, TYPE, HOME, etc.), time-aligned multi-app traces (GUI Odyssey (Lu et al., 2024)), and agent memory states (short/long-term memory in GUI Odyssey-CoM (Gao et al., 22 Jun 2025)).
  • Grounding and reasoning tasks: Benchmarks such as VenusBench-GD define a six-fold taxonomy, splitting grounding into basic (element, spatial, visual) and advanced (reasoning, functional, refusal) tasks, each with aligned instructions and hierarchical evaluation (Zhou et al., 18 Dec 2025).
  • Standardized actions and normalization: Several corpora encode actions via Pythonic APIs (pyautogui, mobile API shims), and record both absolute and normalized coordinates to enable device/viewport-scale invariance (Xu et al., 2024, Wu et al., 2024).
  • Negative sampling and adversarial evaluation: OS-Oracle synthesizes errors including operation failure, inefficient recovery, premature/late termination, and inaccurate localization to support robust step-level critic model training (Wu et al., 18 Dec 2025).
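
As a concrete illustration of such records, the sketch below combines screenshot, instruction, element-type, and bounding-box fields and derives a normalized center point of the kind used for viewport-scale invariance. All field names here are assumptions for illustration, not any dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GroundingRecord:
    """One element-level grounding annotation (illustrative schema)."""
    screenshot: str    # path to a lossless PNG capture
    instruction: str   # natural-language grounding instruction
    element_type: str  # e.g. "button", "text_field", "icon"
    bbox: tuple        # (x, y, w, h) in absolute pixels
    viewport: tuple    # (width, height) of the capture

    def center_norm(self) -> tuple:
        """Normalized [0, 1] center of the box, making targets
        comparable across devices and viewport scales."""
        x, y, w, h = self.bbox
        vw, vh = self.viewport
        return ((x + w / 2) / vw, (y + h / 2) / vh)
```

Storing both the absolute `bbox` and the normalized center follows the dual absolute/normalized coordinate convention described above.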

4. Scale, Diversity, and Platform Coverage

State-of-the-art cross-platform GUI datasets are defined by their sweeping coverage, scale, and diversity of both data modalities and application scenarios.

| Dataset | Screenshots | Platforms | UI Elements | Key Coverage |
| --- | --- | --- | --- | --- |
| OS-Atlas (Wu et al., 2024) | 2.24M | Web, Win, Linux, Mac, Android | 13.6M | Web + desktop + Android; OOD splits |
| ScaleCUA (Liu et al., 18 Sep 2025) | ≈2.0M | Win, Linux, Mac, iOS, Android, Web | 19K traj, 17M actions | 3 domains; a11y/DOM/acc tree metadata |
| Aguvis (Xu et al., 2024) | 1.04M (grounding), 300K (planning) | Web/Win/Mac/Linux/Android | — | Planning and grounding splits; normalized actions |
| MONDAY (Jang et al., 19 May 2025) | 313K frames | iOS, Android | — | Video-to-dataset, instructional videos |
| Insight-UI (Shen et al., 2024) | 1.46M | iOS/Android/Win/Linux (Web) | — | 312K domains × 6 res; navigation w/o instruction |
| VenusBench-GD (Zhou et al., 18 Dec 2025) | 2,172 screens | Web/Mobile/Desktop (EN/CN) | 43K+ candidates | 6,166 gold pairs covering 13 element types |
| ZonUI-3B (Hsieh et al., 30 Jun 2025) | 24,100 | Mobile/Desktop/Web | 480K | Multi-res, multi-platform; fine-tuning protocol |

Typical coverage includes desktop OSs (Windows, macOS, Linux), mobile (Android, iOS), and web (various browsers, resolutions, dark/light mode). Application domains span e-commerce, productivity, entertainment, utilities, finance, and more (VenusBench-GD: 97 apps over 10 domains, (Zhou et al., 18 Dec 2025); GUI-Xplore: 312 apps, 33 categories, (Sun et al., 22 Mar 2025)).

Resolution, layout, and design-style diversity are achieved by capturing at multiple viewport sizes (e.g., Insight-UI: 360×640 to 2560×1440), randomizing device and browser emulations, encouraging multi-device demonstrations (GUI Odyssey: six mobile devices), and explicitly balancing sampling (ZonUI-3B: 1:1:1 across platforms) (Shen et al., 2024, Hsieh et al., 30 Jun 2025).

5. Benchmarking, Applications, and Research Impact

Cross-platform GUI datasets serve both as training sources and standardized testbeds for GUI agent evaluation across several task dimensions:

  • GUI grounding and pointing: Model predicts target coordinates for an instruction (e.g., "Click the Settings button"); accuracy measured as point-in-box hit, normalized distance, or IoU (Lu et al., 23 May 2025, Zhou et al., 18 Dec 2025).
  • Navigation and multi-step planning: Episode-level action matching, success rate (all actions correct), and reasoning trajectory analysis (OdysseyAgent (Lu et al., 2024), Aguvis (Xu et al., 2024)).
  • Cross-domain transfer evaluation: ΔAccuracy between platforms, versions, or apps (TransBench (Lu et al., 23 May 2025)); out-of-domain vs. in-domain splits.
  • Step-level critic and rationale evaluation: Binary Yes/No correctness plus rationale generation for stepwise decisions (OS-Oracle (Wu et al., 18 Dec 2025)).
  • Cross-platform GUI design: Pairwise GUI matching and conversion between devices (Papt: 10,035 phone-tablet GUI pairs, (Hu et al., 2023)), adaptive wireframe/semantic layout transfer.
  • Multi-level and adversarial grounding: Hierarchical taxonomy (VenusBench-GD), including advanced tasks (reasoning, refusal) to expose overfitting and probe global understanding (Zhou et al., 18 Dec 2025).
  • Practical automation and RPA: End-to-end tasks for robotic process automation, chain-of-thought modeling, and cross-website scripting (GUIDE (Chawla et al., 2024), V-Zen model).
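
The point-in-box and IoU criteria used in grounding evaluation can be written in a few lines. This is a generic sketch of the standard definitions, not any benchmark's official scorer; boxes are assumed to be (x, y, w, h) tuples in pixels.

```python
def hit(point: tuple, bbox: tuple) -> bool:
    """True if a predicted (x, y) point falls inside the gold box."""
    px, py = point
    x, y, w, h = bbox
    return x <= px <= x + w and y <= py <= y + h

def pointing_accuracy(preds: list, golds: list) -> float:
    """Fraction of predicted points landing inside their gold boxes."""
    if not golds:
        return 0.0
    return sum(hit(p, b) for p, b in zip(preds, golds)) / len(golds)

def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0
```

Episode-level navigation metrics build on the same primitives: an episode counts as a success only if every step's action matches.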

Notable results include Falcon-UI achieving SOTA or near-SOTA performance on Android and Web benchmarks after pretraining on Insight-UI (Shen et al., 2024); empirical demonstration that high-fidelity, platform-diverse pretraining substantially boosts agent generalization, outperforming models focused on a single platform (Jang et al., 19 May 2025, Hsieh et al., 30 Jun 2025).

6. Limitations and Future Directions

Despite rapid growth, current cross-platform GUI datasets show important limitations:

  • Native app coverage: Large synthetic or web-dominated datasets (e.g., Insight-UI, OS-Atlas) offer little or no coverage of true native mobile apps; existing coverage comes primarily via web rendering under mobile viewports or public APKs (Shen et al., 2024).
  • Resolution and layout gaps: Exotic devices (foldables, ultra-wide screens) and high-DPI desktops are under-represented; small fixed viewport sets (e.g., six resolutions per page) dominate, omitting interpolation effects and edge-case layouts.
  • Interaction horizon: Most simulated interaction chains are shallow (cap at 5 steps); long-horizon, multi-form, multi-modal flows are rare, limiting longitudinal generalization and persistent memory modeling (Shen et al., 2024).
  • Language and modality: Presently, English and bilingual (EN/CN) datasets dominate; multi-lingual annotation, voice-driven, and dynamic content scenarios remain limited.
  • Synthetic vs. real-world validity: Datasets leveraging VLMs to generate instructions, rationales, or negative samples (e.g., OS-Oracle) may accumulate annotation noise or systematic biases. Human review is costly and not exhaustive.
  • Limited application types: Rare, non-mainstream, or highly-specialized applications are underrepresented. A plausible implication is that industrial or accessibility-focused automation research may demand further extension and curation.

Future work is centered on extending to true native app GUIs via emulator or device instrumentation, scaling to broader languages and locales, increasing interaction depth (goal-driven random walks, memory-rich agents), and integrating richer, accessibility-aware annotations (ARIA, screen reader metadata, live content) (Shen et al., 2024, Zhou et al., 18 Dec 2025). There is ongoing interest in modular, extensible collection pipelines and standardized APIs for community-driven expansion and rigorous benchmark maintenance (Wu et al., 2024, Liu et al., 18 Sep 2025).
