GUIDE Dataset: GUI Data for RPA & MLLM

Updated 20 November 2025
  • GUIDE dataset is a comprehensive collection of annotated GUI interactions, sourced from web applications like Apollo.io and Canva, designed for RPA and MLLM tasks.
  • It employs a meticulous semi-automated annotation pipeline that captures high-resolution screenshots, detailed task descriptions, action histories, and precise UI bounding boxes.
  • The dataset underpins benchmarks for next action prediction and grounding, with V-Zen achieving 93.2% task accuracy and 89.7% grounding accuracy, highlighting its research impact.

The GUIDE dataset refers to several datasets with distinct research purposes across domains. In the context of Robotic Process Automation (RPA) and Multimodal LLMs (MLLMs), GUIDE stands for "Graphical User Interface Data for Execution," a large-scale, meticulously annotated corpus of real-world graphical user interface (GUI) interactions designed to advance automation and multimodal reasoning tasks (Chawla et al., 9 Apr 2024). Separately, Guide3D is a bi-planar X-ray dataset targeting 3D reconstruction benchmarks for endovascular tools (Jianu et al., 29 Oct 2024). This article focuses on the former: GUIDE for RPA and MLLM research.

1. Dataset Scope and Structure

GUIDE comprises thousands of annotated interaction examples sourced from four widely used web applications: Apollo.io (62.67%), Canva (22.92%), Google Calendar (10.98%), and Gmail (3.43%). Each entry consists of a high-resolution screenshot, a natural language task description, a record of the last action performed, complete previous action history, a step-by-step chain-of-thought (CoT) rationale, the next intended action, and corresponding grounding coordinates for that action. The dataset incorporates systematic platform variation—Windows, macOS, Linux operating systems; Chrome, Firefox, Safari browsers; and both light and dark theming—augmented synthetically to induce interface diversity and simulate real user conditions (Chawla et al., 9 Apr 2024).
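
As a concrete illustration, one GUIDE interaction example can be modeled as the record below. This is a minimal sketch only: the field names mirror the annotation schema detailed in Section 2, and the dataset itself ships JSON files rather than Python types.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Illustrative sketch of a single GUIDE entry; field names follow the
# annotation schema described in Section 2 and are not literal type
# definitions from the dataset release.
@dataclass
class GuideEntry:
    image_path: str                      # path to the PNG screenshot
    task_description: str                # natural language goal
    previous_action: str                 # label of the last action performed
    previous_action_history: List[str]   # full prior action sequence
    chain_of_thought: str                # free-text step-by-step rationale
    next_action: str                     # e.g. "CLICK: send button"
    grounding: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in [0, 1]
```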

Table: Web Application Distribution

| Application | Percentage | Primary Tasks |
|---|---|---|
| Apollo.io | 62.67% | Enterprise CRM, lead management, workflow navigation |
| Canva | 22.92% | Graphic design, text editing, layout manipulation |
| Google Calendar | 10.98% | Event creation, participant invite, scheduling |
| Gmail | 3.43% | Email composition, attachment, sending/archiving |

2. Annotation Pipeline and Data Schema

Sample annotation is performed semi-automatically using the in-house browser tool NEXTAG (Next Action Grounding and Annotation Tool). This extension logs granular user interface interactions—including clicks, text entries, and scrolls—while capturing pixel-precise UI element coordinates. Annotators validate and edit suggested bounding boxes, enter task-specific instruction texts, and provide CoT rationales for each state transition. Each JSON-formatted annotation includes:

  • image_path: Path to the PNG screenshot.
  • task_description: Natural language goal.
  • previous_action: Label for the last action.
  • previous_action_history: Action sequence.
  • chain_of_thought: Free-text reasoning.
  • next_action: Canonical label for the immediate next step (e.g., "CLICK: send button").
  • grounding: Four floats (x_1, y_1, x_2, y_2), normalized to [0, 1], denoting the bounding box of the target UI element for the next action.
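
A minimal sketch of loading and sanity-checking one such annotation is shown below. The single-file-per-sample layout is an assumption; the required keys and the normalized-coordinate constraint follow the schema above.

```python
import json

# Required keys per the annotation schema listed above.
REQUIRED_KEYS = {
    "image_path", "task_description", "previous_action",
    "previous_action_history", "chain_of_thought", "next_action", "grounding",
}


def load_annotation(path: str) -> dict:
    """Load one GUIDE annotation and check it against the schema above."""
    with open(path, "r", encoding="utf-8") as f:
        ann = json.load(f)

    missing = REQUIRED_KEYS - ann.keys()
    if missing:
        raise ValueError(f"annotation is missing keys: {sorted(missing)}")

    # Grounding boxes are normalized: 0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1.
    x1, y1, x2, y2 = ann["grounding"]
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError(f"grounding box out of range: {ann['grounding']}")

    return ann
```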

Quality control involves systematic review cycles, entry-by-entry criteria checking (annotation correctness, image clarity, bounding box tightness), and, where needed, feedback to developers and annotators for iterative improvements. Inter-annotator agreement metrics are not explicitly reported (Chawla et al., 9 Apr 2024).

3. Platform Diversity and Augmentation Protocol

All GUI screenshots are synthetically modified to reflect configuration variability across platforms:

  • Operating System: Dataset entries reflect Windows, macOS, and Linux window controls via synthetic augmentation.
  • Browsers: Browser chrome (tab bars, window decorations) from Chrome, Firefox, and Safari is superimposed to simulate authentic browsing contexts.
  • Themes and Layouts: Both light and dark mode appearances are represented; images undergo random cropping, jittering, border additions, and other augmentations to mimic variation in display resolutions and user customizations.
  • Responsiveness: Samples include UI layout changes, screen size shifts, and element repositioning to model the heterogeneity of real deployments.

This systematic variation enables the dataset to benchmark models not just within but across platforms, supporting generalization evaluation (Chawla et al., 9 Apr 2024).
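
The augmentation pipeline itself is not released as code in this summary. The following is a minimal sketch, assuming Pillow, of the kinds of crop, jitter, and border perturbations described; it does not reproduce the paper's exact pipeline or its OS/browser chrome compositing.

```python
import random

from PIL import Image, ImageEnhance, ImageOps


def augment_screenshot(img: Image.Image) -> Image.Image:
    """Apply crop/jitter/border perturbations of the kind described above.

    Illustrative approximation only, not the GUIDE augmentation pipeline.
    """
    w, h = img.size

    # Random crop keeping 90-100% of each dimension, mimicking resolution shifts.
    cw, ch = int(w * random.uniform(0.9, 1.0)), int(h * random.uniform(0.9, 1.0))
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))

    # Brightness/contrast jitter to approximate theme and display variation.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.85, 1.15))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.85, 1.15))

    # Add a border, standing in for window decorations or letterboxing.
    img = ImageOps.expand(img, border=random.randint(0, 12), fill="black")

    return img
```

Note that any geometric perturbation (cropping, borders, resizing) must be paired with a corresponding transformation of the normalized grounding coordinates so that bounding boxes remain aligned with their UI elements.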

4. Benchmarks and Baseline Evaluations

GUIDE is the basis for multiple core RPA and MLLM tasks:

  • Next Action Prediction: Given a screenshot and current context, models predict the textual label of the next logical action.
  • Grounding: Predicting precise UI element bounding boxes for a chosen action.
  • Chain-of-Thought Reasoning: Generating or using stepwise rationales to aid action and grounding prediction.

The V-Zen baseline, a large vision-language model (LVLM) with a high-resolution visual encoder and a transformer-based text encoder, is trained to predict both the next action and the associated GUI element bounding box, using cross-attention between linguistic (task, CoT) and visual features. V-Zen achieves 93.2% task prediction accuracy and 89.7% grounding accuracy, compared against frontier models such as GPT-4V (94% task accuracy) and Gemini-Pro (92%). Ablation studies detail the contribution of individual components, including previous action history (+18.7% grounding accuracy), chain-of-thought annotations, and OS/browser metadata augmentation (+29% grounding accuracy) (Chawla et al., 9 Apr 2024).
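
The exact criterion behind the reported grounding accuracy is not stated in this summary. A common convention for GUI grounding is to count a prediction as correct when its IoU with the ground-truth box exceeds a threshold (or when its center falls inside the target box). The sketch below implements the IoU variant under that assumption.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two normalized boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0


def grounding_accuracy(preds, targets, threshold: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the target exceeds a threshold.

    Both the IoU criterion and the 0.5 threshold are assumptions; the paper
    may use a different matching rule (e.g., center-point containment).
    """
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)
```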

5. Data Access, Organization, and Formats

GUIDE is distributed as PNG screenshots and corresponding JSON annotation files; CSV exports are available for tabular analysis. The dataset is accessible via GitHub (https://github.com/superagi/GUIDE) and HuggingFace (https://huggingface.co/datasets/SuperAGI/GUIDE), facilitating integration into RPA and LLM research workflows. The bounding box format is explicitly given as:

\mathrm{bbox} = (x_{1},\, y_{1},\, x_{2},\, y_{2}), \qquad 0 \le x_{1} < x_{2} \le 1,\quad 0 \le y_{1} < y_{2} \le 1

Data organization is by application and scenario, with samples encompassing annotated screenshots, context history, text actions, and normalized box coordinates (Chawla et al., 9 Apr 2024).
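
A minimal loading sketch for the distributed format (PNG screenshots plus JSON annotations) is given below. The directory layout and relative paths are assumptions; only the file formats and per-application organization are specified above, so paths should be adjusted to the actual GitHub/HuggingFace release.

```python
import json
from pathlib import Path

from PIL import Image  # pip install pillow


def iter_guide_samples(root: str):
    """Yield (screenshot, annotation) pairs from a local copy of GUIDE.

    Assumes one JSON annotation file per sample stored alongside the
    referenced PNG screenshots; adjust to the release's actual layout.
    """
    root_path = Path(root)
    for ann_file in sorted(root_path.rglob("*.json")):
        with open(ann_file, "r", encoding="utf-8") as f:
            ann = json.load(f)
        image = Image.open(root_path / ann["image_path"]).convert("RGB")
        yield image, ann


# Example usage (hypothetical local path):
# for image, ann in iter_guide_samples("data/GUIDE"):
#     print(ann["task_description"], ann["next_action"], ann["grounding"])
```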

6. Potential Applications, Coverage Limits, and Future Extensions

GUIDE enables multiple research and engineering efforts:

  • Benchmarking cross-interface, cross-platform RPA models (browser, OS, theme variability).
  • Training and evaluating multimodal LLMs for GUI understanding, reasoning, and planning.
  • Tool-augmented LLM workflows that chain natural language understanding, grounding, and automation primitives.

Limitations include:

  • Coverage of only four web applications, excluding desktop software and verticals such as finance or healthcare.
  • Annotation bias from human-labeled bounding boxes and subjective CoT rationales.
  • Dataset snapshots may lag behind evolving live UIs (“interface drift”), and not all keyboard or contextual inputs are represented.

Identified future directions include expansion to broader application domains, error-handling scenarios and user-triggered exceptions, automated dataset versioning, reporting of inter-annotator agreement statistics, and multi-step end-to-end planning tasks (Chawla et al., 9 Apr 2024).

7. Context within the Broader RPA and MLLM Ecosystem

The GUIDE dataset furnishes a pivotal resource for developing agents with generalized, multi-platform GUI interaction and reasoning skills. Its multimodal annotation schema and stringent platform augmentations set a new standard for RPA agent evaluation and contribute to aligning LLM research with practically grounded automation tasks. While limited in software diversity, GUIDE's detailed supervision and open accessibility facilitate both methodological advances in multimodal vision-language modeling and industrial automation research (Chawla et al., 9 Apr 2024).
