ShowUI-Aloha: Autonomous GUI Automation
- ShowUI-Aloha is a comprehensive pipeline that transforms natural desktop screen recordings into semantically annotated teaching trajectories.
- It utilizes advanced vision–language models and rule-based processing to convert raw video and OS event logs into structured, actionable data.
- The system demonstrates robust performance with closed-loop planning and OS-level execution, achieving a 60.1% overall success rate across diverse desktop applications on OSWorld tasks.
ShowUI-Aloha is a comprehensive pipeline for building autonomous agents that automate complex desktop GUI workflows by learning directly from unstructured, in-the-wild human screen recordings. Targeting the persistent bottleneck of scalable, high-quality training data for GUI automation, ShowUI-Aloha converts raw, unannotated video and OS event logs into structured, semantically meaningful teaching trajectories, which are then leveraged for robust and generalizable agent execution. The architecture integrates vision–language models (VLMs) for semantic interpretation, an LLM-based planner for contextualized task decomposition, and a low-level OS executor, establishing a closed-loop system from demonstration to autonomous action (Zhang et al., 12 Jan 2026).
1. Motivation and Problem Space
Graphical User Interfaces (GUIs) dominate human–computer interaction in productivity and knowledge work, yet automation of complex GUI tasks has remained challenging due to the absence of scalable, real-world data capturing true user behavior. Existing automation systems are constrained either by template- and rule-based designs or by vision–language–action agents trained on synthetic datasets, which are prone to domain overfitting and brittle generalization. Zero-shot LLMs have not reliably mastered multi-step software logic because high-fidelity, semantically annotated, temporally structured demonstrations have been lacking.
ShowUI-Aloha seeks to bridge the data gulf between demand for massive, realistic, task-diverse GUI teaching data and what current scripting or curated datasets can provide. The system is premised on the insight that unstructured human demonstrations, once precisely recorded and semantically decomposed, are a critical resource for learning robust software automation policies. The ultimate goal is to enable agents that can learn general GUI task solutions simply by observing humans in naturalistic settings [(Zhang et al., 12 Jan 2026), Sec. 1].
2. System Architecture and Pipeline
The ShowUI-Aloha framework encompasses a four-stage pipeline: Recorder, Learner, Planner (Aloha Actor front-end), and Executor (Aloha Actor back-end) [(Zhang et al., 12 Jan 2026), Sec. 3].
| Module | Core Input/Output | Functional Role |
|---|---|---|
| Recorder | Desktop frames, OS event stream | Captures raw user interaction traces and video |
| Learner | Frames + event logs | Translates raw data to semantic task trajectories |
| Planner | JSON traces, screenshots | Plans next UI action given context and history |
| Executor | Structured commands | Executes GUI actions at OS level with feedback |
2.1 Recorder
The Recorder captures full-HD (1920×1080, 30 FPS) video of the desktop, synchronized with a millisecond-resolved log of primitive OS events (mouse, keyboard, scrolls), using FFmpeg and a KeyCastOW-derived logger. The interface allows for session tagging and batch capture to support large-scale, real-world data collection [(Zhang et al., 12 Jan 2026), Sec. 3.1].
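As a concrete illustration, a minimal recorder can pair an FFmpeg screen capture with a timestamped OS event listener. This is only a sketch, not the released tool: pynput stands in for the paper's KeyCastOW-derived logger, and the FFmpeg flags assume a Windows desktop captured at 1920×1080 and 30 FPS.

```python
# Minimal recorder sketch: ffmpeg screen capture plus a timestamped OS event log.
# pynput stands in for the paper's KeyCastOW-derived logger; paths and flags are illustrative.
import json, subprocess, time
from pynput import mouse, keyboard

def start_screen_capture(out_path="session.mp4"):
    # Windows example: capture the desktop at 1920x1080, 30 FPS via gdigrab.
    cmd = ["ffmpeg", "-y", "-f", "gdigrab", "-framerate", "30",
           "-video_size", "1920x1080", "-i", "desktop", out_path]
    return subprocess.Popen(cmd)

events = []
t0 = time.time()

def log(kind, **payload):
    # Millisecond-resolved timestamps, synchronized to the video start.
    events.append({"t_ms": int((time.time() - t0) * 1000), "kind": kind, **payload})

def on_click(x, y, button, pressed):
    log("mouse", x=x, y=y, button=str(button), pressed=pressed)

def on_key(key):
    log("key", key=str(key))

if __name__ == "__main__":
    video = start_screen_capture()
    with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_key):
        input("Recording... press Enter to stop.\n")
    video.terminate()
    with open("events.json", "w") as f:
        json.dump(events, f, indent=2)
```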
2.2 Learner
The Learner processes the raw event log into a minimal sequence of primitives via rule-based consolidation (a sketch of these rules follows the list):
- Merges mouse sequences into discrete drags,
- Groups keystrokes,
- Normalizes scrolls,
- Removes redundant actions (e.g., double-click de-duplication).
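A minimal sketch of this consolidation, assuming the event schema from the recorder sketch above; the thresholds and merging rules are illustrative, not the paper's exact logic.

```python
# Illustrative rule-based consolidation of a raw OS event log into discrete primitives.
# Thresholds and field names are assumptions, not the paper's exact rules.
DOUBLE_CLICK_MS = 400

def consolidate(events):
    actions, i = [], 0
    while i < len(events):
        e = events[i]
        if e["kind"] == "mouse" and e["pressed"]:
            # Find the matching release; a displaced release becomes a drag, else a click.
            j = next((k for k in range(i + 1, len(events))
                      if events[k]["kind"] == "mouse" and not events[k]["pressed"]), i)
            r = events[j]
            if (r["x"], r["y"]) != (e["x"], e["y"]):
                actions.append({"type": "drag", "from": (e["x"], e["y"]), "to": (r["x"], r["y"])})
            elif (actions and actions[-1].get("type") == "click"
                  and e["t_ms"] - actions[-1]["t_ms"] <= DOUBLE_CLICK_MS):
                actions[-1]["type"] = "double_click"   # de-duplicate the second click
            else:
                actions.append({"type": "click", "at": (e["x"], e["y"]), "t_ms": e["t_ms"]})
            i = j + 1
        elif e["kind"] == "key":
            # Group consecutive keystrokes into a single type action.
            text = []
            while i < len(events) and events[i]["kind"] == "key":
                text.append(events[i]["key"]); i += 1
            actions.append({"type": "type", "text": "".join(text)})
        else:
            i += 1
    return actions
```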
For each cleaned action, the Learner generates a pair of images (full screen and context crop) and overlays a visual marker encoding the action intent. Using an off-the-shelf vision–language model (e.g., GPT-4o) with prompt-driven few-shot examples, the system produces for each step a structured JSON object with fields for Observation, Think, Action, and Expectation. This yields the core semantically annotated "teaching trajectory" [(Zhang et al., 12 Jan 2026), Sec. 3.2].
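The per-step annotation can be approximated with a single vision–language model call per cleaned action. The sketch below assumes the OpenAI Python client and an illustrative prompt; the paper's actual few-shot prompts and marker rendering are not reproduced here.

```python
# Sketch of prompt-driven step annotation: a marked screenshot pair goes to a
# vision-LLM, which returns one Observation/Think/Action/Expectation JSON step.
# The prompt wording and helper names are illustrative assumptions.
import base64, json
from openai import OpenAI

client = OpenAI()

def encode(path):
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def annotate_step(full_png, crop_png, action):
    prompt = (
        "You are labeling one step of a desktop GUI demonstration. The red marker "
        f"shows the user action: {json.dumps(action)}. Return a JSON object with the "
        "fields Observation, Think, Action, and Expectation."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode(full_png)}},
                {"type": "image_url", "image_url": {"url": encode(crop_png)}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)
```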
2.3 Aloha Actor: Planner and Executor
The planner ingests a live screenshot, the semantic guidance trace, the task prompt, and the execution history, using an LLM to generate the next action plan; the architecture maintains memory to avoid plan drift. The executor maps action commands to absolute screen coordinates, dispatches them to the OS via platform-independent wrappers (such as the OpenAI Computer-Use API), and verifies state post-action, with fallback logic for ambiguous visual conditions. Safety checks and real-time overlays are employed throughout [(Zhang et al., 12 Jan 2026), Sec. 3.3–3.4].
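A simplified executor might look like the following sketch, with pyautogui standing in for the paper's platform-independent wrappers (e.g., the Computer-Use API) and verify_state as a placeholder for the screenshot-based post-action check.

```python
# Executor-side sketch: dispatch one parsed action to the OS, then verify the result.
# pyautogui and verify_state are stand-ins, not the paper's implementation.
import time
import pyautogui

def execute(action):
    kind = action["type"]
    if kind == "click":
        pyautogui.click(*action["at"])
    elif kind == "double_click":
        pyautogui.doubleClick(*action["at"])
    elif kind == "drag":
        pyautogui.moveTo(*action["from"])
        pyautogui.dragTo(*action["to"], duration=0.5)
    elif kind == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif kind == "hotkey":
        pyautogui.hotkey(*action["keys"])
    elif kind == "scroll":
        pyautogui.scroll(action["amount"])
    elif kind == "wait":
        time.sleep(action.get("seconds", 1.0))

def execute_with_retry(action, verify_state, retries=2):
    # Re-issue the action when the observed post-action state does not match expectation.
    for _ in range(retries + 1):
        execute(action)
        if verify_state(action):   # compare the new screenshot against the step's Expectation
            return True
    return False
```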
3. Data Acquisition and Semantic Processing
The system's data layer is characterized by its ability to ingest, without manual annotation, large quantities of natural desktop interaction. Recordings deliver synchronized screen video and granular OS event logs. The subsequent semantic transformation employs a three-stage pipeline: action cleaning, screenshot marking, and trace generation. Labels and captions for each action are auto-generated using prompt engineering and vision–language models, without domain-specific pre-training or fine-tuning.
Each step in the teaching trajectory is formatted as:
```json
{
  "Observation": ...,
  "Think": ...,
  "Action": ...,
  "Expectation": ...
}
```
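For illustration, each step can be loaded into a small typed container; the field names follow the format above, while the types and the loader are assumptions.

```python
# Illustrative container for one teaching-trajectory step; field names follow the
# JSON format above, everything else (types, loader) is an assumption.
import json
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    Observation: str   # what is visible on screen before the action
    Think: str         # inferred user intent for this step
    Action: str        # the primitive to perform (click, type, drag, ...)
    Expectation: str   # the screen state expected after the action

def load_trajectory(path):
    with open(path) as f:
        return [TrajectoryStep(**step) for step in json.load(f)]
```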
4. Planning, Execution, and State Management
The planner maintains a structured state tuple (current screenshot, guidance trace, task prompt, execution history) and selects actions from an atomic set: click, double_click, drag, scroll, type, hotkey, wait. The closed-loop execution cycle iterates through plan, parse, execute, verify, and memory-update stages, minimizing deviation from the demonstration trajectory under real-world context shifts.
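A skeleton of this cycle, with the planner, parser, executor, and verifier passed in as callables (all placeholders, not the paper's implementation):

```python
# Skeleton of the plan -> parse -> execute -> verify -> memory-update cycle described
# above. plan_next, parse_plan, execute, and verify are assumed callables standing in
# for the LLM planner, command parser, OS executor, and screenshot-based state check.
def run_episode(task_prompt, guidance_trace, capture_screen, plan_next, parse_plan,
                execute, verify, max_steps=50):
    memory = []                                   # execution history fed back to the planner
    for _ in range(max_steps):
        screenshot = capture_screen()             # current observation
        state = (screenshot, guidance_trace, task_prompt, memory)
        plan = plan_next(state)                   # LLM proposes the next step
        action = parse_plan(plan)                 # map the plan to a structured atomic command
        if action is None or action.get("type") == "done":
            break                                 # planner signals task completion
        execute(action)                           # dispatch the action at the OS level
        ok = verify(action, capture_screen())     # check the post-action state against Expectation
        memory.append({"action": action, "verified": ok})
        if not ok:
            memory.append({"event": "fallback"})  # surface the mismatch so the planner can retry
    return memory
```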
Planning is governed by a drift-minimization objective: at each step, the planner selects the atomic action that keeps the executed trajectory as close as possible to the demonstration trace given the current state (an illustrative formalization follows below). A reward is referenced but not explicitly calculated; closed-loop verification stands in for a learned evaluator. Actions are mapped to OS-level APIs with adjustments for multi-monitor setups, keyboard modifiers, and required OS timings. Fallback mechanisms and retries are triggered when the expected and actual post-action states do not match [(Zhang et al., 12 Jan 2026), Sec. 3.3–4.3].
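One way to write this objective (a sketch; the notation and the deviation measure $D$ are our assumptions, not the paper's):

$$
a_t \;=\; \arg\min_{a \,\in\, \mathcal{A}} \; D\!\left(\tau_{1:t-1} \oplus a,\; \tau^{\mathrm{demo}}\right),
\qquad
\mathcal{A} = \{\texttt{click}, \texttt{double\_click}, \texttt{drag}, \texttt{scroll}, \texttt{type}, \texttt{hotkey}, \texttt{wait}\}
$$

where $\tau_{1:t-1}$ is the trajectory executed so far, $\oplus$ appends the candidate action, $\tau^{\mathrm{demo}}$ is the demonstration trace, and $D$ measures the deviation between the two.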
5. Experimental Protocols and Empirical Results
ShowUI-Aloha was evaluated on 361 real-world OSWorld desktop tasks executed on native Windows and macOS environments, with up to 50 steps per run and demonstrations supplied by human testers. Demonstrations were intentionally varied in entity names to ensure generalization rather than rote replay. The primary metric was strict binary success: the final state must strictly match the goal, with no partial credit [(Zhang et al., 12 Jan 2026), Sec. 4.1].
Key quantitative results:
| Category | Success Rate (%) |
|---|---|
| Chrome | 91.3 |
| OS operations | 83.3 |
| Thunderbird | 80.0 |
| VS Code | 73.9 |
| Writer | 69.6 |
| GIMP | 65.4 |
| VLC | 64.7 |
| Calc | 57.4 |
| Impress | 42.6 |
| Multi-apps | 37.6 |
| Overall | 60.1 |
Against baselines, ShowUI-Aloha (60.1%) outperforms CoAct-1 (56.4%), Agent S2.5 (54.2%), and Jedi-7B (50.6%). Ablations showed that removing the demonstration trace reduced success to 36.7% (Step-Norm 0.56) and that disabling planner memory reduced it to 50.0% (Step-Norm 0.68), underscoring that both memory and trajectory guidance are critical [(Zhang et al., 12 Jan 2026), Sec. 4.4].
Dominant failure modes were attributed to element localization (53.5%), problems with text/field editing (16.0%), misaligned actions (14.6%), trajectory stalls (8.3%), and other causes (7.6%) [(Zhang et al., 12 Jan 2026), Fig. 16].
6. Analysis: Strengths, Limitations, and Prospects
The primary strengths of ShowUI-Aloha include scalable, annotation-free acquisition of human GUI workflows, automated semantic abstraction enabling generalization across UI drift, and a robust closed loop of planning, verification, and recovery. The end-to-end system is fully open-source and supports large-scale deployment [(Zhang et al., 12 Jan 2026), Sec. 6.1].
Limitations persist, notably in fine-grained icon disambiguation within dense toolbars, susceptibility to error in drag-based text selection, and the current inability to generalize in a zero-demo regime—each workflow requires at least one demonstration [(Zhang et al., 12 Jan 2026), Sec. 6.2].
Future directions described include improved icon semantic grounding via combined OCR and icon classifiers, more robust sub-skill models for drag and text-editing, transition toward few-shot or demonstration-free execution via compact vision–language–action models, and scaling to real-time human–AI co-authoring modes. These directions point towards increasing autonomy, efficiency, and adaptability of GUI agents [(Zhang et al., 12 Jan 2026), Sec. 6.3].
7. Position Within HCI and Autonomous Agents Research
ShowUI-Aloha occupies a unique position at the intersection of human–computer interaction, demonstration learning, and autonomous software agents. Unlike earlier approaches based on zero-shot LLMs or multimodal models trained on synthetic, curated data, ShowUI-Aloha establishes that properly processed, uncurated real-user data enables more reliable and generalizable automation. A plausible implication is that closed-loop human demonstration, grounded by semantically rich trajectories and memory-equipped planning, sets a new standard for GUI task automation pipelines [(Zhang et al., 12 Jan 2026), Sec. 1.2].