ShowUI-π: Flow-Based GUI Automation
- ShowUI-π is a flow-based generative model framework that unifies discrete clicks with continuous drags for enhanced GUI manipulation.
- It employs an ODE-based action expert to generate smooth, real-time cursor trajectories through a unified continuous action space.
- Empirical evaluations on the ScreenDrag benchmark show significant improvements in task success rate and endpoint accuracy compared to larger baselines.
ShowUI-π is a flow-based generative model framework conceptualized as a GUI dexterous hand. It addresses the challenge of enabling intelligent agents to perform dexterous manipulation in digital environments by unifying discrete and continuous interaction modalities. Unlike previous systems that model GUI manipulation as sequences of discrete (x, y) click predictions and consequently lack the capacity for free-form, closed-loop trajectories such as real-time dragging, ShowUI-π introduces a unified action space and a novel flow-based action generation approach. The system outperforms larger language-model-driven agents in both offline and online evaluation protocols on the newly introduced ScreenDrag benchmark, demonstrating efficient and flexible adaptation across a diverse set of GUI manipulation tasks (Hu et al., 31 Dec 2025).
1. Unified Discrete–Continuous Action Space
ShowUI-π models both GUI clicks and drags within a single, continuous action representation. Each action sequence consists of atomic actions specifying the cursor coordinates together with a mouse-state flag that encodes whether the button is pressed. A click is represented as a two-step sequence (a press followed by a release at the same point), while a drag is captured as a longer sequence of press-hold waypoints terminated by a release, reflecting the entire press-hold trajectory. This unified formulation enables a single neural head to model both modalities, eliminating the need for discrete tool tokenization or modality-specific architectures.
| Modality | Atomic Action Sequence Representation | Specialization Required |
|---|---|---|
| Click | Press at the target point, then release at the same point | None |
| Drag | Press at the start point, hold through dense intermediate waypoints, release at the endpoint | None |
This approach provides uniform modeling benefits, improves architectural efficiency, and supports closed-loop perception and adaptation during execution (Hu et al., 31 Dec 2025).
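As a concrete illustration of this representation, the minimal Python sketch below encodes clicks and drags as sequences of (x, y, mouse-state) triples. The class and field names, and the boolean press/release convention, are illustrative assumptions rather than the paper's exact schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtomicAction:
    x: float        # cursor x-coordinate (pixels or normalized)
    y: float        # cursor y-coordinate
    pressed: bool   # mouse-button state: True = held down, False = released

def encode_click(x: float, y: float) -> List[AtomicAction]:
    """A click is a two-step sequence: press and release at the same point."""
    return [AtomicAction(x, y, True), AtomicAction(x, y, False)]

def encode_drag(waypoints: List[Tuple[float, float]]) -> List[AtomicAction]:
    """A drag is a press-hold trajectory over dense waypoints, ending with a release."""
    traj = [AtomicAction(px, py, True) for (px, py) in waypoints]
    traj.append(AtomicAction(*waypoints[-1], False))
    return traj
```

Because both modalities reduce to the same (x, y, state) triple, a single action head can be supervised on clicks and drags without modality-specific branches.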
2. Flow-based Action Generation via ODE Integration
The primary architectural innovation in ShowUI-π is the flow-based action expert, a lightweight module that predicts smooth cursor trajectories through an ODE-based velocity field. Instead of tokenizing actions, the system directly integrates a time-indexed velocity field $v_\theta$:

$$\frac{\mathrm{d}a_\tau}{\mathrm{d}\tau} = v_\theta(a_\tau, \tau \mid o, \ell),$$

where $o$ is the visual observation (screenshot) and $\ell$ is the language instruction. At inference, the agent encodes $(o, \ell)$ with a VLM backbone to obtain conditioning features, then integrates the ODE in chunks of waypoints (e.g., Euler integration) by iteratively predicting and executing trajectory points. Each chunk is re-observed after execution for closed-loop adjustment. This enables real-time, environmentally responsive control over both discrete clicks and continuous drags.
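The following sketch shows how such an ODE can be integrated with Euler steps inside a chunked, closed-loop control loop. The callables `velocity_field`, `vlm_encode`, `observe`, and `execute` are hypothetical placeholders standing in for the trained action expert, the VLM encoder, and the GUI interface; chunk size and step count are arbitrary.

```python
import numpy as np

def integrate_chunk(velocity_field, cond, chunk_size=4, dim=3, num_steps=8):
    """Euler-integrate the learned ODE  d a / d tau = v_theta(a, tau | o, l)
    from tau = 0 (noise) to tau = 1, yielding a chunk of (x, y, state) waypoints."""
    a = np.random.randn(chunk_size, dim).astype(np.float32)  # noise initialization
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        a = a + dt * velocity_field(a, tau, cond)             # one Euler step
    return a

def closed_loop_control(observe, vlm_encode, velocity_field, execute,
                        instruction, max_chunks=32):
    """Chunked closed-loop execution: integrate a waypoint chunk, execute it,
    then re-observe the screen before predicting the next chunk."""
    for _ in range(max_chunks):
        screenshot = observe()                      # current GUI frame
        cond = vlm_encode(screenshot, instruction)  # VLM conditioning features
        for waypoint in integrate_chunk(velocity_field, cond):
            execute(waypoint)                       # move cursor / set button state
```

The key design choice is that re-observation happens between chunks rather than after the whole trajectory, which is what makes the control loop closed rather than open.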
3. Training Objectives: Flow-Matching and Directional Regularization
Training the action expert involves supervision via flow-matching to the ground-truth velocity field, supplemented by directional regularization. Let $u$ denote the ground-truth velocities and $a_\tau$ the model's intermediate states. The losses are:
- Weighted Flow-Matching Loss:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{\tau}\big[\, w(\tau)\, \| v_\theta(a_\tau, \tau \mid o, \ell) - u \|_2^2 \,\big],$$
with $w(\tau) = \lambda$ for waypoints near the trajectory endpoint and $w(\tau) = 1$ otherwise.
- Directional Regularization:
$$\mathcal{L}_{\mathrm{dir}} = \mathbb{E}_{\tau}\big[\, 1 - \cos\angle\big(v_\theta(a_\tau, \tau \mid o, \ell),\, u\big) \,\big],$$
which penalizes angular deviation between predicted and ground-truth velocities.
- Total Loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \beta\, \mathcal{L}_{\mathrm{dir}}.$$
Unlike density-based normalizing flows, ShowUI-π's framework does not require log-determinant computation or bijective mappings; only a deterministic velocity field is integrated (Hu et al., 31 Dec 2025).
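A minimal PyTorch-style sketch of the objective, under the reconstruction above: the endpoint-weighting scheme, the cosine-based directional term, and the weight values `lam` and `beta` are assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_pred, v_true, endpoint_mask, lam=2.0, beta=1.0):
    """Weighted flow matching plus directional regularization.

    v_pred, v_true: (B, K, D) predicted / ground-truth velocities
    endpoint_mask:  (B, K) boolean, True for waypoints near the trajectory endpoint
    lam, beta:      assumed weighting hyperparameters
    """
    w = 1.0 + (lam - 1.0) * endpoint_mask.float()            # per-waypoint weight w(tau)
    l_fm = (w.unsqueeze(-1) * (v_pred - v_true).pow(2)).mean()       # weighted L2 flow matching
    l_dir = (1.0 - F.cosine_similarity(v_pred, v_true, dim=-1)).mean()  # angular alignment term
    return l_fm + beta * l_dir
```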
4. Drag Demonstration Data and ScreenDrag Benchmark
A cornerstone of ShowUI-π is a large-scale dataset and a rigorous benchmark for drag-based GUI manipulation:
- Data: 20,000 drag trajectories across 11 task categories in five domains: PowerPoint (rotate/resize), OS Desktop/File-Manager (file sorting), handwriting canvas, Adobe Premiere Pro (clip arrangement and effects), and Captcha solving (slider/puzzle/rotate).
- Element Parsing: UIA APIs for bounding boxes and metadata.
- Task Proposal: Instructions and metadata changes generated by Qwen-2.5-72B.
- Trajectory Synthesis: PyAutoGUI executes dense waypoint traces (see the sketch after this list).
- Human Demos: Screen recording plus raw mouse capture for domains lacking metadata access.
- Dataset Statistics: 20,000 training trajectories, 505 evaluation tasks, average duration 9.62s, 577 frames per task.
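As referenced in the Trajectory Synthesis item, a dense drag trace can be replayed with PyAutoGUI roughly as follows; the waypoint format and the timing parameter are assumptions, and the actual synthesis pipeline may differ.

```python
import pyautogui  # mouseDown / moveTo / mouseUp are PyAutoGUI's standard mouse API

def execute_drag(waypoints, step_duration=0.01):
    """Replay a dense drag trace: press at the first waypoint, sweep through the
    intermediate points, and release at the last one. `waypoints` is assumed to be
    a list of (x, y) screen coordinates."""
    x0, y0 = waypoints[0]
    pyautogui.moveTo(x0, y0)
    pyautogui.mouseDown()
    for x, y in waypoints[1:]:
        pyautogui.moveTo(x, y, duration=step_duration)  # smooth, dense movement
    pyautogui.mouseUp()
```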
ScreenDrag, the accompanying benchmark, provides offline (open-loop) and online (closed-loop) evaluation. Offline protocols use pre-recorded initial states and oracle trajectories, measuring Average Trajectory Error (ATE) and Trajectory Endpoint Accuracy (TEA). Online protocols use a data-driven emulator, tracking Task Success Rate (TSR) by matching predicted points with ground-truth and feeding back subsequent screenshots (Hu et al., 31 Dec 2025).
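To make the offline metrics concrete, the sketch below computes plausible versions of ATE and TEA, assuming ATE is the mean pointwise distance after resampling the predicted trajectory to the oracle's length and TEA is the fraction of endpoints falling within a pixel threshold; both the alignment scheme and the threshold value are assumptions, not the benchmark's exact definitions.

```python
import numpy as np

def average_trajectory_error(pred, oracle):
    """Mean Euclidean distance between corresponding waypoints, after resampling
    the predicted trajectory (N, 2) to the oracle's length (M, 2)."""
    idx = np.linspace(0, len(pred) - 1, num=len(oracle)).round().astype(int)
    return float(np.linalg.norm(pred[idx] - oracle, axis=-1).mean())

def trajectory_endpoint_accuracy(preds, oracles, threshold_px=50.0):
    """Fraction of trajectories whose final point lies within threshold_px of the
    oracle endpoint (threshold chosen for illustration)."""
    hits = [np.linalg.norm(p[-1] - o[-1]) <= threshold_px for p, o in zip(preds, oracles)]
    return float(np.mean(hits))
```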
5. Empirical Results and Ablation Analysis
ShowUI-π outperforms both proprietary and open-source baselines. With only 450M parameters, it surpasses models exceeding 7B parameters:
| Method | Online TSR (%) | Params | Offline TEA (%) | Offline ATE (px) |
|---|---|---|---|---|
| Operator (OpenAI) | 13.27 | – | – | – |
| Seed-1.6-Vision | 19.01 | – | – | – |
| Gemini-2.5-CUA | 22.18 | – | 20.00 | 189.15 |
| OpenCUA-7B | 21.98 | 7B | – | – |
| Qwen3-VL-8B | 7.52 | 8B | – | – |
| ShowUI-π-450M | 26.98 | 0.45B | 78.55 | 159.05 |
| Diffusion policy | – | – | 47.33 | 267.92 |
Ablation results establish the architectural efficacy:
- Separate click/drag heads increase parameters to 550M and reduce TSR to 23.25%.
- A small chunk size with one-step execution and re-observation delivers the best endpoint accuracy (78.55%).
- Temporal weighting $w(\tau)$ in the flow-matching loss increases TSR from 10.49% to 26.98%.
- Directional regularization ($\mathcal{L}_{\mathrm{dir}}$) boosts overall TSR from 12.63% to 26.98%, with the largest gains in the Captcha domains (Hu et al., 31 Dec 2025).
6. Significance and Prospects
ShowUI-π reconceptualizes GUI automation by treating dragging as a real-time, closed-loop continuous control problem rather than a sequence of discrete actions. This allows more human-like dexterous control in digital interfaces. By leveraging a unified and compact flow-matching action head and training with dense multi-domain demonstrations, the system succeeds on evaluation settings where state-of-the-art LLM agents struggle. A plausible implication is that flow-based, closed-loop paradigms could generalize effectively to robotics and broader human-computer interaction domains, contingent on dataset diversity and robust observation-action coupling (Hu et al., 31 Dec 2025).