ShowUI-π: Flow-Based GUI Automation
- ShowUI-π is a flow-based generative model framework that unifies discrete clicks with continuous drags for enhanced GUI manipulation.
- It employs an ODE-based action expert to generate smooth, real-time cursor trajectories through a unified continuous action space.
- Empirical evaluations on the ScreenDrag benchmark show significant improvements in task success rate and endpoint accuracy compared to larger baselines.
ShowUI-π is a flow-based generative model framework conceptualized as a GUI dexterous hand. It addresses the challenge of enabling intelligent agents to perform dexterous manipulation in digital environments by unifying discrete and continuous interaction modalities. Unlike previous systems that model GUI manipulation as sequences of discrete (x, y) click predictions and consequently lack the capacity for free-form, closed-loop trajectories such as real-time dragging, ShowUI-π introduces a unified action space and a novel flow-based action generation approach. The system outperforms larger language-model-driven agents in both offline and online evaluation protocols on the newly introduced ScreenDrag benchmark, demonstrating efficient and flexible adaptation across a diverse set of GUI manipulation tasks (Hu et al., 31 Dec 2025).
1. Unified Discrete–Continuous Action Space
ShowUI-π models both GUI clicks and drags within a single, continuous action representation. Each action sequence consists of atomic actions specifying the cursor coordinates together with a mouse-state flag that encodes whether the button is pressed. A click is represented as a two-step sequence (a press followed by a release at the same point), while a drag is captured as a longer sequence of press-hold waypoints terminated by a release, reflecting the entire press-hold trajectory. This unified formulation enables a single neural head to model both modalities, eliminating the need for discrete tool tokenization or modality-specific architectures.
| Modality | Atomic Action Sequence Representation | Specialization Required |
|---|---|---|
| Click | Press at the target point, then release at the same point | None |
| Drag | Press at the start point, hold through dense intermediate waypoints, release at the endpoint | None |
This approach provides uniform modeling benefits, improves architectural efficiency, and supports closed-loop perception and adaptation during execution (Hu et al., 31 Dec 2025).
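As a concrete illustration of this representation, the minimal Python sketch below encodes clicks and drags as sequences of (x, y, mouse-state) triples. The class and field names, and the boolean press/release convention, are illustrative assumptions rather than the paper's exact schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtomicAction:
    x: float        # cursor x-coordinate (pixels or normalized)
    y: float        # cursor y-coordinate
    pressed: bool   # mouse-button state: True = held down, False = released

def encode_click(x: float, y: float) -> List[AtomicAction]:
    """A click is a two-step sequence: press and release at the same point."""
    return [AtomicAction(x, y, True), AtomicAction(x, y, False)]

def encode_drag(waypoints: List[Tuple[float, float]]) -> List[AtomicAction]:
    """A drag is a press-hold trajectory over dense waypoints, ending with a release."""
    traj = [AtomicAction(px, py, True) for (px, py) in waypoints]
    traj.append(AtomicAction(*waypoints[-1], False))
    return traj
```

Because both modalities reduce to the same (x, y, state) triple, a single action head can be supervised on clicks and drags without modality-specific branches.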
2. Flow-based Action Generation via ODE Integration
The primary architectural innovation in ShowUI-π is the flow-based action expert, a lightweight module that predicts smooth cursor trajectories through an ODE-based velocity field. Instead of tokenizing actions, the system directly integrates a time-indexed velocity field $v_\theta$:

$$\frac{\mathrm{d}a_\tau}{\mathrm{d}\tau} = v_\theta(a_\tau, \tau \mid o, \ell),$$

where $o$ is the visual observation (screenshot) and $\ell$ is the language instruction. At inference, the agent encodes $(o, \ell)$ with a VLM backbone to obtain conditioning features, then integrates the ODE in chunks of waypoints (e.g., Euler integration) by iteratively predicting and executing trajectory points. Each chunk is re-observed after execution for closed-loop adjustment. This enables real-time, environmentally responsive control over both discrete clicks and continuous drags.
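The following sketch shows how such an ODE can be integrated with Euler steps inside a chunked, closed-loop control loop. The callables `velocity_field`, `vlm_encode`, `observe`, and `execute` are hypothetical placeholders standing in for the trained action expert, the VLM encoder, and the GUI interface; chunk size and step count are arbitrary.

```python
import numpy as np

def integrate_chunk(velocity_field, cond, chunk_size=4, dim=3, num_steps=8):
    """Euler-integrate the learned ODE  d a / d tau = v_theta(a, tau | o, l)
    from tau = 0 (noise) to tau = 1, yielding a chunk of (x, y, state) waypoints."""
    a = np.random.randn(chunk_size, dim).astype(np.float32)  # noise initialization
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        a = a + dt * velocity_field(a, tau, cond)             # one Euler step
    return a

def closed_loop_control(observe, vlm_encode, velocity_field, execute,
                        instruction, max_chunks=32):
    """Chunked closed-loop execution: integrate a waypoint chunk, execute it,
    then re-observe the screen before predicting the next chunk."""
    for _ in range(max_chunks):
        screenshot = observe()                      # current GUI frame
        cond = vlm_encode(screenshot, instruction)  # VLM conditioning features
        for waypoint in integrate_chunk(velocity_field, cond):
            execute(waypoint)                       # move cursor / set button state
```

The key design choice is that re-observation happens between chunks rather than after the whole trajectory, which is what makes the control loop closed rather than open.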
3. Training Objectives: Flow-Matching and Directional Regularization
Training the action expert involves supervision via flow-matching to the ground-truth velocity field, supplemented by directional regularization. Let $u$ denote the ground-truth velocities and $a_\tau$ the model's intermediate states. The losses are:
- Weighted Flow-Matching Loss:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{\tau}\big[\, w(\tau)\, \| v_\theta(a_\tau, \tau \mid o, \ell) - u \|_2^2 \,\big],$$
with $w(\tau) = \lambda$ for waypoints near the trajectory endpoint and $w(\tau) = 1$ otherwise.
- Directional Regularization:
$$\mathcal{L}_{\mathrm{dir}} = \mathbb{E}_{\tau}\big[\, 1 - \cos\angle\big(v_\theta(a_\tau, \tau \mid o, \ell),\, u\big) \,\big],$$
which penalizes angular deviation between predicted and ground-truth velocities.
- Total Loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \beta\, \mathcal{L}_{\mathrm{dir}}.$$
Unlike density-based normalizing flows, ShowUI-π's framework does not require log-determinant computation or bijective mappings; only a deterministic velocity field is integrated (Hu et al., 31 Dec 2025).
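A minimal PyTorch-style sketch of the objective, under the reconstruction above: the endpoint-weighting scheme, the cosine-based directional term, and the weight values `lam` and `beta` are assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_pred, v_true, endpoint_mask, lam=2.0, beta=1.0):
    """Weighted flow matching plus directional regularization.

    v_pred, v_true: (B, K, D) predicted / ground-truth velocities
    endpoint_mask:  (B, K) boolean, True for waypoints near the trajectory endpoint
    lam, beta:      assumed weighting hyperparameters
    """
    w = 1.0 + (lam - 1.0) * endpoint_mask.float()            # per-waypoint weight w(tau)
    l_fm = (w.unsqueeze(-1) * (v_pred - v_true).pow(2)).mean()       # weighted L2 flow matching
    l_dir = (1.0 - F.cosine_similarity(v_pred, v_true, dim=-1)).mean()  # angular alignment term
    return l_fm + beta * l_dir
```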
4. Drag Demonstration Data and ScreenDrag Benchmark
A cornerstone of ShowUI-π is a large-scale dataset and a rigorous benchmark for drag-based GUI manipulation:
- Data: 20,000 drag trajectories across 11 task categories in five domains: PowerPoint (rotate/resize), OS Desktop/File-Manager (file sorting), handwriting canvas, Adobe Premiere Pro (clip arrangement and effects), and Captcha solving (slider/puzzle/rotate).
- Element Parsing: UIA APIs for bounding boxes and metadata.
- Task Proposal: Instructions and metadata changes generated by Qwen-2.5-72B.
- Trajectory Synthesis: PyAutoGUI executes dense waypoint traces (see the sketch after this list).
- Human Demos: Screen recording plus raw mouse capture for domains lacking metadata access.
- Dataset Statistics: 20,000 training trajectories, 505 evaluation tasks, average duration 9.62s, 577 frames per task.
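As referenced in the Trajectory Synthesis item, a dense drag trace can be replayed with PyAutoGUI roughly as follows; the waypoint format and the timing parameter are assumptions, and the actual synthesis pipeline may differ.

```python
import pyautogui  # mouseDown / moveTo / mouseUp are PyAutoGUI's standard mouse API

def execute_drag(waypoints, step_duration=0.01):
    """Replay a dense drag trace: press at the first waypoint, sweep through the
    intermediate points, and release at the last one. `waypoints` is assumed to be
    a list of (x, y) screen coordinates."""
    x0, y0 = waypoints[0]
    pyautogui.moveTo(x0, y0)
    pyautogui.mouseDown()
    for x, y in waypoints[1:]:
        pyautogui.moveTo(x, y, duration=step_duration)  # smooth, dense movement
    pyautogui.mouseUp()
```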
ScreenDrag, the accompanying benchmark, provides offline (open-loop) and online (closed-loop) evaluation. Offline protocols use pre-recorded initial states and oracle trajectories, measuring Average Trajectory Error (ATE) and Trajectory Endpoint Accuracy (TEA). Online protocols use a data-driven emulator, tracking Task Success Rate (TSR) by matching predicted points with ground-truth and feeding back subsequent screenshots (Hu et al., 31 Dec 2025).
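To make the offline metrics concrete, the sketch below computes plausible versions of ATE and TEA, assuming ATE is the mean pointwise distance after resampling the predicted trajectory to the oracle's length and TEA is the fraction of endpoints falling within a pixel threshold; both the alignment scheme and the threshold value are assumptions, not the benchmark's exact definitions.

```python
import numpy as np

def average_trajectory_error(pred, oracle):
    """Mean Euclidean distance between corresponding waypoints, after resampling
    the predicted trajectory (N, 2) to the oracle's length (M, 2)."""
    idx = np.linspace(0, len(pred) - 1, num=len(oracle)).round().astype(int)
    return float(np.linalg.norm(pred[idx] - oracle, axis=-1).mean())

def trajectory_endpoint_accuracy(preds, oracles, threshold_px=50.0):
    """Fraction of trajectories whose final point lies within threshold_px of the
    oracle endpoint (threshold chosen for illustration)."""
    hits = [np.linalg.norm(p[-1] - o[-1]) <= threshold_px for p, o in zip(preds, oracles)]
    return float(np.mean(hits))
```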
5. Empirical Results and Ablation Analysis
ShowUI-π outperforms both proprietary and open-source baselines. With only 450M parameters, it surpasses models exceeding 7B parameters:
| Method | Online TSR (%) | Params | Offline TEA (%) | Offline ATE (px) |
|---|---|---|---|---|
| Operator (OpenAI) | 13.27 | – | – | – |
| Seed-1.6-Vision | 19.01 | – | – | – |
| Gemini-2.5-CUA | 22.18 | – | 20.00 | 189.15 |
| OpenCUA-7B | 21.98 | 7B | – | – |
| Qwen3-VL-8B | 7.52 | 8B | – | – |
| ShowUI-π-450M | 26.98 | 0.45B | 78.55 | 159.05 |
| Diffusion policy | – | – | 47.33 | 267.92 |
Ablation results establish the architectural efficacy:
- Separate click/drag heads increase parameters to 550M and reduce TSR to 23.25%.
- A small chunk size with one-step execution and re-observation delivers the best endpoint accuracy (78.55%).
- Temporal weighting $w(\tau)$ in the flow-matching loss increases TSR from 10.49% to 26.98%.
- Directional regularization ($\mathcal{L}_{\mathrm{dir}}$) boosts overall TSR from 12.63% to 26.98%, with the largest gains in the Captcha domains (Hu et al., 31 Dec 2025).
6. Significance and Prospects
ShowUI-π reconceptualizes GUI automation by treating dragging as a real-time, closed-loop continuous control problem rather than a sequence of discrete actions. This allows more human-like dexterous control in digital interfaces. By leveraging a unified and compact flow-matching action head and training with dense multi-domain demonstrations, the system succeeds on evaluation settings where state-of-the-art LLM agents struggle. A plausible implication is that flow-based, closed-loop paradigms could generalize effectively to robotics and broader human-computer interaction domains, contingent on dataset diversity and robust observation-action coupling (Hu et al., 31 Dec 2025).