UI-Venus: A Multimodal UI Agent
- UI-Venus is a screenshot-based UI agent that leverages a multimodal Qwen2.5-VL model to perform grounding and navigation tasks without relying on structured UI representations.
- It employs a tailored Group Relative Policy Optimization (GRPO) framework for reinforcement finetuning, achieving efficient learning and robust performance in sparse reward settings.
- The system integrates innovative strategies like self-evolving history alignment and sparse action enhancement to improve multi-step planning and rare action execution.
UI-Venus is a native user interface (UI) agent designed to operate end-to-end on screenshots, applying a multimodal LLM (MLLM, Qwen2.5-VL-based) to achieve state-of-the-art performance on UI grounding and navigation tasks using only high-quality visual observations. The system is distinguished by its use of reinforcement fine-tuning (RFT) within a Group Relative Policy Optimization (GRPO) framework, a rigorous data cleaning pipeline, and novel self-evolving strategies that align historical reasoning traces and enhance generalization for sparse but critical actions (Gu et al., 14 Aug 2025).
1. System Architecture and Agent Modules
UI-Venus comprises two principal modules: UI-Venus-Ground and UI-Venus-Navi. Both operate purely on screenshot input without access to structured UI representations or explicit accessibility trees.
- UI-Venus-Ground: Specializes in grounding; given an instruction and a screenshot, it predicts the target bounding box or its center point, operating in a “no-think” mode (direct answer prediction for fast inference).
- UI-Venus-Navi: Handles navigation; processes both the current screenshot and concatenated history of prior thought–action pairs, outputting the next reasoning step (“think”) and UI action in sequence. The navigation agent maintains a dynamic history updated at each step, enabling multilayered, chain-of-thought reasoning and explicit multi-step planning.
The model architecture extends Qwen2.5-VL (with 7B and 72B parameter variants), adding visual feature alignment layers and custom heads for action/coordinate prediction. The screenshot stream and historical context are encoded and fused via cross-modal attention before the final policy head.
This unified, screenshot-centric input structure circumvents the need for pre-parsed UI element trees, contrasting with two-stage systems or ones requiring auxiliary planning modules.
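To make the screenshot-plus-history contract concrete, the following is a minimal, hypothetical sketch of one UI-Venus-Navi decision step: the instruction, the running thought-action history, and the current screenshot are packed into a single multimodal prompt, and a “think” string plus an action are parsed from the model output. All names here (`NaviAgent`, `mllm_generate`, the prompt layout) are illustrative assumptions, not the released interface.

```python
from dataclasses import dataclass, field

@dataclass
class HistoryItem:
    thought: str
    action: str

@dataclass
class NaviAgent:
    instruction: str
    history: list[HistoryItem] = field(default_factory=list)

    def step(self, screenshot_path: str, mllm_generate) -> HistoryItem:
        # Concatenate prior thought-action pairs as textual context for the model.
        history_text = "\n".join(f"Thought: {h.thought}\nAction: {h.action}" for h in self.history)
        prompt = (f"Instruction: {self.instruction}\n"
                  f"History:\n{history_text}\n"
                  "Decide the next step. Output a Thought line, then an Action line.")
        # `mllm_generate` stands in for any Qwen2.5-VL-style multimodal generate call.
        raw = mllm_generate(image=screenshot_path, text=prompt)  # e.g. "Thought: ...\nAction: Click(x=412, y=980)"
        thought, action = raw.split("\nAction:", maxsplit=1)
        item = HistoryItem(thought=thought.removeprefix("Thought:").strip(), action=action.strip())
        self.history.append(item)  # dynamic history carried into the next step
        return item
```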
2. Training Methodology and Reinforcement Fine-Tuning (RFT)
UI-Venus applies reinforcement fine-tuning (RFT) as its core optimization paradigm, departing from standard supervised fine-tuning (SFT). RFT is realized via GRPO, a KL-regularized, group-based variant of PPO tailored to address sample efficiency and reward sparsity in UI tasks. Key features:
- Multiple Rollouts per Query: For each prompt $q$, the system samples a group of $G$ rollouts $\{o_1, \dots, o_G\}$ from the current policy, with associated rewards $\{r_1, \dots, r_G\}$.
- Relative Advantage Normalization: Rewards are normalized within the rollout group, $\hat{A}_i = \dfrac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$.
- Training Objective: A KL-regularized, PPO-style clipped surrogate averaged over the group,

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.$$
- Sample Efficiency: Only several hundred thousand high-quality samples are required, as opposed to millions typically seen in SFT domains.
RFT is essential for achieving high precision on grounding (localization) and navigation (action sequencing) under sparse and delayed reward settings intrinsic to complex UI tasks.
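As a concrete illustration of the group-relative update, here is a minimal PyTorch-style sketch of the advantage normalization and the clipped, KL-regularized loss described above. Hyperparameter values, tensor layouts, and function names are assumptions for exposition, not the paper's implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), scalar rewards for the G rollouts of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,   # (G,) sequence log-probs under the current policy
              logp_old: torch.Tensor,   # (G,) sequence log-probs under the sampling policy
              kl_to_ref: torch.Tensor,  # (G,) estimated KL to the frozen reference policy
              rewards: torch.Tensor,    # (G,) task rewards (e.g., grounding/navigation rewards, Section 3)
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    adv = group_relative_advantages(rewards).detach()     # advantages treated as constants
    ratio = torch.exp(logp_new - logp_old.detach())       # per-rollout importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()      # PPO-style clipped objective
    return -surrogate + kl_coef * kl_to_ref.mean()        # minimize: maximize surrogate, penalize drift
```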
3. Reward Engineering and Data Cleaning Protocols
UI-Venus utilizes carefully crafted reward functions, tailored per task:
- Grounding Task Rewards:
- Formatting Reward $R_{\mathrm{format}}$: binary; checks whether the prediction matches the required output format.
- Point-in-Box Reward $R_{\mathrm{point}}$: equals $1$ if the predicted point lies inside the ground-truth bounding box and $0$ otherwise.
- Total reward: a weighted combination, $R = w_{\mathrm{format}} R_{\mathrm{format}} + w_{\mathrm{point}} R_{\mathrm{point}}$ (sketched in code after this list).
- Navigation Task Rewards:
- Encompasses format correctness, action type accuracy, spatial thresholds (for point/scroll), and token-level F1-score for text.
- Rewards are aggregated per step to guide multi-stage action accuracy and fluency.
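The sketch below illustrates the reward components above: a binary format check, the point-in-box check, their weighted combination, and a token-level F1 for text actions. The output format, thresholds, and weight values are assumptions; the paper's exact settings may differ.

```python
import re

def format_reward(prediction: str) -> float:
    """1.0 if the output looks like '(x, y)' coordinates (assumed format), else 0.0."""
    return 1.0 if re.fullmatch(r"\(\s*\d+\s*,\s*\d+\s*\)", prediction.strip()) else 0.0

def point_in_box_reward(point: tuple[float, float], box: tuple[float, float, float, float]) -> float:
    """1.0 if (x, y) falls inside the ground-truth box (x1, y1, x2, y2), else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

def grounding_reward(prediction: str, point: tuple[float, float],
                     box: tuple[float, float, float, float],
                     w_format: float = 0.1, w_point: float = 0.9) -> float:
    """Weighted combination of the two grounding reward terms (weights assumed)."""
    return w_format * format_reward(prediction) + w_point * point_in_box_reward(point, box)

def token_f1(pred_text: str, gold_text: str) -> float:
    """Token-level F1 between predicted and reference text, used for text-entry actions."""
    pred, gold = pred_text.split(), gold_text.split()
    if not pred or not gold:
        return float(pred == gold)
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)
```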
A stringent data cleaning pipeline underlies the training process:
- Grounding: An aggregate of 627k open-source samples undergoes ambiguity filtering, manual correction, and redundancy re-sampling, yielding a final set of 107k “clean” examples.
- Navigation: Employs a multi-tiered filter-reconstruct-generate process, including CallUser step insertion, scroll-direction normalization, and iterative trace generation and validation by both annotators and models, resulting in ~350k high-quality navigation traces.
This protocol enhances label reliability and mitigates noise-induced performance loss.
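Purely for illustration, here is a small sketch of two of the reconstruction steps named above (scroll-direction normalization and CallUser insertion). The alias table, action schema, and the decision of when a CallUser step is needed are placeholders, not the paper's actual rules.

```python
# Assumed canonical mapping for inconsistent swipe/scroll annotations across datasets.
SCROLL_ALIASES = {"swipe_up": "scroll_down", "swipe_down": "scroll_up",
                  "scroll_up": "scroll_up", "scroll_down": "scroll_down"}

def normalize_scroll(action: dict) -> dict:
    """Map swipe/scroll aliases onto one canonical scroll-direction convention."""
    if action.get("type") in SCROLL_ALIASES:
        return {**action, "type": SCROLL_ALIASES[action["type"]]}
    return action

def insert_call_user(trace: list[dict], needs_user_info: bool) -> list[dict]:
    """Prepend a CallUser step when the task requires information only the user can supply."""
    if needs_user_info and not any(step.get("type") == "call_user" for step in trace):
        return [{"type": "call_user", "message": "Please provide the missing detail."}] + trace
    return trace
```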
4. Self-Evolving Trajectory History Alignment and Sparse Action Enhancement
To refine trajectory planning and ensure robust handling of rare actions:
- Self-Evolving History Alignment: After each training epoch, the model re-generates (via beam search) candidate thought–action histories for each sample. Only candidates that produce the correct action are retained, creating a “thought pool”; subsequent histories are updated by selecting those aligning with desired length and logical consistency. This iterative memory update mitigates drifting or hallucinated reasoning sequences.
- Sparse Action Enhancement: The agent faces long-tailed distributions where critical actions (e.g., LongPress, CallUser) are infrequent. The method samples an augmented set of historical contexts by combinatorially assembling thoughts from earlier steps. This enhances occurrence of rare actions during training, ensuring robust acquisition of such behaviors.
Both mechanisms are applied directly during RFT and contribute to superior performance in multi-step, context-dependent UI tasks compared to prior static approaches.
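The two mechanisms can be summarized with the following hedged sketch: a post-epoch thought-pool refresh that keeps only beam-search candidates whose action matches the annotation, and a rare-action augmentation pass that recombines pooled thoughts. `policy.beam_generate`, the sample fields, and the selection heuristic are illustrative placeholders for details the paper does not specify here.

```python
import itertools

def update_thought_pool(samples, policy, beam_size: int = 4, max_len: int = 256):
    """Self-evolving history alignment: refresh each sample's stored reasoning after an epoch."""
    for sample in samples:
        # Re-generate candidate thought-action continuations with beam search.
        candidates = policy.beam_generate(sample.screenshot, sample.instruction,
                                          sample.history, num_beams=beam_size)
        # Retain only candidates whose predicted action matches the annotated action.
        valid = [c for c in candidates if c.action == sample.gold_action]
        if not valid:
            continue                                        # keep the existing history if nothing aligns
        best = min(valid, key=lambda c: len(c.thought))     # prefer concise, consistent thoughts
        if len(best.thought) <= max_len:
            sample.history.append((best.thought, best.action))
            sample.thought_pool = valid                     # pool reused below for rare-action augmentation

def augment_rare_action_samples(samples, rare_actions=frozenset({"LongPress", "CallUser"}), k: int = 2):
    """Sparse action enhancement: build extra contexts for rare actions by recombining pooled thoughts."""
    augmented = []
    for sample in samples:
        pool = getattr(sample, "thought_pool", None)
        if sample.gold_action in rare_actions and pool:
            thoughts = [c.thought for c in pool]
            for combo in itertools.combinations(thoughts, min(k, len(thoughts))):
                augmented.append(sample.with_history(list(combo)))  # assumed copy-with-new-history helper
    return augmented
```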
5. Empirical Performance and Baseline Comparisons
UI-Venus establishes state-of-the-art results on multiple grounding and navigation benchmarks:
| Model/Variant | ScreenSpot-V2 | ScreenSpot-Pro | AndroidWorld (Navi) |
|---|---|---|---|
| UI-Venus 7B | 94.1% | 50.8% | 49.1% |
| UI-Venus 72B | 95.3% | 61.9% | 65.9% |
| GTA1 (open-source) | <94.1% | <50.8% | <49.1% |
| UI-TARS-1.5 (closed) | <95.3% | <61.9% | <65.9% |

The "<" entries indicate that the baseline scores fall below the corresponding UI-Venus results.
- Grounding: UI-Venus surpasses both prior open-source (GTA1) and closed-source (UI-TARS-1.5) models, particularly on the more challenging ScreenSpot-Pro benchmark.
- Navigation: Performance on AndroidWorld demonstrates superior planning/generalization, attributed to the history alignment and sparse action enhancements, as well as cleaner data and RFT methodology.
The unified, end-to-end approach, which requires no explicit planners, element trees, or scripted feature extraction, further distinguishes UI-Venus from existing systems.
6. Limitations and Open Problems
UI-Venus faces several unresolved challenges and active research topics:
- Hallucination Gap: A misalignment between internally generated "think" steps and the executed actions persists, potentially due to noisy or weakly supervised historical annotations; further work on consistency enforcement is needed.
- Planning/Memory: Advanced architectures or explicit trajectory memory pretraining may improve stability and performance in dynamic or real-world UI contexts.
- Reward Engineering: While current reward functions yield strong results, dynamic or more granular designs may further reduce instability, particularly in complex tasks where stepwise improvement is desirable.
- Multi-task Training: Joint optimization across grounding and navigation tasks requires careful balance to avoid reward conflicts; scaling up with more diverse trajectory data is an avenue for improved transfer and generalization.
A plausible implication is that continued progress in these areas may enable UI agents to handle increasingly complex or unseen UIs, further bridging the gap between artificial and human-level multimodal interaction competence.
7. Future Prospects and Community Impact
UI-Venus releases both code and data protocols, furthering standardization and reproducibility in UI agent research. The architecture’s modularity, strong empirical results, and self-improving strategies create a platform for future extensions—whether via larger-scale pretraining, reinforcement curricula, or integration with test-time adaptation and on-device deployment strategies. The comprehensive approach outlined sets a new technical baseline for screenshot-based UI automation and benchmarking (Gu et al., 14 Aug 2025).