Ferret-UI Lite: A Compact GUI Agent
- Ferret-UI Lite is a compact GUI agent that employs a two-stage training regime combining supervised fine-tuning with reinforcement learning to optimize multimodal interactions.
- It integrates chain-of-thought reasoning and a visual zoom-in mechanism to enhance spatial localization and improve GUI grounding on resource-constrained devices.
- The model leverages a diverse real and synthetic data mixture, achieving competitive performance in both GUI grounding and navigation tasks across various platforms.
Ferret-UI Lite is a compact, multimodal, end-to-end GUI agent designed for on-device deployment and capable of operating across diverse platforms, including mobile, web, and desktop environments. Targeted at efficient and autonomous Human–Computer Interaction (HCI), its small-scale design (3B parameters) combines advanced machine learning methodologies, ranging from data mixture engineering and chain-of-thought reasoning to reinforcement learning with verifiable rewards, while achieving competitive results on both GUI grounding and navigation tasks. Ferret-UI Lite distills insights from larger UI-understanding models into a design optimized for scenarios where computational and memory constraints are paramount (Yang et al., 30 Sep 2025).
1. Model Architecture and Development Techniques
Ferret-UI Lite employs a lightweight, end-to-end, multimodal architecture engineered specifically for embedded systems. The agent realizes strong performance in its restricted parameter budget (3B) through several core technical strategies:
- Two-stage Training Regime: The first stage consists of supervised fine-tuning (SFT) over a heterogeneous GUI data mixture that encodes diverse user interactions and visual layouts. The second stage applies reinforcement learning (RL) with verifiable rewards (RLVR), enabling domain adaptation and improved action reliability.
- Chain-of-Thought (CoT) Reasoning: At inference, the model outputs a sequence of "think–plan–act" traces, generating not just actions but also intermediate textual reasoning—such as planning statements, action-specific analyses, and reflective self-assessments—which improves long-horizon navigation robustness.
- Visual Tool-Use with Zoom-In Mechanism: For fine-grained grounding, the agent predicts an action region, then crops the input image to focus on this region and runs a secondary prediction. This approach mimics human attention and reduces the need for full-resolution, global processing.
- Unified Action Representation: Actions are encoded as function calls with fixed parameter structures, rather than as free-form text. This design facilitates direct mapping to tool-use interfaces and improves both interpretability and downstream extraction.
This combination of techniques is directly tailored to maximize utility of a small parameter footprint while retaining flexibility and applicability across platform types and interface modalities.
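The unified action representation and chain-of-thought segments can be illustrated with a short sketch; the field names and action vocabulary below are hypothetical, not the paper's actual schema:

```python
# Illustrative sketch of a "think-plan-act" step with a function-call action.
# All field names here are assumptions for illustration, not the paper's schema.
agent_step = {
    "plan": "Open the settings page to change the display language.",
    "action_think": "The gear icon in the top-right corner opens settings.",
    "action": {
        "name": "click",                      # fixed action vocabulary
        "parameters": {"x": 0.93, "y": 0.06}  # normalized point coordinates
    },
    "reflect": "Settings should now be visible; next locate 'Language'.",
}

def extract_action(step: dict) -> tuple[str, dict]:
    """Parse the function-call action out of a model step: because actions
    have a fixed parameter structure, they map directly to a tool-use
    interface with no free-form text parsing."""
    act = step["action"]
    return act["name"], act["parameters"]

name, params = extract_action(agent_step)
```

Because the action is a structured function call rather than free-form text, downstream extraction reduces to key lookups, which is part of what the fixed-parameter design buys.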
2. Data Mixture Design
The training data for Ferret-UI Lite integrates real and synthetic sources in a unified annotation schema, thereby maximizing the model's ability to generalize across domains:
- Real Data Sources: Benchmarks such as GroundUI, OSAtlas, UGround, Aria-UI, Aguvis, WaveUI, ShowUI, Jedi, and AgentNet supply cross-platform, human-annotated data covering diverse bounding-box and pointing schemas. All annotations are normalized to a unified point-based representation.
- Synthetic Data Generation: Synthetic samples expand coverage by creating denser high-resolution grounding layouts (via screenshot stitching), generating chain-of-thought traces using GPT-4 with visual input, and rewriting navigation goals into QA pairs. Additionally, a multi-agent online simulator generates action rollouts, deliberately injecting errors and recovery paths to enrich the navigation training data with error-recovery behavior.
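The point-based normalization of real-data annotations can be sketched as mapping each bounding-box label to its normalized center point; the center-point convention is an assumption here, and the paper's exact scaling may differ:

```python
def bbox_to_point(x1, y1, x2, y2, img_w, img_h):
    """Normalize a bounding-box annotation (in pixels) to a point in [0, 1]^2.
    Assumes the common center-point convention; this is a sketch of the
    unification step, not the paper's exact formula."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    return cx, cy
```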
The diversity of this data mixture not only enhances the agent's grounding and navigation capabilities but also equips the model to handle previously unseen interaction scenarios during inference.
3. Inference-Time Enhancements
Several techniques are employed at inference to bolster the model's accuracy and robustness:
- Multi-step Chain-of-Thought Reasoning: The agent’s output is split across three reasoning segments—plan (succinct action intent), action think (analysis of UI elements and historical context), and reflect (goal-oriented assessment). This division clarifies the agent's internal logic, especially in multi-turn tasks.
- Visual Zoom-In Tool-Use: Rather than relying on global predictions from low-resolution whole-screen processing, the agent predicts an initial region and then performs a refined prediction on a cropped zoom-in image patch. This two-step procedure improves spatial localization and precision, which is particularly important for small target UI elements in dense or cluttered layouts.
These inference strategies are critical for small models, where token and feature capacity limitations otherwise yield degraded fine-grained prediction performance.
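The two-step zoom-in procedure can be sketched as follows; `model.predict_point`, `image.size`, and `image.crop` are assumed interfaces (Pillow-style), and the crop scale is an illustrative choice:

```python
def ground_with_zoom(model, image, query, crop_scale=3.0):
    """Two-step grounding sketch: a coarse point prediction on the full
    screenshot, then a refined prediction on a zoomed crop around it.
    `model.predict_point` (returns normalized (x, y)) and the Pillow-style
    `image.size`/`image.crop` interfaces are assumptions for illustration."""
    W, H = image.size
    # Step 1: coarse prediction on the whole screen.
    cx, cy = model.predict_point(image, query)
    # Step 2: crop a window around the coarse point, clamped to the image.
    cw, ch = W / crop_scale, H / crop_scale
    left = min(max(cx * W - cw / 2, 0), W - cw)
    top = min(max(cy * H - ch / 2, 0), H - ch)
    crop = image.crop((left, top, left + cw, top + ch))
    # Refined prediction in the crop's local normalized coordinates.
    rx, ry = model.predict_point(crop, query)
    # Map the refined local point back to global normalized coordinates.
    return (left + rx * cw) / W, (top + ry * ch) / H
```

The refinement step sees the target region at roughly `crop_scale` times higher effective resolution, which is where the gain on small UI elements comes from.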
4. Reinforcement Learning with Verifiable Rewards
Reinforcement learning in Ferret-UI Lite is conducted via Group Relative Policy Optimization (GRPO) and is grounded in reward schemes that admit external verification:
- Sparse and Dense Reward Decomposition: For each sampled candidate output $o = (c, a)$, where $c$ denotes the chain-of-thought and $a = (a_{\text{type}}, a_{\text{param}})$ the action (action type and parameter), the overall reward is:
$$R(o) = R_{\text{type}}(a) + R_{\text{param}}(a)$$
with $R_{\text{type}}$ defined as:
$$R_{\text{type}}(a) = \mathbb{1}\left[a_{\text{type}} = a_{\text{type}}^{\text{gt}}\right]$$
For location-based actions, the dense parameter reward is:
$$R_{\text{param}}(a) = \mathbb{1}\left[\,|x - x^{\text{gt}}| \le \alpha w \;\wedge\; |y - y^{\text{gt}}| \le \alpha h\,\right]$$
where $\alpha$ is typically 0.5, and $w$, $h$ are the width/height of the target element.
- Online Filtering: Prompts whose sampled rollouts all receive identical (uninformative) rewards are filtered out, concentrating learning on prompts where the model's candidate outputs actually differ in quality.
The combination of SFT initialization and RLVR fine-tuning allows explanation-augmented action planning, enhanced reward sensitivity, and task success optimization.
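The reward decomposition and online filter can be sketched together; the indicator-style rewards follow the equations above, while the unweighted sum and the dict layout are assumptions for illustration:

```python
def action_reward(pred, gt, alpha=0.5):
    """Sparse type reward plus dense parameter reward for location actions.
    `pred`/`gt` are dicts with 'type' and, for location actions, a point
    ('x', 'y'); `gt` also carries the target element center and size
    ('w', 'h'). The unweighted sum is a sketch, not the paper's weighting."""
    r_type = 1.0 if pred["type"] == gt["type"] else 0.0
    r_param = 0.0
    if r_type and "x" in gt:  # location-based action: point-in-box indicator
        in_box = (abs(pred["x"] - gt["x"]) <= alpha * gt["w"]
                  and abs(pred["y"] - gt["y"]) <= alpha * gt["h"])
        r_param = 1.0 if in_box else 0.0
    return r_type + r_param

def keep_prompt(group_rewards):
    """Online filtering: drop prompts whose sampled group has uniform
    rewards, since identical rewards give zero group-relative advantage
    and hence no learning signal under GRPO."""
    return len(set(group_rewards)) > 1
```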
5. Benchmarks and Empirical Performance
Ferret-UI Lite exhibits competitive performance relative to similarly sized models on a variety of GUI grounding and navigation tasks:
| Benchmark | Task Type | Ferret-UI Lite (3B) Score |
|---|---|---|
| ScreenSpot-V2 | Grounding | 91.6% |
| ScreenSpot-Pro | Grounding | 53.3% |
| OSWorld-G | Grounding | 61.2% (see note) |
| AndroidWorld | Navigation | 28.0% |
| OSWorld | Navigation | 19.8% (50 steps) |
On ScreenSpot-Pro, Ferret-UI Lite surpasses alternative 3B agents by over 15%. For navigation (AndroidWorld, OSWorld), metrics are success rates under fixed step budgets (typically 15–50). Performance is reported as averages over five runs and is shown to approach the capabilities of larger 7B–13B models, especially on grounding.
Note: Table 1 in the paper notes approximately 55.3% for OSWorld-G; the abstract states 61.2%. This suggests that cross-experiment variability exists but the agent remains among the top in its size category (Yang et al., 30 Sep 2025).
6. Key Insights and Lessons Learned
The construction and evaluation of Ferret-UI Lite yield several salient lessons:
- Data Diversity: Incorporating data from real-world human annotation and synthetically generated rollouts (especially CoT) is decisive for achieving generalization across GUIs and task types.
- Inference-Time Reasoning: Techniques such as plan–action–reflection output and visual zoom-in empower small-scale models to rival much larger agents on spatial precision tasks.
- Training Pipeline: Two-phase training (SFT then RLVR) balances broad coverage with performance fine-tuning, but care in reward function design is required. In particular, small agents are highly sensitive to the structure of both sparse (categorical) and dense (coordinate) rewards.
- Model Limitations: Despite high grounding performance, navigation in extended or more complex settings remains challenging for models under 3B parameters. This suggests an inherent trade-off between footprint and long-horizon reasoning capacity.
7. Context within the GUI Agent Ecosystem
Ferret-UI Lite synthesizes advances from previous work such as Lexi (Banerjee et al., 2023), which introduced self-supervised visio-linguistic UI representations, and the Ferret-UI and Ferret-UI 2 frameworks (You et al., 8 Apr 2024, Li et al., 24 Oct 2024), which established scalable, cross-platform, and high-resolution perception for LLM-based agents. The evolution towards Lite agents reflects a pragmatic adaptation for edge and mobile settings, grounded in quantitative ablation studies and informed by the benchmarking practices in the broader UI understanding literature.
The development methodology, particularly the focus on chain-of-thought tool-use and RLVR, distinguishes Ferret-UI Lite from prior systems that rely more strictly on imitation or large-scale generative pretraining. The explicit probabilistic control structures and evaluation metrics adopted in Ferret-UI Lite align with the most current standards for on-device HCI agents.
Ferret-UI Lite thus embodies a resource-efficient approach to GUI understanding and manipulation, substantiated by targeted architectural, data, and training choices which collectively set a reference point for future compact and autonomous UI agents (Yang et al., 30 Sep 2025).