Ferret-UI Lite: A Compact GUI Agent

Updated 1 October 2025
  • Ferret-UI Lite is a compact GUI agent that employs a two-stage training regime combining supervised fine-tuning with reinforcement learning to optimize multimodal interactions.
  • It integrates chain-of-thought reasoning and a visual zoom-in mechanism to enhance spatial localization and improve GUI grounding on resource-constrained devices.
  • The model leverages a diverse real and synthetic data mixture, achieving competitive performance in both GUI grounding and navigation tasks across various platforms.

Ferret-UI Lite is a compact, multimodal, end-to-end GUI agent designed for on-device deployment and capable of operating across diverse platforms, including mobile, web, and desktop environments. Targeted at efficient and autonomous Human-Computer Interaction (HCI), its small-scale design (3B parameters) combines data mixture engineering, chain-of-thought reasoning, and reinforcement learning with verifiable rewards to achieve competitive results on both GUI grounding and navigation tasks. Ferret-UI Lite distills insights from larger UI-understanding models into a form optimized for scenarios where computational and memory constraints are paramount (Yang et al., 30 Sep 2025).

1. Model Architecture and Development Techniques

Ferret-UI Lite employs a lightweight, end-to-end, multimodal architecture engineered specifically for embedded systems. The agent achieves strong performance within its restricted parameter budget (3B) through several core technical strategies:

  • Two-stage Training Regime: The first stage consists of supervised fine-tuning (SFT) over a heterogeneous GUI data mixture that encodes diverse user interactions and visual layouts. The second stage applies reinforcement learning (RL) with verifiable rewards (RLVR), enabling domain adaptation and improved action reliability.
  • Chain-of-Thought (CoT) Reasoning: At inference, the model outputs a sequence of "think–plan–act" traces, generating not just actions but also intermediate textual reasoning—such as planning statements, action-specific analyses, and reflective self-assessments—which improves long-horizon navigation robustness.
  • Visual Tool-Use with Zoom-In Mechanism: For fine-grained grounding, the agent predicts an action region, then crops the input image to focus on this region and runs a secondary prediction. This approach mimics human attention and reduces the need for full-resolution, global processing.
  • Unified Action Representation: Actions are encoded as function calls with fixed parameter structures, rather than as free-form text. This design facilitates direct mapping to tool-use interfaces and improves both interpretability and downstream extraction.

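The unified action representation described above can be made concrete with a small sketch. The schema below is illustrative only: the specific action names and fields are assumptions rather than the paper's exact interface, but it shows how fixed-parameter function calls simplify downstream parsing compared to free-form text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict
import json

@dataclass
class GUIAction:
    """A GUI action encoded as a function call with a fixed parameter structure.

    Action names and fields here are illustrative; the paper's exact schema may differ.
    """
    name: str                                # e.g. "click", "type", "scroll"
    params: Dict[str, Any] = field(default_factory=dict)

    def to_call(self) -> str:
        """Serialize to a function-call string that the model can emit and a parser can recover."""
        args = ", ".join(f"{k}={json.dumps(v)}" for k, v in self.params.items())
        return f"{self.name}({args})"

# Example: a click grounded to a normalized screen coordinate and a text-entry action.
click = GUIAction("click", {"x": 0.42, "y": 0.87})
type_text = GUIAction("type", {"text": "weather tomorrow"})
print(click.to_call())      # click(x=0.42, y=0.87)
print(type_text.to_call())  # type(text="weather tomorrow")
```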
This combination of techniques is directly tailored to maximize the utility of a small parameter footprint while retaining flexibility and applicability across platform types and interface modalities.

2. Data Mixture Design

The training data for Ferret-UI Lite integrates real and synthetic sources in a unified annotation schema, thereby maximizing the model's ability to generalize across domains:

  • Real Data Sources: Benchmarks such as GroundUI, OSAtlas, UGround, Aria-UI, Aguvis, WaveUI, ShowUI, Jedi, and AgentNet supply cross-platform, human-annotated data covering diverse bounding-box and pointing schemas. All annotations are normalized to a point-based representation (see the sketch after this list):

$$(x_{\mathrm{center}}, y_{\mathrm{center}}) = \left(\frac{x_{\min} + x_{\max}}{2},\ \frac{y_{\min} + y_{\max}}{2}\right)$$

  • Synthetic Data Generation: Synthetic samples expand coverage by creating denser high-resolution grounding layouts (via screenshot stitching), generating chain-of-thought traces using GPT-4 with visual input, and rewriting navigation goals into QA pairs. Additionally, a multi-agent online simulator generates action rollouts, deliberately injecting errors and recovery paths to enrich the navigation experience.

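A minimal sketch of the bounding-box-to-point normalization described above, assuming boxes arrive in pixel coordinates and that centers are rescaled to the [0, 1] range by image size (the rescaling convention is an assumption; the paper only specifies the center-point formula):

```python
def bbox_to_center(x_min: float, y_min: float, x_max: float, y_max: float,
                   img_w: int, img_h: int) -> tuple[float, float]:
    """Map a bounding-box annotation to the unified point representation.

    The center is ((x_min + x_max) / 2, (y_min + y_max) / 2) as in the text;
    dividing by image size to get [0, 1] coordinates is an assumed convention.
    """
    x_center = (x_min + x_max) / 2.0
    y_center = (y_min + y_max) / 2.0
    return x_center / img_w, y_center / img_h

# Example: a 120x40 button whose top-left corner sits at (600, 300) on a 1920x1080 screenshot.
print(bbox_to_center(600, 300, 720, 340, 1920, 1080))  # (0.34375, 0.2962962962962963)
```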
The diversity of this data mixture not only enhances the agent's grounding and navigation capabilities but also equips the model to handle previously unseen interaction scenarios during inference.

3. Inference-Time Enhancements

Several techniques are employed at inference to bolster the model's accuracy and robustness:

  • Multi-step Chain-of-Thought Reasoning: The agent’s output is split across three reasoning segments—plan (succinct action intent), action think (analysis of UI elements and historical context), and reflect (goal-oriented assessment). This division clarifies the agent's internal logic, especially in multi-turn tasks.
  • Visual Zoom-In Tool-Use: Rather than relying on global predictions from low-resolution whole-screen processing, the agent predicts an initial region and then performs a refined prediction on a cropped zoom-in image patch. This two-step procedure improves spatial localization and precision, which is particularly important for small target UI elements in dense or cluttered layouts.

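The zoom-in procedure can be sketched as follows, assuming a predictor callable that maps an image to a normalized point; the crop size and the coordinate remapping are assumptions used to illustrate the coarse-to-fine idea rather than the paper's exact implementation.

```python
from typing import Callable, Tuple
from PIL import Image

Point = Tuple[float, float]  # normalized (x, y) in [0, 1]

def zoom_in_ground(image: Image.Image,
                   predict_point: Callable[[Image.Image], Point],
                   crop_frac: float = 0.25) -> Point:
    """Coarse-to-fine grounding: predict on the full screenshot, crop around the
    coarse point, re-predict on the crop, then map the refined point back."""
    W, H = image.size

    # Stage 1: coarse prediction on the whole screenshot.
    cx, cy = predict_point(image)

    # Crop a window of size crop_frac * (W, H) centered on the coarse point,
    # clamped to the image bounds.
    cw, ch = int(W * crop_frac), int(H * crop_frac)
    left = min(max(int(cx * W - cw / 2), 0), W - cw)
    top = min(max(int(cy * H - ch / 2), 0), H - ch)
    crop = image.crop((left, top, left + cw, top + ch))

    # Stage 2: refined prediction on the zoomed-in patch.
    rx, ry = predict_point(crop)

    # Map the refined, crop-relative point back to full-image coordinates.
    return (left + rx * cw) / W, (top + ry * ch) / H
```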
These inference strategies are critical for small models, where token and feature capacity limitations otherwise yield degraded fine-grained prediction performance.

4. Reinforcement Learning with Verifiable Rewards

Reinforcement learning in Ferret-UI Lite is conducted via Group Relative Policy Optimization (GRPO) and is grounded in reward schemes that admit external verification:

  • Sparse and Dense Reward Decomposition: For each sampled candidate output $z_i = (c_i, a_i)$, where $c_i$ denotes the chain-of-thought and $a_i = [\tau_i; \theta_i]$ comprises the action type and parameters, the overall reward $r_i$ is:

$$r_i = f_{\mathrm{type}}(\tau_i, \tau^{\mathrm{(gt)}}, \theta^{\mathrm{(gt)}}) + f_{\mathrm{param}}(\theta_i, \theta^{\mathrm{(gt)}})$$

with $f_{\mathrm{type}}$ defined as:

$$f_{\mathrm{type}}(\tau_i, \tau^{\mathrm{(gt)}}, \theta^{\mathrm{(gt)}}) = \begin{cases} 2 & \text{if } \tau_i = \tau^{\mathrm{(gt)}} \text{ and no parameters are needed } (\theta^{\mathrm{(gt)}} = \emptyset) \\ 1 & \text{if } \tau_i = \tau^{\mathrm{(gt)}} \text{ and parameters are required} \\ 0 & \text{otherwise} \end{cases}$$

For location-based actions, the dense parameter reward is:

$$f_{\mathrm{param}}^{\mathrm{dense}}(\theta_i, \theta^{\mathrm{(gt)}}) = \max\left\{1 - \lambda\left(\frac{|x_i - x^{\mathrm{(gt)}}|}{w} + \frac{|y_i - y^{\mathrm{(gt)}}|}{h}\right),\ 0\right\}$$

where $\lambda$ is typically 0.5, and $w$, $h$ are the width and height of the target element.

  • Online Filtering: Prompts whose sampled rollouts all yield uniform (uninformative) rewards are filtered out, steering learning toward samples on which the model's strategy remains ambiguous or improvable.

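The verifiable reward above can be written down directly. The sketch below follows the formulas for $f_{\mathrm{type}}$ and $f_{\mathrm{param}}^{\mathrm{dense}}$, adds a GRPO-style group-relative advantage normalization, and illustrates online filtering of uninformative prompts; the decision of which action types carry coordinate parameters, and the exact advantage normalization, are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple

def f_type(tau: str, tau_gt: str, gt_has_params: bool) -> float:
    """Sparse reward for the action type (2 / 1 / 0, as in the definition above)."""
    if tau != tau_gt:
        return 0.0
    return 1.0 if gt_has_params else 2.0

def f_param_dense(pred_xy: Tuple[float, float], gt_xy: Tuple[float, float],
                  w: float, h: float, lam: float = 0.5) -> float:
    """Dense reward for location parameters: normalized L1 distance to the
    ground-truth point, scaled by the target element's width and height."""
    dx = abs(pred_xy[0] - gt_xy[0]) / w
    dy = abs(pred_xy[1] - gt_xy[1]) / h
    return max(1.0 - lam * (dx + dy), 0.0)

def total_reward(tau: str, tau_gt: str,
                 pred_xy: Optional[Tuple[float, float]],
                 gt_xy: Optional[Tuple[float, float]],
                 w: float, h: float) -> float:
    """Combine the sparse type reward and, for location-based actions, the dense parameter reward."""
    has_params = gt_xy is not None
    r = f_type(tau, tau_gt, has_params)
    if has_params and tau == tau_gt and pred_xy is not None:
        r += f_param_dense(pred_xy, gt_xy, w, h)
    return r

def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """GRPO-style group-relative advantages; returns None for uninformative
    prompts whose sampled rewards are all identical (online filtering)."""
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return None
    mu = mean(rewards)
    return [(r - mu) / sigma for r in rewards]
```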
The combination of SFT initialization and RLVR fine-tuning allows explanation-augmented action planning, enhanced reward sensitivity, and task success optimization.

5. Benchmarks and Empirical Performance

Ferret-UI Lite exhibits competitive performance relative to similarly sized models on a variety of GUI grounding and navigation tasks:

| Benchmark | Task Type | Ferret-UI Lite (3B) Score |
|---|---|---|
| ScreenSpot-V2 | Grounding | 91.6% |
| ScreenSpot-Pro | Grounding | 53.3% |
| OSWorld-G | Grounding | 61.2% (see note) |
| AndroidWorld | Navigation | 28.0% |
| OSWorld | Navigation | 19.8% (50 steps) |

On ScreenSpot-Pro, Ferret-UI Lite surpasses alternative 3B agents by over 15%. For navigation (AndroidWorld, OSWorld), metrics are success rates under fixed step budgets (typically 15–50). Performance is reported as an average over five runs and approaches the capabilities of larger 7B–13B models, especially on grounding.

Note: Table 1 in the paper reports approximately 55.3% for OSWorld-G, while the abstract states 61.2%. This discrepancy suggests cross-experiment variability, but the agent remains among the top performers in its size category (Yang et al., 30 Sep 2025).

6. Key Insights and Lessons Learned

The construction and evaluation of Ferret-UI Lite yield several salient lessons:

  • Data Diversity: Incorporating data from real-world human annotation and synthetically generated rollouts (especially CoT) is decisive for achieving generalization across GUIs and task types.
  • Inference-Time Reasoning: Techniques such as plan–action–reflection output and visual zoom-in empower small-scale models to rival much larger agents on spatial precision tasks.
  • Training Pipeline: Two-phase training (SFT then RLVR) balances broad coverage with performance fine-tuning, but care in reward function design is required. In particular, small agents are highly sensitive to the structure of both sparse (categorical) and dense (coordinate) rewards.
  • Model Limitations: Despite high grounding performance, navigation in extended or more complex settings remains challenging for models at this scale (around 3B parameters). This suggests an inherent trade-off between footprint and long-horizon reasoning capacity.

7. Context within the GUI Agent Ecosystem

Ferret-UI Lite synthesizes advances from previous work such as Lexi (Banerjee et al., 2023), which introduced self-supervised visio-linguistic UI representations, and the Ferret-UI and Ferret-UI 2 frameworks (You et al., 8 Apr 2024, Li et al., 24 Oct 2024), which established scalable, cross-platform, and high-resolution perception for LLM-based agents. The evolution towards Lite agents reflects a pragmatic adaptation for edge and mobile settings, grounded in quantitative ablation studies and informed by the benchmarking practices in the broader UI understanding literature.

The development methodology, particularly the focus on chain-of-thought tool-use and RLVR, distinguishes Ferret-UI Lite from prior systems that rely more strictly on imitation learning or large-scale generative pretraining. The explicit control structures and evaluation metrics adopted in Ferret-UI Lite align with current standards for on-device HCI agents.


Ferret-UI Lite thus embodies a resource-efficient approach to GUI understanding and manipulation, substantiated by targeted architectural, data, and training choices which collectively set a reference point for future compact and autonomous UI agents (Yang et al., 30 Sep 2025).
