- The paper presents a novel end-to-end, small on-device GUI agent that integrates visual and language models for effective GUI grounding and navigation.
- It employs supervised fine-tuning followed by reinforcement learning with verifiable rewards to optimize performance on diverse GUI tasks.
- Empirical results show competitive performance against larger models, demonstrating practical potential for deployment on resource-constrained devices.
Ferret-UI Lite: Design and Empirical Analysis of Small On-Device GUI Agents
Introduction
Ferret-UI Lite addresses the challenge of building compact, end-to-end agents capable of robustly interacting with graphical user interfaces (GUIs) across mobile, web, and desktop platforms. The work is motivated by the need for on-device deployment, where constraints on model size, latency, and privacy preclude the use of large-scale, server-side models. The paper systematically explores data curation, model architecture, training strategies, and evaluation protocols for a 3B-parameter multimodal LLM, providing a comprehensive empirical study and practical lessons for the development of small GUI agents.
Figure 1: An illustration of Ferret-UI Lite on a multi-step GUI navigation task. Human users prompt with a high-level goal in plain text, and the model autonomously interacts with GUI devices through tapping, scrolling, typing, etc., until the task is complete. At each step, the model observes the GUI screen, generates think-plan-act traces, and executes the action.
Model Architecture and Training Paradigm
Ferret-UI Lite employs a unified, end-to-end architecture that directly maps GUI screenshots and user instructions to low-level action policies. The model integrates a ViTDet-based image encoder with an LLM backbone, leveraging the AnyRes strategy for flexible input resolution partitioning. The action space is unified across platforms, supporting both shared and domain-specific operations via a function-call schema, which enhances interpretability and aligns with tool-use paradigms.
Figure 2: Model architecture and training recipes of Ferret-UI Lite. The model takes a GUI screen and the user instruction as inputs, and predicts chain-of-thought reasoning traces and a low-level action policy that directly controls GUI devices in an end-to-end manner. The model is trained through supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).
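The function-call action schema is not reproduced in full here, so the following is a minimal sketch of what a unified, tool-use-style action space could look like; the action names, argument fields, and normalized-coordinate convention are assumptions for illustration rather than the paper's exact interface.

```python
# Illustrative sketch of a unified, function-call-style GUI action space.
# Action names, argument fields, and the normalized-coordinate convention are
# assumptions; the paper's exact schema may differ.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class GUIAction:
    """A single low-level GUI action expressed as a function call."""
    name: str                                  # e.g. "click", "scroll", "type"
    arguments: Dict[str, Any] = field(default_factory=dict)

    def to_call(self) -> str:
        args = ", ".join(f"{k}={v!r}" for k, v in self.arguments.items())
        return f"{self.name}({args})"


# Shared actions usable on every platform, plus a mobile-specific one.
actions = [
    GUIAction("click", {"x": 0.42, "y": 0.17}),       # normalized screen coordinates
    GUIAction("type", {"text": "Lina Muller"}),
    GUIAction("scroll", {"direction": "down"}),
    GUIAction("open_app", {"app_name": "Broccoli"}),  # domain-specific (mobile)
]

for a in actions:
    print(a.to_call())
```

Expressing actions as typed calls keeps shared operations platform-agnostic while leaving room for domain-specific ones, which is what makes the schema interpretable and compatible with tool-use paradigms.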
Training is conducted in two stages:
- Supervised Fine-Tuning (SFT): The model is first trained on a mixture of human-annotated and synthetic datasets, covering both grounding and navigation tasks. Annotation formats are unified to ensure consistent supervision (see the sketch after this list).
- Reinforcement Learning with Verifiable Rewards (RLVR): RL is applied to further align the model with task success, using rule-based, automatically computable rewards for both grounding (containment-based) and navigation (action type and parameter matching, with both sparse and dense reward variants).
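The unified annotation format mentioned above is not specified in this summary; as a hypothetical illustration, consistent SFT supervision for grounding and navigation might look like the samples below, where the field names and the think-plan-act trace structure are assumptions.

```python
# Hypothetical unified SFT samples. Field names and the think/plan/action trace
# structure are illustrative assumptions, not the paper's exact format.
grounding_sample = {
    "image": "screenshot_001.png",
    "instruction": "Ground the element: 'Settings' gear icon",
    "target": {"name": "click", "x": 0.91, "y": 0.06},        # normalized point
}

navigation_sample = {
    "image": "screenshot_047.png",
    "instruction": "Delete the recipe 'Kale and Quinoa Salad' from the Broccoli app",
    "trace": {
        "think": "The recipe list is visible; the target entry is near the top.",
        "plan": "Open the recipe's context menu, then confirm deletion.",
        "action": {"name": "long_press", "x": 0.50, "y": 0.33},
    },
}
```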
Data Curation and Synthetic Data Generation
A key insight is the necessity of large-scale, diverse data for small models to generalize effectively. The SFT data mixture includes public benchmarks and systematically generated synthetic data, spanning multiple platforms and interaction types.
Figure 3: Synthetic navigation data generation pipeline, which consists of offline data generation based on human-annotated trajectories, and online rollouts collection from a multi-agent system.
Synthetic data generation encompasses:
- High-Resolution Grounding: Composite images with dense layouts to improve spatial localization (see the sketch after this list).
- Chain-of-Thought (CoT) Navigation: Synthetic reasoning traces (plan, action think, reflect) generated via GPT-4o and SoM visual prompting, supporting both short and long CoT variants.
- Synthetic QA and Replanning: Visual QA pairs and perturbed trajectories to enhance robustness and recovery strategies.
- Online Rollouts: Multi-agent systems generate diverse, error-prone trajectories, filtered by VLM-based judges for quality assurance.
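The high-resolution grounding item above can be made concrete with a small sketch: several screenshots are tiled onto one large canvas and each element box is re-expressed in the composite frame, yielding dense layouts for localization. The grid layout, tile size, and coordinate handling are assumptions for illustration.

```python
# Minimal sketch of composite high-resolution grounding data, assuming element
# boxes are given in the resized tile's coordinate frame. Grid layout, tile
# size, and file paths are illustrative assumptions.
from PIL import Image

def make_composite(tile_paths, elements_per_tile, grid=(2, 2), tile_size=(1280, 800)):
    """elements_per_tile: per tile, a list of (label, (x0, y0, x1, y1)) boxes."""
    cols, rows = grid
    canvas = Image.new("RGB", (cols * tile_size[0], rows * tile_size[1]), "white")
    composite_labels = []
    for i, (path, elems) in enumerate(zip(tile_paths, elements_per_tile)):
        tile = Image.open(path).convert("RGB").resize(tile_size)
        ox, oy = (i % cols) * tile_size[0], (i // cols) * tile_size[1]
        canvas.paste(tile, (ox, oy))
        for label, (x0, y0, x1, y1) in elems:
            composite_labels.append((label, (x0 + ox, y0 + oy, x1 + ox, y1 + oy)))
    return canvas, composite_labels   # dense layout -> harder localization targets
```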
Reinforcement Learning and Reward Design
RLVR is used to overcome the limitations of strict imitation in SFT, allowing the model to optimize for verifiable task success. For grounding, a containment-based reward is used, and a zoom-in operation is introduced to refine predictions on cropped regions, mimicking human attention mechanisms and improving accuracy on cluttered or high-resolution interfaces.
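A minimal sketch of this containment reward and the zoom-in refinement is given below, assuming point predictions against axis-aligned ground-truth boxes; the zoom factor and crop logic are illustrative rather than the paper's exact implementation.

```python
# Sketch of a containment-based grounding reward and a zoom-in crop.
# The point-in-box formulation and the zoom factor are assumptions.
def grounding_reward(pred_xy, gt_box):
    """Reward 1 if the predicted point lies inside the ground-truth box, else 0."""
    x, y = pred_xy
    x0, y0, x1, y1 = gt_box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0


def zoom_in_crop(image_size, pred_xy, zoom=3.0):
    """Crop a region around the first prediction; the model re-grounds on the crop."""
    w, h = image_size
    cw, ch = w / zoom, h / zoom
    x, y = pred_xy
    x0 = min(max(x - cw / 2, 0), w - cw)
    y0 = min(max(y - ch / 2, 0), h - ch)
    return (x0, y0, x0 + cw, y0 + ch)   # crop coordinates map back to the full image
```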
For navigation, the reward function decomposes correctness into action type and parameter fidelity, with dense rewards providing graded feedback based on normalized distance from ground-truth locations. Group Relative Policy Optimization (GRPO) is employed for stable and sample-efficient RL, with online filtering to focus on informative examples.
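The reward decomposition and GRPO's group-relative credit assignment can be sketched as follows; the type/parameter weighting, the distance scale, and the sparse threshold are assumptions made for illustration.

```python
# Sketch of the navigation reward (action type + location parameter) and the
# group-relative advantage used in GRPO. Weights, distance scale, and the
# sparse threshold are illustrative assumptions.
import math

def navigation_reward(pred, gt, dense=True, w_type=0.5, w_param=0.5):
    """pred/gt: {'name': str, optionally 'x', 'y'} with normalized coordinates."""
    if pred["name"] != gt["name"]:
        return 0.0                               # wrong action type: no credit
    if "x" in gt and "y" in gt:
        dist = math.hypot(pred.get("x", 0.0) - gt["x"], pred.get("y", 0.0) - gt["y"])
        param_r = max(0.0, 1.0 - dist) if dense else float(dist < 0.05)
    else:
        param_r = 1.0                            # actions without location arguments
    return w_type * 1.0 + w_param * param_r


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within a group of rollouts."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + 1e-8) for r in rewards]
```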
Empirical Results
GUI Grounding
Ferret-UI Lite achieves strong results on ScreenSpot-V2 (91.6%), ScreenSpot-Pro (53.3%), and OSWorld-G (55.3%), outperforming all other 3B models and narrowing the gap to 7B and larger models. Notably, on ScreenSpot-Pro, it surpasses UI-TARS-1.5 (7B) by over 15%. The model demonstrates robust performance across mobile, desktop, and web platforms, with high-resolution synthetic data providing significant gains on challenging benchmarks.
GUI Navigation
On AndroidWorld, Ferret-UI Lite attains a 28.0% success rate, competitive with 7B models. On OSWorld, it achieves 19.8% (with extended step limits), outperforming all 3B baselines. However, the absolute performance on long-horizon navigation remains limited compared to larger models, highlighting the inherent challenges of scaling down agentic reasoning and planning.
Figure 4: Successful completion of the task "Make sparkline charts for each order id with the data from Jan to Mar in the Chart column" in the OSWorld evaluation.
Figure 5: Successful completion of the task "Delete the following recipes from Broccoli app: Kale and Quinoa Salad, Baked Cod with Lemon and Dill, Raspberry Almond Smoothie" in the AndroidWorld evaluation.
Figure 6: Successful completion of the task "Create a new contact for Lina Muller. Their number is +15410831733" in the AndroidWorld evaluation.
Ablation Studies
- SFT vs RL: RL consistently improves grounding and navigation, with the zoom-in operation providing additional gains.
- Data Mixture Ratios: Balanced mixtures of grounding and navigation data yield the best results, with mutual benefits observed.
- Synthetic Data Scaling: Increasing the scale and diversity of synthetic data, especially with long CoT traces and online rollouts, leads to substantial improvements in navigation.
- Reward Design: Dense grounding rewards combined with action type rewards are most effective; reward design is critical for small model RL stability and performance.
Figure 7: SFT vs RL variants.
Figure 8: Impact of SFT steps for RL.
Figure 9: Evolution of the verifiable reward during RLVR training.
Practical and Theoretical Implications
Ferret-UI Lite demonstrates that with careful data curation, unified action schemas, and tailored RL reward structures, small on-device models can achieve competitive performance in GUI grounding and navigation. The results suggest that synthetic data generation, especially with diverse reasoning traces and online rollouts, is essential for compensating for limited model capacity. However, the persistent gap in long-horizon navigation underscores the need for further advances in compact model architectures, memory mechanisms, and hierarchical planning.
The work provides a reproducible recipe for building efficient, privacy-preserving GUI agents suitable for deployment on resource-constrained devices. The findings also inform the broader design of agentic systems, emphasizing the interplay between data diversity, reward engineering, and model architecture in low-resource settings.
Future Directions
Future research should explore:
- Hierarchical and Modular Architectures: To better decompose long-horizon tasks and leverage limited capacity.
- Continual and Online Learning: For adaptation to novel GUIs and user preferences in real-world deployment.
- Advanced Memory and Planning Mechanisms: To improve multi-step reasoning and error recovery in small models.
- Unified Evaluation Protocols: For more comprehensive benchmarking across platforms and task types.
Conclusion
Ferret-UI Lite establishes a strong empirical foundation for small, on-device GUI agents, demonstrating that with principled data curation, unified modeling, and reward-driven RL, compact models can approach the performance of much larger systems in grounding and navigation tasks. The work highlights both the promise and the current limitations of scaling down agentic AI, providing actionable insights and open challenges for the community.