Ferret-UI: Advanced Multimodal UI Framework

Updated 19 August 2025
  • Ferret-UI is a multimodal model suite that precisely grounds and interprets UI screenshots using advanced spatial representations and adaptive gridding.
  • It applies resolution-adaptive encoding and dynamic grid optimization to maintain detailed UI element fidelity across mobile and cross-platform interfaces.
  • The framework leverages instruction-following training on diverse datasets to support tasks from basic icon recognition to complex UI function inference, enabling practical automation and accessibility.

Ferret-UI refers to a suite of multimodal LLMs (MLLMs) and associated methodologies designed for advanced, grounded understanding, reasoning, and interaction with user interface (UI) screens across devices. The Ferret-UI family spans three major research thrusts: (1) refer-and-ground systems for flexible spatial interaction (You et al., 2023); (2) mobile interface comprehension with resolution-adaptive architectures (You et al., 8 Apr 2024); and (3) universal cross-platform UI models with dynamic gridding and advanced task training (Li et al., 24 Oct 2024). These developments position Ferret-UI as a leading framework for robust, precise UI understanding and multimodal human-machine interaction.

1. Foundations and Model Architecture

Ferret-UI models are extensions of the Ferret MLLM paradigm, which pioneered hybrid region representations for spatially referenced, open-vocabulary grounding in images. The original Ferret model combines discrete coordinates (tokenized into a fixed number of bins) and continuous visual features obtained from spatial-aware visual sampling applied to CLIP-ViT-L/14 encoded feature maps (You et al., 2023). For each spatial region, the model processes inputs of the form

$$\{x_{\min},\ y_{\min},\ x_{\max},\ y_{\max},\ f_{\text{region}}\}$$

where $f_{\text{region}}$ is a pooled, context-aware embedding extracted via a multistage cascade of farthest point sampling, local neighborhood aggregation, and learned feature fusion.
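The serialization of such a region into the model's input sequence can be pictured with a short sketch. The snippet below is illustrative only, not the released implementation: the bin count, token format, and placeholder name are assumptions. Box corners are quantized into discrete coordinate bins, and a placeholder token marks where the continuous region feature is later spliced in.

```python
# Illustrative sketch of Ferret's hybrid region representation: discrete,
# binned coordinate tokens plus a placeholder for the continuous region
# feature f_region. NUM_BINS and the token format are assumptions.

NUM_BINS = 1000                      # assumed coordinate vocabulary size
REGION_FEA_TOKEN = "<region_fea>"    # placeholder replaced by f_region downstream

def quantize(v: float, size: int) -> int:
    """Map a pixel coordinate to a bin index in [0, NUM_BINS - 1]."""
    return min(NUM_BINS - 1, int(v / size * NUM_BINS))

def region_to_tokens(box, img_w, img_h) -> str:
    """Serialize a box (xmin, ymin, xmax, ymax) into hybrid region tokens."""
    xmin, ymin, xmax, ymax = box
    coords = [
        quantize(xmin, img_w), quantize(ymin, img_h),
        quantize(xmax, img_w), quantize(ymax, img_h),
    ]
    return "[" + ", ".join(map(str, coords)) + f"] {REGION_FEA_TOKEN}"

print(region_to_tokens((120, 40, 300, 90), img_w=828, img_h=1792))
# -> "[144, 22, 362, 50] <region_fea>"
```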

Ferret-UI (You et al., 8 Apr 2024) introduces an "any resolution" (anyres) module to handle the varied aspect ratios and fine-grained objects of mobile UIs. Each UI screen is split into multiple sub-images, with horizontal cuts for portrait screens and vertical cuts for landscape screens, to preserve detail. These sub-images, together with the global image, are encoded by the visual backbone and concatenated to form a rich multimodal input for the decoder-only LLM.
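A minimal sketch of this splitting logic, assuming a simple two-way split (the exact sub-image count and the downstream resizing/encoding pipeline are not shown here):

```python
# Minimal sketch of the "anyres" split: portrait screens are cut horizontally
# into top/bottom sub-images, landscape screens vertically into left/right
# sub-images; the global view plus sub-images are each encoded and their
# features concatenated downstream. A two-way split is an assumption here.
from PIL import Image

def anyres_split(screen: Image.Image) -> list:
    w, h = screen.size
    if h >= w:   # portrait: horizontal cut
        subs = [screen.crop((0, 0, w, h // 2)), screen.crop((0, h // 2, w, h))]
    else:        # landscape: vertical cut
        subs = [screen.crop((0, 0, w // 2, h)), screen.crop((w // 2, 0, w, h))]
    return [screen] + subs   # global image first, then detail sub-images
```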

Ferret-UI 2 (Li et al., 24 Oct 2024) generalizes the architecture for universal platform coverage (iPhone, Android, iPad, Webpage, AppleTV). Its adaptive gridding mechanism partitions input images into variable sub-image grids, selecting the grid that minimizes the combined aspect-ratio and pixel distortion, $\Delta_{\text{best}} = \Delta_{\text{aspect}} \times \Delta_{\text{pixel}}$, while keeping the computational cost $N$ within budget, thereby preserving high-resolution feature fidelity across diverse screen shapes.
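The grid search itself is simple to sketch. The distortion terms below are plausible stand-ins chosen for illustration (the paper's exact definitions of $\Delta_{\text{aspect}}$ and $\Delta_{\text{pixel}}$ may differ), as are the patch size and sub-image budget:

```python
# Illustrative sketch of adaptive gridding: enumerate candidate (rows, cols)
# grids under a sub-image budget and pick the one minimizing the product of an
# aspect-ratio distortion term and a pixel (resolution) distortion term.
# The distortion formulas, patch size, and budget are assumptions.

def adaptive_grid(img_w: int, img_h: int, patch: int = 336, max_subimages: int = 9):
    best, best_cost = (1, 1), float("inf")
    for rows in range(1, max_subimages + 1):
        for cols in range(1, max_subimages // rows + 1):
            grid_w, grid_h = cols * patch, rows * patch
            img_aspect, grid_aspect = img_w / img_h, grid_w / grid_h
            # Aspect distortion: mismatch between image and grid aspect ratios.
            d_aspect = max(img_aspect, grid_aspect) / min(img_aspect, grid_aspect)
            # Pixel distortion: mismatch between image area and grid area.
            d_pixel = max(img_w * img_h, grid_w * grid_h) / min(img_w * img_h, grid_w * grid_h)
            cost = d_aspect * d_pixel          # Delta_best = Delta_aspect * Delta_pixel
            if cost < best_cost:
                best, best_cost = (rows, cols), cost
    return best                                # (rows, cols) of sub-image grid

print(adaptive_grid(1170, 2532))               # e.g. a portrait phone screenshot
```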

2. Training Methodologies and Data Curation

Grounded UI understanding in Ferret-UI is achieved via instruction-following, multi-task training on diverse, hierarchically annotated datasets. The GRIT dataset (You et al., 2023) for the original Ferret model comprises 1.1M region grounding/referring samples, including hard negatives for spatial reasoning robustness.

Ferret-UI (You et al., 8 Apr 2024) trains on UI screenshots from RICO (Android) and AMP (iPhone), with UI element annotations extracted via a custom detector. Tasks span elementary (icon recognition, OCR, widget listing) and advanced (screen description, function inference, conversational interaction) categories. Elementary data is generated via templated prompts (GPT-3.5 Turbo), while advanced samples use GPT-4 for open-ended instruction generation, ensuring rich task diversity and scalable annotation.
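As a rough illustration of the elementary-task pipeline, a hypothetical template-based generator might turn detector output into instruction-response pairs as below; the schema, field names, and template wording are invented for the example.

```python
# Hypothetical sketch of templated elementary-task sample generation from
# detected UI element annotations (labels and bounding boxes). The template
# text and sample schema are illustrative, not the paper's exact prompts.

def make_widget_listing_sample(screenshot_path: str, elements: list) -> dict:
    """elements: list of dicts like {"type": "button", "text": "Sign in",
    "box": (xmin, ymin, xmax, ymax)} produced by the UI element detector."""
    listing = "; ".join(
        f'{e["type"]} "{e["text"]}" at {e["box"]}' for e in elements
    )
    return {
        "image": screenshot_path,
        "instruction": "List the widgets visible on this screen with their locations.",
        "response": listing,
    }
```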

Ferret-UI 2 (Li et al., 24 Oct 2024) addresses data limitations by combining human-collected bounding box annotations with training data generated by GPT-4o using a set-of-mark (SoM) visual prompting scheme. In SoM, each UI element is overlaid with uniquely numbered minimal bounding boxes, explicitly encoding spatial correspondence for the model and the data generation LLM. This workflow systematically increases the quality and spatial clarity of advanced instruction-following examples, supporting robust user-centered UI understanding.
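A minimal sketch of the SoM overlay step, assuming detected element boxes are available (styling and label placement are arbitrary choices here):

```python
# Minimal sketch of set-of-mark (SoM) prompting as described: each UI element
# is overlaid with a minimal bounding box and a unique numeric mark so the
# data-generating LLM can refer to elements unambiguously.
from PIL import Image, ImageDraw

def overlay_set_of_marks(screenshot: Image.Image, boxes: list) -> Image.Image:
    """boxes: list of (xmin, ymin, xmax, ymax) for detected UI elements."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (xmin, ymin, xmax, ymax) in enumerate(boxes, start=1):
        draw.rectangle([xmin, ymin, xmax, ymax], outline="red", width=3)
        draw.text((xmin + 4, ymin + 4), str(idx), fill="red")
    return marked   # passed to GPT-4o together with numbered element metadata
```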

3. Unique Features and Innovations

Ferret-UI introduces several methodological and engineering advances beyond prior MLLMs:

  • Hybrid Spatial Representations: Integration of quantized coordinate tokens and pooled continuous region features allows precise grounding of arbitrarily shaped regions, from points and boxes to masks and scribbles (You et al., 2023).
  • Spatial-Aware Visual Sampling: A cascaded process of farthest point sampling, neighborhood gathering, and pooling recursively aggregates local visual features, supporting generalization to free-form, sparse, or dense region inputs (a minimal sketch of the sampling step follows this list).
  • Any-Resolution & Adaptive Gridding: Systematic splitting of UI images—either in fixed patterns (Ferret-UI) or dynamically optimized grids (Ferret-UI 2)—preserves fine-grained spatial details and avoids resolution distortion across platform and screen variations (Li et al., 24 Oct 2024).
  • Self-Contained UI Understanding: All models operate directly on raw UI screenshots, without dependency on XML, HTML, or auxiliary UI tree representations. This simplifies both training and deployment, enabling plug-and-play UI understanding from pixels alone.
  • Set-of-Mark Visual Prompting: In Ferret-UI 2, advanced data generation overlays numbered bounding boxes onto elements, explicitly aligning vision and language, which improves spatial grounding and user-centered instruction performance.
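The farthest point sampling step referenced in the spatial-aware visual sampling bullet can be sketched as follows; the subsequent neighborhood aggregation and feature-fusion stages are omitted.

```python
# Illustrative sketch of farthest point sampling (FPS): starting from one point
# inside the referred region, iteratively pick the point farthest from those
# already selected so the sampled subset covers the region evenly.
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """points: (N, 2) array of (x, y) coordinates inside a region; returns k indices."""
    n = points.shape[0]
    selected = [0]                                        # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)     # distance to current set
    for _ in range(1, min(k, n)):
        idx = int(np.argmax(dist))                        # farthest from current set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)
```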

4. Evaluation and Empirical Performance

Ferret-UI models have been evaluated on multiple public and custom benchmarks designed for both elementary and advanced UI understanding:

  • Elementary Tasks: Include OCR, widget classification, icon recognition, and region grounding, measured via exact match and Intersection-over-Union (see the IoU sketch after this list).
  • Advanced Tasks: Encompass detailed (possibly multi-turn) UI descriptions, perception/interaction conversations, and function inference—scored by GPT-4(V)/GPT-4o on open-endedness, correctness, and grounding quality.
  • Platform Coverage: The Ferret-UI 2 empirical suite includes nine subtasks × five platforms, GUIDE next-action prediction for web screenshots, and zero-shot GUI-World generalization.
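For concreteness, the Intersection-over-Union used to score region grounding can be computed as below; boxes are (xmin, ymin, xmax, ymax) in pixels.

```python
# Minimal sketch of the IoU metric used to score predicted grounding boxes
# against ground-truth boxes.

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)               # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 110, 110), (60, 60, 160, 160)))        # ~0.143 (50x50 overlap)
```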

Performance outcomes:

  • Ferret-UI surpasses GPT-4V and other open-source MLLMs on most elementary UI tasks, exhibiting higher accuracy and grounding precision (You et al., 8 Apr 2024).
  • Ferret-UI 2 achieves state-of-the-art results across user-centric advanced tasks, with GPT-4o evaluation scores exceeding Ferret-UI by over 40 points on key metrics (e.g., 89.73 on advanced tasks, and a multi-IoU of 55.78 on GUIDE with the Llama-3-8B backbone). Ablation studies confirm the additive effect of adaptive gridding and GPT-4o SoM-generated data on generalization and transfer (Li et al., 24 Oct 2024).
  • Platform transfer studies indicate robust zero-shot adaptation between iPhone, Android, and iPad, with certain domain gaps for highly distinct layouts (e.g., AppleTV landscape).

5. Applications and Implications

The capabilities of Ferret-UI position it as an enabling technology for a wide range of interface-driven applications:

  • Mobile Accessibility: Accurate spatial understanding and detailed element descriptions enable construction of assistive tools for visually impaired users.
  • Automated Testing and Usability: Integration with QA pipelines allows automated app testing and cross-device usability evaluation without manual element coding.
  • UI Automation and Navigation: Advanced reasoning and user-centered instruction following facilitate intelligent agents capable of navigation, form-filling, and troubleshooting directly on device screens.
  • Design and Synthesis Tools: Developers and designers can leverage Ferret-UI to extract UI structure, classify widgets, and infer functional intent from screenshots, streamlining iterative design and cross-platform adaptation.
  • Cross-modal Interactive Agents: The suite can be used to build agents that ground natural language queries in precise UI actions, forming the basis for explainable, multimodal digital assistants.

6. Limitations and Future Research

Current Ferret-UI models, while capable and extensible, have several open research fronts:

  • Platform and Orientation Generalization: Though Ferret-UI 2 improves cross-platform transfer, residual domain gaps remain for highly atypical orientations and platforms with rare or divergent layouts.
  • Data Generation Constraints: Synthesis of high-quality, spatially annotated, open-ended training data is non-trivial; continued improvements in visual prompt engineering and integration with high-capacity LLMs (e.g., GPT-4o) are critical.
  • Scalability to Larger Multimodal Backbones: Scaling transformer backbones and vision encoders could extend utility to even more complex, real-time, or large-scale UI systems, as well as richer multimodal fusion (e.g., video UIs).
  • UI/UX for Ferret-Driven Tools: While methodologies for UI dashboarding and interaction have been suggested (Editor's term: Ferret-UI dashboards), systematic user studies and toolkits with internal metric presentation (e.g., accuracy, grounding, IoU, GPT-4o evaluation) remain an open engineering effort.

7. Impact in the Broader MLLM Landscape

Ferret-UI constitutes a paradigm shift in grounded multimodal AI by integrating spatial reasoning, generalized UI comprehension, and user-centered interaction across visually diverse and structurally complex interfaces. Its sequential iterations—Ferret (region grounding), Ferret-UI (mobile adaptation, anyres), and Ferret-UI 2 (universal, adaptive gridding)—advance the frontiers of multimodal dialogue agents, automated UI understanding, and explainable interface automation. Empirical superiority demonstrated on GUIDE, GUI-World, and internal multi-platform benchmarks indicates the maturity and future potential of the Ferret-UI approach (You et al., 2023, You et al., 8 Apr 2024, Li et al., 24 Oct 2024).