Ferret-UI 2: Universal UI Understanding Model
- Ferret-UI 2 is a multimodal large language model that enables universal UI understanding across mobile, tablet, web, and TV platforms.
- It integrates a CLIP ViT-L/14 based vision encoder with a Vicuna-13B language decoder and Adaptive N-Gridding to achieve benchmark-leading zero-shot performance.
- The model leverages programmatically generated advanced interaction datasets and refined annotation strategies to significantly improve both elementary and complex UI tasks.
Ferret-UI 2 is a multimodal LLM (MLLM) developed for universal user interface (UI) understanding across multiple platforms, spanning mobile devices (iPhone, Android), tablets (iPad), web pages, and television interfaces (Apple TV). It advances on its predecessor by introducing platform-agnostic modeling, adaptive high-resolution visual processing, and programmatically generated advanced interaction datasets, resulting in significant gains in both elementary and complex UI understanding tasks. Empirical evaluations demonstrate robust cross-platform transfer and benchmark-leading zero-shot results.
1. Motivation and Core Challenges
The design of universal UI models is constrained by three primary factors: platform diversity, resolution variation, and data scarcity.
- Platform Diversity: Each platform—ranging from smartphones (iPhone, Android) to tablets (iPad), web pages, and TV interfaces (Apple TV)—presents distinct widget types, input modalities, annotation availability, and user interaction paradigms.
- Resolution Variation: Screen resolutions vary widely, from phones (e.g., 828×1792) to high-resolution tablets (2048×2732) and HDTVs (1920×1080), necessitating models that capture global context and fine-grained local features with equal competence, regardless of input shape.
- Data Limitations: High-quality, platform-specific annotations are rare. Earlier methods typically employ OCR- or detector-based proposals, lacking human-verified spatial precision. For advanced tasks (e.g., natural-language QA about submission workflows), prior datasets rely on coarse, text-only descriptions, omitting crucial spatial and contextual details.
These challenges motivate Ferret-UI 2’s architectural and data-centric innovations.
2. Model Architecture
Ferret-UI 2 builds upon the Ferret-UI any-resolution multimodal pipeline, structured as follows:
- Vision Encoder: Utilizes CLIP ViT-L/14 for two parallel processes:
- A low-resolution global feature embedding of the full screenshot.
- High-resolution local feature extraction by systematic subdivision (“gridding”) of the image into overlapping crops.
- Multimodal Fusion: All image embeddings (both global and local) are concatenated and prepended to the input sequence of the LLM. A “Visual Sampler” then attends to these features, responding adaptively to user instructions to output either textual descriptions or localized spatial coordinates.
- Language Decoder: Implements Vicuna-13B as the default text generation back-end, with experimental variants including Gemma-2B and Llama3-8B, for producing semantic labels, action descriptions, or dialogue.
Enhancements in Ferret-UI 2 relative to its predecessor include the adoption of Adaptive N-Gridding (see section 3), replacement of detector-based boxes with human-verified or HTML view-hierarchy annotations for precise spatial supervision, and a comprehensive upgrade of instruction-tuning data for both elementary and advanced tasks.
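A minimal sketch of the two-branch encoding and fusion described above, written in PyTorch; the module name, feature dimensions, and single linear projector are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class FerretUIStyleFusion(nn.Module):
    """Illustrative fusion of global and local visual features with text tokens.

    Assumes some vision encoder has already produced patch embeddings for the
    global view and the local crops; the projector aligns them to the LLM width.
    """

    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        self.projector = nn.Linear(vision_dim, text_dim)  # map visual features into the LLM embedding space

    def forward(self, global_feats, local_feats, text_embeds):
        # global_feats: (B, Ng, vision_dim) from the low-resolution full screenshot
        # local_feats:  (B, Nl, vision_dim) from the high-resolution grid crops
        # text_embeds:  (B, Nt, text_dim) embedded instruction tokens
        visual = torch.cat([global_feats, local_feats], dim=1)   # concatenate both visual streams
        visual = self.projector(visual)                          # project to the decoder's width
        return torch.cat([visual, text_embeds], dim=1)           # prepend visual tokens to the text sequence


# Toy usage with random tensors standing in for real encoder outputs.
fusion = FerretUIStyleFusion()
g = torch.randn(1, 576, 1024)       # e.g. CLIP ViT-L/14 patch tokens for the global view
l = torch.randn(1, 4 * 576, 1024)   # patch tokens for four local crops
t = torch.randn(1, 32, 4096)        # embedded user instruction
seq = fusion(g, l, t)               # sequence fed to the language decoder
print(seq.shape)                    # torch.Size([1, 2912, 4096])
```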
3. Technical Innovations
a. Multi-Platform Representation
Ferret-UI 2 processes UIs from iPhone, Android, iPad, Web, and Apple TV using a single parameter-sharing backbone. Widget labels from all sources are mapped into a unified 13-way class taxonomy (e.g., Checkbox, Button, TextField, Toggle). During training, class imbalance and domain skew are addressed through per-platform loss weighting and deliberate over-generation of advanced interaction tasks for under-represented platforms (iPad, Apple TV).
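A hypothetical sketch of the unified widget taxonomy and per-platform weighting; only Checkbox, Button, TextField, and Toggle are named above, so the remaining mappings and all weight values below are illustrative placeholders rather than the paper's configuration:

```python
# Subset of the 13-way unified taxonomy; only these four classes are named in the text.
UNIFIED_CLASSES = ["Checkbox", "Button", "TextField", "Toggle"]

# Illustrative platform-native labels mapped to the shared classes.
PLATFORM_LABEL_MAP = {
    "android": {"android.widget.Button": "Button", "android.widget.CheckBox": "Checkbox"},
    "web":     {"button": "Button", "input[type=checkbox]": "Checkbox", "input[type=text]": "TextField"},
    "ios":     {"UIButton": "Button", "UISwitch": "Toggle", "UITextField": "TextField"},
}

# Up-weighting of under-represented platforms; the values are placeholders.
PLATFORM_LOSS_WEIGHT = {"iphone": 1.0, "android": 1.0, "web": 1.0, "ipad": 2.0, "appletv": 2.0}

def unify_label(platform: str, raw_label: str) -> str:
    """Map a platform-native widget label to the unified class, falling back to the raw label."""
    return PLATFORM_LABEL_MAP.get(platform, {}).get(raw_label, raw_label)

print(unify_label("android", "android.widget.CheckBox"))  # Checkbox
```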
b. Adaptive High-Resolution Encoding
To reconcile computational efficiency with fine-grained feature extraction, the Adaptive N-Gridding module adapts the tile grid geometry to the underlying screenshot's aspect ratio and resolution. For an input of dimensions $W \times H$ and a base crop size $s$ (the vision encoder's native input resolution), the ideal (real-valued) tile counts are computed as
$$N_w = \frac{W}{s}, \qquad N_h = \frac{H}{s}.$$
Integer grid counts $(n_w, n_h)$ are then selected, subject to a compute budget $n_w \cdot n_h \le N_{\max}$, to minimize the joint distortion
$$\left|\frac{n_w}{N_w} - 1\right| + \left|\frac{n_h}{N_h} - 1\right|,$$
ensuring minimal deviation from the native image structure. Each grid cell is rescaled to $s \times s$ before encoding, retaining critical local features and spatial consistency.
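A minimal sketch of this selection rule, assuming the distortion criterion above; the crop size of 336 matches CLIP ViT-L/14's input resolution, while the tile budget of 16 is an illustrative placeholder:

```python
def adaptive_n_grid(width: int, height: int, crop_size: int = 336, max_tiles: int = 16):
    """Pick an integer grid (n_w, n_h) approximating the ideal real-valued tile counts.

    Enumerates integer grids within the tile budget and keeps the one with the
    smallest joint deviation from the ideal counts N_w = W/s and N_h = H/s.
    """
    ideal_w = width / crop_size
    ideal_h = height / crop_size
    best, best_cost = (1, 1), float("inf")
    for n_w in range(1, max_tiles + 1):
        for n_h in range(1, max_tiles + 1):
            if n_w * n_h > max_tiles:
                continue  # respect the compute budget
            cost = abs(n_w / ideal_w - 1.0) + abs(n_h / ideal_h - 1.0)
            if cost < best_cost:
                best, best_cost = (n_w, n_h), cost
    return best

# Example: a 2048x2732 iPad screenshot vs. a 1920x1080 Apple TV frame.
print(adaptive_n_grid(2048, 2732))  # (2, 8): more vertical tiles for the tall screen
print(adaptive_n_grid(1920, 1080))  # (5, 3): wider grid for the landscape frame
```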
c. Generation of Advanced Instruction Data
Sophisticated, user-centric tasks are synthesized using GPT-4o in conjunction with Set-of-Mark (SoM) visual prompting. SoM overlays each widget with a uniquely colored corner marker and numeric tag in the screenshot. Given these annotated images and templated task prompts, GPT-4o generates:
- Comprehensive Descriptions: A global UI summary with granular breakdown by region or widget.
- Multi-Round Perception QA: Layered queries and follow-up questions regarding UI functionality, state, or layout (e.g., “What does the ‘Filter’ button do?”).
- Multi-Round Interaction QA: User intent instructions (e.g., “Please help me confirm submission”), requiring the model to specify widget tags for action rather than locations.
This procedure yields a Core-set comprising 528,000 iPhone screenshots, 321,000 web pages, 19,000 iPad screenshots, and 16,000 Apple TV screenshots, with an enriched advanced-task sampling protocol (three advanced tasks sampled per iPad/Apple TV screenshot, one per phone/web screenshot).
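A minimal sketch of the SoM-style visual prompting described above, using Pillow; the marker geometry, colors, and prompt wording are illustrative choices, not the paper's exact pipeline:

```python
from PIL import Image, ImageDraw

def overlay_set_of_marks(screenshot: Image.Image, widget_boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered marker on each widget so the generator model can refer to widgets by tag."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    colors = ["red", "blue", "green", "orange", "purple"]
    for idx, (x1, y1, x2, y2) in enumerate(widget_boxes, start=1):
        color = colors[(idx - 1) % len(colors)]
        draw.rectangle([x1, y1, x2, y2], outline=color, width=3)   # widget outline
        draw.rectangle([x1, y1, x1 + 28, y1 + 20], fill=color)     # colored corner tab
        draw.text((x1 + 6, y1 + 4), str(idx), fill="white")        # numeric tag
    return marked

# A templated task prompt referring to widgets by their SoM tags (wording is illustrative).
PROMPT_TEMPLATE = (
    "The screenshot shows {n} numbered widgets. "
    "Generate a multi-round interaction QA: the user states an intent and the "
    "assistant answers with the tag of the widget to act on."
)

img = Image.new("RGB", (828, 1792), "white")  # stand-in screenshot
marked = overlay_set_of_marks(img, [(100, 200, 300, 260), (100, 300, 300, 360)])
print(PROMPT_TEMPLATE.format(n=2))
```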
4. Training Methodology
Ferret-UI 2 is trained in a multi-task supervised framework using the following objectives:
- Token Generation (Classification): For OCR, widget classification, tappability, and answer generation, a standard autoregressive cross-entropy loss is applied over the output tokens:
$$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t}, x\right)$$
- Bounding Box Regression: For grounding tasks, the smooth-L1 loss is used between predicted box coordinates $\hat{b}$ and ground truth $b$ (a combined sketch follows this list):
$$\mathcal{L}_{\text{box}} = \sum_{i \in \{x_1,\, y_1,\, x_2,\, y_2\}} \operatorname{smooth}_{L_1}\!\left(\hat{b}_i - b_i\right)$$
- The aggregate loss is a weighted blend, with weights tuned to address per-platform and per-task imbalances.
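A minimal sketch of this multi-task objective in PyTorch, combining the cross-entropy and smooth-L1 terms above; the weighting scheme and values are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def multitask_loss(token_logits, token_targets, box_pred, box_target,
                   platform_weight: float = 1.0, box_weight: float = 1.0):
    """Weighted blend of the token-generation and box-regression objectives.

    token_logits: (N, vocab) logits for the generated answer tokens
    token_targets: (N,) target token ids
    box_pred / box_target: (M, 4) normalized box coordinates for grounding samples
    """
    ce = F.cross_entropy(token_logits, token_targets)   # token generation loss
    box = F.smooth_l1_loss(box_pred, box_target)        # bounding-box regression loss
    return platform_weight * (ce + box_weight * box)

# Toy example with random tensors.
logits = torch.randn(10, 32000)
targets = torch.randint(0, 32000, (10,))
boxes_p = torch.rand(4, 4)
boxes_g = torch.rand(4, 4)
print(multitask_loss(logits, targets, boxes_p, boxes_g, platform_weight=2.0).item())
```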
5. Empirical Evaluation
a. Benchmark Datasets and Tasks
Ferret-UI 2 is evaluated using several structured benchmarks:
- Core Benchmarks: 5 platforms × 9 subtasks. Elementary tasks include OCR, widget classification, tappability, widget listing, and search (by text or type). Advanced tasks encompass comprehensive UI description, multi-round perception QA, and multi-round interaction QA.
- GUIDE Dataset: Live web page next-action prediction.
- GUI-World: Multi-platform (iOS, Android, Web; zero-shot) benchmark.
b. Evaluation Metrics
A variety of standard metrics quantify performance:
- Accuracy: For referring and classification tasks,
$$\text{Acc} = \frac{\#\ \text{correct predictions}}{\#\ \text{total samples}}$$
- Intersection-over-Union (IoU): For spatial grounding,
$$\text{IoU} = \frac{\left|B_{\text{pred}} \cap B_{\text{gt}}\right|}{\left|B_{\text{pred}} \cup B_{\text{gt}}\right|}$$
- Multi-IoU: Average IoU across all matched prediction–ground-truth pairs, counting unmatched pairs as 0 (see the sketch after this list).
- GPT-4o Score: Automated scalar rating (0–100 or 0–5) assigned by GPT-4o acting as a judge.
- BertScore: For semantic similarity in GUIDE actions.
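A minimal sketch of the grounding metrics above; the Multi-IoU matching here simply pairs boxes by list position, which may differ from the paper's exact matching procedure:

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def multi_iou(pred_boxes, gt_boxes) -> float:
    """Average IoU over index-aligned matches; unmatched predictions or ground truths count as 0."""
    n = max(len(pred_boxes), len(gt_boxes))
    if n == 0:
        return 0.0
    matched = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(matched) / n   # unmatched pairs contribute 0 to the numerator

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))                               # 0.142857...
print(multi_iou([(0, 0, 10, 10)], [(0, 0, 10, 10), (20, 20, 30, 30)]))   # 0.5
```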
c. Principal Results
Performance is summarized in the following table (averaged over five platforms):
| Model | Elem-Refer (%) | Elem-Ground (%) | Adv GPT-4o Score | Adv Multi-IoU (%) | GUIDE BertScore | GUIDE IoU (%) |
|---|---|---|---|---|---|---|
| Ferret-UI (Vicuna) | 64.15 | 57.22 | 45.81 | 18.75 | 41.15 | 26.91 |
| Ferret-UI 2 | 81.34 | 81.31 | 89.73 | 41.71 | 91.37 | 55.78 |
| GPT-4o (no SoM) | 56.47 | 12.14 | 77.73 | 7.06 | 75.31 | 9.64 |
| GPT-4o + SoM prompt | 87.91 | — | 84.33 | 7.36 | — | — |
On the GUI-World zero-shot benchmark, Ferret-UI 2 (Vicuna-13B) scores 2.948, outperforming Ferret-UI’s 2.638.
These results indicate that Ferret-UI 2 achieves substantial improvements over both its predecessor and generic large models, particularly in advanced, interactive tasks and cross-platform scenarios.
6. Cross-Platform Transfer and Generalization
Cross-platform zero-shot transfer was explored by training on one platform and evaluating on others:
- Models trained on iPhone data (112,000 images) transfer robustly to iPad (68.1% refer, 65.2% ground) and Android (71.2% refer, 63.1% ground), attributable to similar portrait layouts and data scale.
- Models trained on Apple TV or Web transfer poorly to mobile platforms (maximum 59% refer, 54% ground), due to landscape-oriented layouts and dissimilar content.
- Thus, iPhone, iPad, and Android constitute a transfer-friendly cluster, while Apple TV and Web are more isolated. This suggests that shared layout conventions and annotation densities play a critical role in transfer efficacy.
7. Conclusions and Prospective Directions
Ferret-UI 2 demonstrates that a single MLLM can robustly and universally understand diverse UI platforms by leveraging:
- Platform-agnostic training on unified widget spaces,
- Adaptive N-gridding for efficient high-resolution encoding,
- Human-verified annotations and high-quality advanced-task training data generated with visual prompting.
It approximately doubles advanced-task performance over Ferret-UI, achieves strong cross-platform zero-shot results, and establishes a new state of the art on GUI-World. Notable limitations include a focus on ASCII text (excluding other scripts) and under-represented coverage of certain device types (iPad, Apple TV). The authors propose future extensions encompassing desktop GUIs, multilingual annotation support, and integration of reinforcement learning frameworks for multi-step UI automation.