Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (2410.18967v1)

Published 24 Oct 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal LLM (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks × 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.

Overview of Ferret-UI 2: Advancements in Universal UI Understanding

The paper presents Ferret-UI 2, a multimodal LLM (MLLM) that enhances universal user interface (UI) understanding across diverse platforms. Ferret-UI 2 builds upon the original Ferret-UI by addressing significant limitations such as platform diversity, resolution variation, and data constraints. The model is specifically designed to operate across iPhones, Android devices, iPads, webpages, and AppleTV, introducing key innovations to improve adaptability and performance.

Key Innovations

Ferret-UI 2 stands out with three main advancements:

  1. Multi-Platform Support: The model extends compatibility beyond mobile devices to include a wider range of platforms. This feature allows it to scale and adapt seamlessly across different user environments, a crucial factor considering today's diverse platform landscape.
  2. Adaptive High-Resolution Perception: An enhanced adaptive gridding mechanism lets Ferret-UI 2 perceive UI screens at their native resolution. Building on the Any-Resolution (AnyRes) approach, it encodes high-resolution screenshots efficiently while preserving fine-grained visual elements (a rough sketch of the gridding idea follows this list).
  3. Advanced Data Generation: Training data is generated with GPT-4o using set-of-mark visual prompting, which overlays visible marks on UI elements so the teacher model can reason about them spatially. This addresses the limitations of purely text-based prompting and yields higher-quality, more spatially grounded training data (a schematic of the marking step also follows this list).
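
A rough way to picture the adaptive gridding behind item 2: choose a tile grid whose shape matches the screen, resize the screenshot to fill that grid, and encode each tile alongside a low-resolution global view. The sketch below is a minimal illustration of that idea only; the tile size, candidate grids, and function names are assumptions, not the paper's implementation.

```python
# Minimal sketch of AnyRes-style adaptive gridding (illustrative only).
# TILE and CANDIDATE_GRIDS are assumed values, not the paper's settings.
from PIL import Image

TILE = 336  # assumed per-tile encoder resolution
CANDIDATE_GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]

def pick_grid(width: int, height: int) -> tuple[int, int]:
    """Pick the (rows, cols) grid whose aspect ratio best matches the screen,
    so tall phone screens and wide TV screens both keep their detail."""
    aspect = width / height
    return min(CANDIDATE_GRIDS, key=lambda g: abs(g[1] / g[0] - aspect))

def encode_views(screenshot: Image.Image) -> list[Image.Image]:
    """Return a low-resolution global view plus high-resolution tiles."""
    rows, cols = pick_grid(*screenshot.size)
    resized = screenshot.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return [screenshot.resize((TILE, TILE))] + tiles
```

Each returned image would then go through the visual encoder; the global view preserves overall layout context while the tiles preserve small text and icons.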
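
Item 3's set-of-mark visual prompting can be illustrated as drawing numbered marks on detected UI elements so the teacher model (GPT-4o in the paper) can refer to each element by its visible index when generating descriptions and task data. The snippet below is only a schematic of that idea under assumed box formats and prompt wording; it is not the paper's data-generation pipeline.

```python
# Schematic of set-of-mark visual prompting (illustrative assumptions only):
# number each UI element on the screenshot so a text prompt can refer to
# elements by their visible index.
from PIL import Image, ImageDraw

Box = tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates

def draw_marks(screenshot: Image.Image, boxes: list[Box]) -> Image.Image:
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(idx), fill="red")  # the visible "mark"
    return marked

def build_prompt(num_elements: int) -> str:
    # The marked screenshot plus a prompt along these lines would be sent to
    # the teacher model; the wording here is a hypothetical example.
    return (f"The screenshot shows {num_elements} UI elements labeled 1..{num_elements}. "
            "For each label, describe the element and propose a user task involving it.")
```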

Empirical Evidence

Comprehensive empirical evaluations demonstrate the significant performance improvement of Ferret-UI 2 over its predecessor across multiple benchmarks. Key results include:

  • Superior performance on referring, grounding, and user-centric advanced tasks, with strong cross-platform transfer capabilities (a sketch of a typical grounding metric follows these bullets).
  • Higher accuracy on GUIDE next-action prediction and competitive results against strong baselines such as GPT-4o.
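
For context on how referring and grounding outputs are commonly scored in this line of work (a convention, not necessarily the paper's exact protocol), a predicted bounding box is typically counted as correct when its intersection-over-union with the ground-truth box exceeds a threshold such as 0.5:

```python
# Hedged sketch of a standard grounding metric; the (x1, y1, x2, y2) box
# format and the 0.5 threshold are conventional assumptions.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth passes thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```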

Cross-Platform Transferability

The model exhibits remarkable cross-platform generalization. Tests on platform-specific tasks reveal its ability to maintain high performance across varied resolutions and UI layouts. The adaptive gridding technique plays a pivotal role in retaining critical visual information while optimizing computational efficiency.

Implications and Future Directions

Ferret-UI 2 represents a significant stride toward a universally applicable UI understanding model. Its ability to handle diverse platforms makes it promising for integration into complex ecosystems, particularly intelligent interfaces and assistive systems.

Looking forward, extending the model to incorporate additional platforms and refining its capabilities for complex multi-step UI interactions could result in even more versatile applications. The integration of more diverse datasets and further enhancement of the adaptive scaling mechanisms may unlock broader applicability in real-world scenarios.

In conclusion, Ferret-UI 2 exhibits considerable potential in advancing the field of user interface understanding. It effectively addresses the challenges posed by platform diversity and resolution variation, setting a strong foundation for future research and development in universal UI navigation systems.

Authors (10)
  1. Zhangheng Li
  2. Keen You
  3. Haotian Zhang
  4. Di Feng
  5. Harsh Agrawal
  6. Xiujun Li
  7. Mohana Prasad Sathya Moorthy
  8. Jeff Nichols
  9. Yinfei Yang
  10. Zhe Gan