Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (2410.18967v1)

Published 24 Oct 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal LLM (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks × 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.

Overview of Ferret-UI 2: Advancements in Universal UI Understanding

The paper presents Ferret-UI 2, a multimodal LLM (MLLM) that enhances universal user interface (UI) understanding across diverse platforms. Ferret-UI 2 builds upon the original Ferret-UI by addressing significant limitations such as platform diversity, resolution variation, and data constraints. The model is specifically designed to operate across iPhones, Android devices, iPads, webpages, and AppleTV, introducing key innovations to improve adaptability and performance.

Key Innovations

Ferret-UI 2 stands out with three main advancements:

  1. Multi-Platform Support: The model extends compatibility beyond mobile devices to include a wider range of platforms. This feature allows it to scale and adapt seamlessly across different user environments, a crucial factor considering today's diverse platform landscape.
  2. Adaptive High-Resolution Perception: An enhanced adaptive gridding mechanism lets Ferret-UI 2 perceive UI screens at their native resolution. Building on the Any-Resolution (AnyRes) approach, it encodes high-resolution screenshots efficiently while preserving fine-grained visual elements (a rough sketch of the gridding idea follows this list).
  3. Advanced Data Generation: Training data is generated with GPT-4o using set-of-mark visual prompting, which overlays visible marks on UI elements so the teacher model can reason about them spatially. This addresses the limitations of purely text-based prompting and yields higher-quality, more spatially grounded training data (a schematic of the marking step also follows this list).
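
A rough way to picture the adaptive gridding behind item 2: choose a tile grid whose shape matches the screen, resize the screenshot to fill that grid, and encode each tile alongside a low-resolution global view. The sketch below is a minimal illustration of that idea only; the tile size, candidate grids, and function names are assumptions, not the paper's implementation.

```python
# Minimal sketch of AnyRes-style adaptive gridding (illustrative only).
# TILE and CANDIDATE_GRIDS are assumed values, not the paper's settings.
from PIL import Image

TILE = 336  # assumed per-tile encoder resolution
CANDIDATE_GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]

def pick_grid(width: int, height: int) -> tuple[int, int]:
    """Pick the (rows, cols) grid whose aspect ratio best matches the screen,
    so tall phone screens and wide TV screens both keep their detail."""
    aspect = width / height
    return min(CANDIDATE_GRIDS, key=lambda g: abs(g[1] / g[0] - aspect))

def encode_views(screenshot: Image.Image) -> list[Image.Image]:
    """Return a low-resolution global view plus high-resolution tiles."""
    rows, cols = pick_grid(*screenshot.size)
    resized = screenshot.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return [screenshot.resize((TILE, TILE))] + tiles
```

Each returned image would then go through the visual encoder; the global view preserves overall layout context while the tiles preserve small text and icons.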
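
Item 3's set-of-mark visual prompting can be illustrated as drawing numbered marks on detected UI elements so the teacher model (GPT-4o in the paper) can refer to each element by its visible index when generating descriptions and task data. The snippet below is only a schematic of that idea under assumed box formats and prompt wording; it is not the paper's data-generation pipeline.

```python
# Schematic of set-of-mark visual prompting (illustrative assumptions only):
# number each UI element on the screenshot so a text prompt can refer to
# elements by their visible index.
from PIL import Image, ImageDraw

Box = tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates

def draw_marks(screenshot: Image.Image, boxes: list[Box]) -> Image.Image:
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(idx), fill="red")  # the visible "mark"
    return marked

def build_prompt(num_elements: int) -> str:
    # The marked screenshot plus a prompt along these lines would be sent to
    # the teacher model; the wording here is a hypothetical example.
    return (f"The screenshot shows {num_elements} UI elements labeled 1..{num_elements}. "
            "For each label, describe the element and propose a user task involving it.")
```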

Empirical Evidence

Comprehensive empirical evaluations demonstrate the significant performance improvement of Ferret-UI 2 over its predecessor across multiple benchmarks. Key results include:

  • Superior performance on referring, grounding, and user-centric advanced tasks, with strong cross-platform transfer capabilities (a sketch of a typical grounding metric follows these bullets).
  • Higher accuracy on GUIDE next-action prediction and competitive results against strong baselines such as GPT-4o.
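
For context on how referring and grounding outputs are commonly scored in this line of work (a convention, not necessarily the paper's exact protocol), a predicted bounding box is typically counted as correct when its intersection-over-union with the ground-truth box exceeds a threshold such as 0.5:

```python
# Hedged sketch of a standard grounding metric; the (x1, y1, x2, y2) box
# format and the 0.5 threshold are conventional assumptions.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth passes thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```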

Cross-Platform Transferability

The model exhibits remarkable cross-platform generalization. Tests on platform-specific tasks reveal its ability to maintain high performance across varied resolutions and UI layouts. The adaptive gridding technique plays a pivotal role in retaining critical visual information while optimizing computational efficiency.

Implications and Future Directions

Ferret-UI 2 represents a significant stride toward a universally applicable UI understanding model. Its ability to handle diverse platforms makes it promising for integration into complex ecosystems, particularly intelligent interfaces and assistive systems.

Looking forward, extending the model to incorporate additional platforms and refining its capabilities for complex multi-step UI interactions could result in even more versatile applications. The integration of more diverse datasets and further enhancement of the adaptive scaling mechanisms may unlock broader applicability in real-world scenarios.

In conclusion, Ferret-UI 2 exhibits considerable potential in advancing the field of user interface understanding. It effectively addresses the challenges posed by platform diversity and resolution variation, setting a strong foundation for future research and development in universal UI navigation systems.

Authors (10)
  1. Zhangheng Li
  2. Keen You
  3. Haotian Zhang
  4. Di Feng
  5. Harsh Agrawal
  6. Xiujun Li
  7. Mohana Prasad Sathya Moorthy
  8. Jeff Nichols
  9. Yinfei Yang
  10. Zhe Gan