Aria-UI: Visual Grounding for GUI Instructions (2412.16256v1)

Published 20 Dec 2024 in cs.HC and cs.AI

Abstract: Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.

Overview of Aria-UI: Visual Grounding for GUI Instructions

This paper addresses the challenges faced by digital agents that automate tasks within Graphical User Interfaces (GUIs) across heterogeneous platforms such as web, desktop, and mobile environments. The central problem is grounding: mapping a language instruction to the target GUI element it refers to. Existing approaches rely on HTML or AXTree inputs, which limits their efficiency, accuracy, and adaptability across diverse task environments.

The authors introduce Aria-UI, a large multimodal model tailored specifically for GUI grounding. Aria-UI adopts a pure-vision methodology, avoiding the drawbacks associated with reliance on HTML or AXTree inputs. A further contribution is a scalable data pipeline that synthesizes a large volume of high-quality, diverse grounding instruction samples.
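The paper describes this pipeline at the level of stages rather than code. As a rough, minimal Python sketch, one synthesis step might look like the following; the two-stage structure (caption a GUI element, then rewrite the caption into varied instructions) follows the pipeline described below, but all function names, the placeholder returns standing in for actual LMM calls, and the sample format are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ElementSample:
    """One synthesized grounding sample: a screenshot, an instruction, and the target box."""
    screenshot_path: str
    instruction: str
    bbox: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def caption_element(screenshot_path: str, bbox: Tuple[float, float, float, float]) -> str:
    # Placeholder for an LMM call that describes the element cropped at `bbox`;
    # a real pipeline would query a captioning model here.
    return "the blue 'Sign in' button in the top-right corner"

def diversify_instructions(caption: str, n: int = 3) -> List[str]:
    # Placeholder for an LMM call that rewrites one caption into n varied,
    # natural-language grounding instructions.
    return [f"{caption} (phrasing variant {i + 1})" for i in range(n)]

def synthesize_samples(screenshot_path: str,
                       element_boxes: List[Tuple[float, float, float, float]]) -> List[ElementSample]:
    """Turn raw (screenshot, element box) pairs into diverse grounding samples."""
    samples: List[ElementSample] = []
    for bbox in element_boxes:
        caption = caption_element(screenshot_path, bbox)
        for instruction in diversify_instructions(caption):
            samples.append(ElementSample(screenshot_path, instruction, bbox))
    return samples

if __name__ == "__main__":
    # In practice the element boxes would come from rendered web pages (e.g. Common Crawl).
    boxes = [(0.82, 0.03, 0.95, 0.08)]
    for sample in synthesize_samples("page_0001.png", boxes):
        print(sample.instruction, "->", sample.bbox)
```

In the real pipeline, the placeholder calls would be served by the captioning and instruction-generation LMMs mentioned in the contributions below, and the resulting samples span web, desktop, and mobile screenshots.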

Core Contributions

  1. Pure-Vision Grounding Approach: Aria-UI leverages a purely visual strategy, thereby enhancing flexibility and eliminating reliance on platform-specific inputs like HTML or AXTree that frequently vary in quality and coverage.
  2. Data Synthesis Pipeline: The work proposes a scalable pipeline for generating diverse grounding data. Large Multimodal Models (LMMs) caption GUI elements and then produce varied instructions for each element, drawing on an extensive web data source (Common Crawl). The pipeline yields multimodal training samples across web, desktop, and mobile platforms, which is crucial for training robust vision-centric grounding models.
  3. Context-Aware Training: Aria-UI incorporates dynamic context awareness by integrating action histories, in either text-only or text-image-interleaved form, to improve grounding in real-world scenarios where tasks are complex and involve multi-step interactions (a sketch of this interleaved input follows the list).
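To make the interleaved-history idea in contribution 3 concrete, the sketch below assembles a prompt from prior action descriptions (optionally paired with their screenshots), the current screenshot, and the planner's instruction, then asks the model for a click point. The message format, function names, and normalized-coordinate convention are assumptions made for illustration; they are not taken from the released Aria-UI interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ActionStep:
    """One prior step in the episode: what was done, plus (optionally) the
    screenshot observed at that step, for the text-image interleaved variant."""
    description: str                      # e.g. "clicked the search box"
    screenshot_path: Optional[str] = None

@dataclass
class GroundingRequest:
    instruction: str                      # current planner instruction
    screenshot_path: str                  # current screen
    history: List[ActionStep] = field(default_factory=list)

def build_prompt(request: GroundingRequest) -> List[dict]:
    """Assemble an interleaved prompt: past actions (text, optionally with images),
    then the current screenshot and the instruction to ground."""
    parts: List[dict] = []
    for i, step in enumerate(request.history):
        parts.append({"type": "text", "text": f"Step {i + 1}: {step.description}"})
        if step.screenshot_path:          # include the image only in the interleaved mode
            parts.append({"type": "image", "path": step.screenshot_path})
    parts.append({"type": "image", "path": request.screenshot_path})
    parts.append({"type": "text",
                  "text": f"Instruction: {request.instruction}\n"
                          "Return the target element's normalized (x, y) click point."})
    return parts

def ground(request: GroundingRequest) -> Tuple[float, float]:
    # Placeholder for the model call; a real system would send `build_prompt(request)`
    # to the grounding model and parse the predicted coordinates from its output.
    _ = build_prompt(request)
    return (0.47, 0.12)  # dummy normalized click point

if __name__ == "__main__":
    request = GroundingRequest(
        instruction="Open the notifications panel",
        screenshot_path="screen_current.png",
        history=[ActionStep("launched the settings app", "screen_step_1.png")],
    )
    print(ground(request))
```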

Empirical Evaluation

Aria-UI establishes new state-of-the-art grounding performance across both offline and online agent benchmarks. In single-step grounding on the ScreenSpot benchmark, Aria-UI achieves an average accuracy of 82.4%, the highest among the compared methods. Its context-aware variants further demonstrate superior performance on dynamic, multi-step benchmarks such as AndroidControl and GUI-Odyssey. Incorporating historical action data improves both grounding accuracy and task success rates, indicating that dynamic contextual information is critical for real-world applicability.

Practical and Theoretical Implications

Practically, the Aria-UI framework represents a significant step toward fully autonomous GUI agents capable of effective interaction across multiple platforms. The pure-vision approach aligns with the industry trend toward platform-agnostic solutions that do not depend on manually curated or platform-specific data formats such as HTML or AXTree. Theoretically, the work underscores the value of a data-centric approach for advancing LMM performance in domain-specific applications, encouraging further exploration of synthesized, diverse training data for complex UI scenarios.

Future Directions

The paper hints at future work integrating planning and grounding capabilities within a single model, which would reduce dependence on external planners and improve the adaptability of agents in unforeseen or complex task environments. There is also potential for improving LMMs' ability to correct planner-generated instruction errors during complex task execution, a critical requirement for real-time applications. Given the rapid evolution of AI models, extending Aria-UI's foundation to broader and even more diverse GUI environments is a promising direction for future research.

Through its innovative approach, Aria-UI sets a new standard for grounding tasks, showcasing how AI models can be further aligned with the intricacies of multimodal and dynamic interactions inherent in human-computer interfaces.

Authors (7)
  1. Yuhao Yang (23 papers)
  2. Yue Wang (675 papers)
  3. Dongxu Li (40 papers)
  4. Ziyang Luo (35 papers)
  5. Bei Chen (56 papers)
  6. Chao Huang (244 papers)
  7. Junnan Li (56 papers)