- The paper introduces A3, a novel platform that evaluates mobile GUI agents using dynamic tasks and automated LLM-based assessments.
- It details an architecture with a controller, translator, and evaluator that drives realistic interactions with widely used mobile apps.
- Experiments reveal that strong performance on traditional static evaluations does not transfer to dynamic scenarios, where coordinate prediction and error correction remain key challenges.
Overview of "A3: Android Agent Arena for Mobile GUI Agents" (2501.01149)
Android Agent Arena (A3) is a platform for evaluating mobile GUI agents on real-world tasks in live app environments. It addresses the shortcomings of existing evaluation systems by bringing dynamic, more complex tasks into agent assessment.
Introduction
Recent advances in LLMs have propelled the development of mobile GUI agents that perform tasks autonomously on mobile devices. Traditional assistants such as Siri and Bixby fall short on complex tasks because they depend on predefined APIs. GUI agents have therefore been proposed to harness multimodal LLMs (MLLMs) and operate across third-party applications without relying on such APIs. Existing datasets for evaluating these agents are largely static and cannot test dynamic interactions, which motivated the development of A3.
A3 System Architecture
A3 is designed to act as an intermediary between the GUI agent and its environment (Figure 1). It comprises a controller to maintain device state, a translator to convert agent predictions into commands, and an evaluator for performance assessment. The architecture supports integration with widely-used third-party applications, allowing for expansive and varied task scenarios, which better reflect realistic operating conditions.
Figure 1: Overview of Android Agent Arena. A3 contains over 200 tasks from 21 widely used apps. Tasks are categorized into operation, single-frame query, and multi-frame query based on the task goal.
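The sketch below illustrates how such a controller–translator–evaluator loop could be wired together over `adb`. The class names, the agent's `predict()` interface, and the specific shell commands are illustrative assumptions for exposition, not A3's actual implementation.

```python
import subprocess

class Controller:
    """Tracks device state by dumping the UI hierarchy and a screenshot over adb."""
    def get_state(self) -> dict:
        subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"], check=True)
        subprocess.run(["adb", "pull", "/sdcard/ui.xml", "ui.xml"], check=True)
        with open("screen.png", "wb") as f:
            subprocess.run(["adb", "exec-out", "screencap", "-p"], stdout=f, check=True)
        return {"xml_path": "ui.xml", "screenshot_path": "screen.png"}

class Translator:
    """Converts an agent's predicted action into a concrete adb shell command."""
    def to_command(self, action: dict) -> list:
        if action["type"] == "click":
            x, y = action["coordinate"]
            return ["adb", "shell", "input", "tap", str(x), str(y)]
        if action["type"] == "type":
            return ["adb", "shell", "input", "text", action["text"]]
        if action["type"] == "back":
            return ["adb", "shell", "input", "keyevent", "KEYCODE_BACK"]
        raise ValueError(f"unsupported action type: {action['type']}")

def run_episode(agent, task, evaluator, max_steps: int = 30) -> bool:
    """Drive one task: query the agent, execute its action, then score the outcome."""
    controller, translator = Controller(), Translator()
    for _ in range(max_steps):
        state = controller.get_state()
        action = agent.predict(task, state)      # any GUI agent exposing a predict() method
        if action["type"] == "finish":
            break
        subprocess.run(translator.to_command(action), check=True)
    return evaluator.check(task, controller.get_state())
```

Keeping the translation step separate from the agent is one way the same execution loop can accommodate agents that emit different action formats.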
Action Space and Task Distribution
A3 expands the action space beyond existing frameworks to cover a comprehensive set of interactions, so that agents trained with different methodologies and action formats can be evaluated on a common footing. This includes adding actions that current platforms lack.
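As a rough illustration of what such an expanded action space can look like, the following sketch lists common mobile GUI primitives; the exact action names and fields in A3 may differ.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class ActionType(Enum):
    # Common primitives in mobile GUI agent action spaces; an illustrative
    # superset, not A3's exact definition.
    CLICK = auto()
    LONG_PRESS = auto()
    SWIPE = auto()
    TYPE = auto()
    PRESS_BACK = auto()
    PRESS_HOME = auto()
    PRESS_ENTER = auto()
    WAIT = auto()
    FINISH = auto()          # declare the task complete (optionally with an answer)

@dataclass
class Action:
    type: ActionType
    coordinate: Optional[Tuple[int, int]] = None   # CLICK / LONG_PRESS target
    direction: Optional[str] = None                # SWIPE: "up", "down", "left", "right"
    text: Optional[str] = None                     # TYPE input or FINISH answer
```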
Tasks in A3 are organized into operation, single-frame query, and multi-frame query categories (Figure 2). This categorization enables nuanced evaluation across task complexity and interaction requirements, testing an agent's ability to adapt dynamically.

Figure 2: Distribution of tasks in A3. The top subfigure shows the distribution by category; the bottom subfigure shows the distribution by difficulty level.
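A task record capturing this categorization might look like the sketch below; the field names, difficulty labels, and the example app and instruction are hypothetical, chosen only to mirror the breakdown in Figure 2.

```python
from dataclasses import dataclass

@dataclass
class Task:
    app: str            # one of the 21 third-party apps
    instruction: str    # natural-language goal given to the agent
    category: str       # "operation", "single_frame_query", or "multi_frame_query"
    difficulty: str     # difficulty level as in Figure 2 (labels assumed here)

example = Task(
    app="ExampleShoppingApp",   # hypothetical app name
    instruction="Search for wireless earbuds and report the price of the first result.",
    category="multi_frame_query",
    difficulty="medium",
)
```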
Evaluation Strategies
A3 employs two distinct evaluation methodologies:
- Task-Specific Evaluation Functions: Hand-written functions parse the device's XML UI tree and use action or element matching to determine whether a task was completed.
- LLM Evaluation System: Business-level LLMs such as GPT and Gemini automate evaluation, reducing the need for manual involvement and coding expertise. Evaluation code generated by GPT-4o showed promising reductions in workload with high correctness, though human oversight may still be needed for edge cases.
The automated LLM-based evaluation aims to remove the bottleneck of human labor in authoring evaluation tasks; both approaches are sketched below.
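The sketch below shows a minimal version of both evaluation paths, assuming a uiautomator-style XML dump and the OpenAI Python client for the LLM judge; the matching criteria and prompt wording are illustrative, not A3's actual evaluation code.

```python
import base64
import xml.etree.ElementTree as ET
from openai import OpenAI   # assumes the `openai` Python package is installed

def element_match_eval(xml_path: str, attrib: str, expected: str) -> bool:
    """Task-specific check: search the dumped UI tree for a node whose
    attribute contains an expected value (e.g. the text of a confirmation banner)."""
    root = ET.parse(xml_path).getroot()
    return any(expected in node.attrib.get(attrib, "") for node in root.iter("node"))

def llm_judge_eval(task_instruction: str, screenshot_path: str) -> bool:
    """LLM-as-judge check: ask a multimodal model whether the final screen
    shows that the task has been completed. Prompt wording is illustrative."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task_instruction}\n"
                         "Does this final screenshot show the task was completed? "
                         "Answer only YES or NO."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

The task-specific functions encode precise per-task success criteria, while the LLM judge trades some precision for much lower authoring effort, which is why human oversight may still be needed for edge cases.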
Experiments and Findings
Experiments show that agents fine-tuned on existing datasets perform well on static evaluations but struggle in dynamic, real-world scenarios (Tables 1 and 2). The primary issues were dependence on action history and an inability to self-correct after errors.
Business-level LLMs could partially address these challenges, though they struggled to handle the full action space. Adding planning capabilities, as in AppAgent, yielded better performance.
Common Error Cases
Several recurrent error patterns appear:
- Incorrect click coordinates, resulting in mis-clicks.
- Performing actions that are invalid when the expected UI cue is absent.
- Triggering actions without first selecting the intended element (e.g., typing before the search bar is focused, as in Figure 3).
- Failing to terminate the episode once the task is complete.
Figure 3: Steps 1 and 2 are correct; however, the agent starts typing before the search bar has been clicked or selected, so the process stalls and the agent keeps typing and waiting.
Figure 4: Steps 1 and 2 are correct; however, the agent predicts a wrong click coordinate and accidentally opens the shopping cart. It should navigate back but does not, indicating a lack of self-correction, and remains stuck in the shopping cart.
Limitations
The main limitations are that task evaluations are tied to specific app versions and that sub-goal (partial) completion is not assessed. Future work should refine the LLM-based evaluation to remove these constraints and more faithfully reflect agent capabilities in real-world scenarios.
Conclusion
A3 offers a comprehensive and extensible evaluation platform for mobile GUI agents. By moving tasks and evaluation beyond static assessments, it supports the design of more adaptable and effective agents that build on current AI advancements. Future improvements may focus on diversifying task scenarios and enhancing evaluation precision as LLMs continue to develop.