Introduction
The pursuit of autonomy and adaptability in mobile device agents has gained momentum with the advent of multimodal large language models (MLLMs). However, integrating visual perception into these agents presents significant challenges. Current state-of-the-art MLLMs such as GPT-4V show limitations in connecting semantic understanding with precise visual perception, especially in the context of mobile device operation. Consequently, prior solutions have relied heavily on device-specific system files such as XML, which are often inaccessible. This underscores a critical gap in realizing truly adaptable, system-agnostic mobile agents.
Mobile-Agent Architecture
To close this gap, Mobile-Agent has been developed: a framework that provides an autonomous mobile device agent powered by visual perception. Its architecture centers on a visual perception module consisting of detection and optical character recognition (OCR) models, which allows the agent to analyze and understand the front-end interface of mobile apps from screenshots alone, eliminating the need for access to backend system files. The interplay of these models with the MLLM core enables precise localization of both icons and text, allowing the agent to interact accurately with mobile user interfaces.
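To make the screenshot-only localization idea concrete, the following is a minimal sketch in Python. It assumes an OCR model and an icon detector have already produced labeled bounding boxes; the Region and locate_target names are illustrative placeholders, not the framework's actual interfaces.

```python
# Sketch: turning OCR/detector outputs into a tap coordinate, using only the screenshot.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    label: str                      # recognized text or detected icon description
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixel coordinates

    @property
    def center(self) -> Tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) // 2, (y1 + y2) // 2)


def locate_target(regions: List[Region], target: str) -> Optional[Tuple[int, int]]:
    """Return tap coordinates for `target` by matching it against OCR/detector labels."""
    for region in regions:
        if target.lower() in region.label.lower():
            return region.center
    return None


# Example: OCR found a "Sign in" button; the agent would tap its center.
regions = [Region(label="Sign in", box=(420, 1180, 660, 1250))]
print(locate_target(regions, "sign in"))  # -> (540, 1215)
```

Because the coordinates come from the screenshot itself, the same logic applies to any app or operating system that can be captured as an image.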
The operation space defined within Mobile-Agent covers fundamental actions, including opening apps and tapping text or icons, along with higher-level operations such as navigating back, exiting, and self-termination upon task completion. Importantly, the framework incorporates a self-planning procedure that interprets each screenshot together with the user's instruction and the history of operations. Combined with a self-reflection mechanism, Mobile-Agent can review its actions, correct erroneous steps, and carry complex multi-step tasks through to completion, as sketched below.
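The following is a minimal sketch of that plan-act-reflect loop, assuming hypothetical callables capture_screenshot, query_mllm, and execute for device I/O and the model call; these names and the exact operation list are illustrative, not the framework's real API.

```python
# Sketch: one operation per screenshot, conditioned on the instruction and prior actions.
from typing import Callable, Dict, List

OPERATIONS = ["open_app", "tap_text", "tap_icon", "type", "back", "exit", "stop"]


def run_agent(
    instruction: str,
    capture_screenshot: Callable[[], str],
    query_mllm: Callable[[str, str, List[str]], Dict],
    execute: Callable[[Dict], None],
    max_steps: int = 20,
) -> List[str]:
    """Iteratively plan and execute operations until the task is reported complete."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        # Self-planning: the MLLM sees the instruction, the current screen,
        # and the operation history, and proposes the next operation.
        action = query_mllm(instruction, screenshot, history)

        if action["op"] == "stop":  # self-termination once the task is complete
            break
        execute(action)
        history.append(f"{action['op']}({action.get('arg', '')})")

        # Self-reflection happens implicitly on the next turn: the new screenshot
        # reveals whether the action had the intended effect, and the model can
        # issue a corrective operation (e.g. "back") if it did not.
    return history
```

The loop captures the key design choice: the agent never consults system metadata, only the latest screenshot plus its own textual record of what it has already done.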
Performance Evaluation
The efficacy of Mobile-Agent was evaluated on Mobile-Eval, a benchmark developed for this purpose that spans 10 popular apps and tasks of varying complexity. Evaluation across this benchmark yielded promising results, demonstrating high success rates and operational precision even on multifaceted tasks and multi-app operations. Notably, Mobile-Agent's task completion approached human-level performance, underscoring its potential to transform mobile device interaction.
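As a small illustration of how results of this kind can be aggregated, the sketch below computes a per-task success rate and compares the agent's step count against the steps a human needed. The structure and the numbers are made-up placeholders, not Mobile-Eval's actual metric definitions or data.

```python
# Sketch: aggregating per-task outcomes into a success rate and a human-relative efficiency.
from typing import List, NamedTuple


class TaskResult(NamedTuple):
    completed: bool   # did the agent fulfill the instruction?
    agent_steps: int  # operations the agent executed
    human_steps: int  # operations a human annotator needed


def summarize(results: List[TaskResult]) -> dict:
    success_rate = sum(r.completed for r in results) / len(results)
    # Efficiency relative to a human, averaged over tasks the agent completed.
    completed = [r for r in results if r.completed]
    relative_efficiency = (
        sum(r.human_steps / r.agent_steps for r in completed) / len(completed)
        if completed else 0.0
    )
    return {"success_rate": success_rate, "relative_efficiency": relative_efficiency}


# Placeholder example, not benchmark data:
print(summarize([TaskResult(True, 6, 5), TaskResult(True, 4, 4), TaskResult(False, 9, 6)]))
```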
Discussion and Future Directions
By situating Mobile-Agent within the larger context of LLM-based agents, this work marks a clear step toward enabling LLMs to operate mobile devices capably. Unlike previous agents that augment GPT-4V through device system metadata, Mobile-Agent maintains a purely vision-centric approach, ensuring greater portability and efficiency across operating systems and environments.
The autonomous operational capability of Mobile-Agent positions it as a significant contribution to the field, and Mobile-Eval demonstrates the feasibility of such autonomously guided agents in complex mobile navigation and task execution. The open-sourcing of Mobile-Agent's code and model invites community-wide enhancement and extension, setting the stage for further innovation and application in mobile agent technology.