
MobA: A Two-Level Agent System for Efficient Mobile Task Automation (2410.13757v1)

Published 17 Oct 2024 in cs.MA, cs.AI, cs.CL, and cs.HC

Abstract: Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal LLMs that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. Integrating a Reflection Module allows for efficient task completion and enables the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants.

Overview of "MobA: A Two-Level Agent System for Efficient Mobile Task Automation"

The paper introduces "MobA," an agent system that leverages multimodal LLMs (MLLMs) to enhance mobile task automation. MobA is structured around a two-level architecture comprising a Global Agent (GA) and a Local Agent (LA). This approach addresses the limitations encountered by traditional smart assistants and model-based screen agents, which often falter due to complex interfaces and inadequate decision-making capabilities.

Key Components and Methodology

  1. Two-Level Agent Architecture: The Global Agent interprets commands and plans tasks by breaking them into simpler sub-tasks, whereas the Local Agent executes those sub-tasks through concrete function calls. This division of labor mirrors human cognitive processes, separating deliberate planning from low-level action and improving overall system efficiency.
  2. Task Decomposition and Execution: MobA employs a sophisticated task planning pipeline involving task decomposition, feasibility assessment, and result validation. Tasks are divided into sub-tasks, enabling the agent to handle complex commands through a structured, step-by-step approach. This results in significant improvements in task execution efficiency and completion rates.
  3. Memory Module: MobA incorporates a multi-aspect memory system to enhance adaptability and reduce redundancy by learning from historical experiences. This includes not only task execution data but also user preferences and application-specific knowledge, providing a robust foundation for decision-making.
  4. Double-Reflection Mechanism: This mechanism allows MobA to assess task feasibility before execution and evaluate success afterward, preventing ineffective actions and facilitating error correction.

Evaluation and Results

The paper reports MobA's evaluation using "MobBench," a test set with 50 real-life tasks varying in complexity. MobA achieved a milestone score rate of 66.2%, outperforming other baseline systems by a substantial margin. This underscores the efficacy of the two-level agent architecture and the integration of MLLM capabilities in task planning and execution.
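The milestone score rate can be understood as the fraction of per-task milestones the agent completes, aggregated over the test set. The sketch below is an assumed formulation for illustration; the exact weighting used by MobBench may differ.

```python
# Illustrative milestone score rate: completed milestones divided by
# total milestones across all tasks. This is an assumed definition,
# not necessarily the exact MobBench formula.
def milestone_score_rate(results):
    """results: list of (milestones_completed, milestones_total) per task."""
    done = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return done / total if total else 0.0
```

Under this definition, an agent completing 2 of 3 milestones on one task and 3 of 3 on another scores 5/6, roughly 83%.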

Implications and Future Work

Theoretically, MobA's approach demonstrates how MLLMs can be effectively utilized in mobile automation tasks, providing a framework for intelligent agent systems that combine structured task decomposition with adaptive learning. Practically, MobA represents a significant advancement in mobile assistants, enhancing their ability to manage complex, real-world tasks.

Future developments could aim to optimize task decomposition algorithms, refine memory retrieval strategies, and enhance the system's capability to handle dynamic mobile environments. Furthermore, as MLLMs continue to evolve, their integration into systems like MobA could expand the potential of mobile assistants, providing more seamless and efficient user experiences.

In summary, MobA goes beyond traditional task automation systems by incorporating advanced reasoning, planning, and memory capabilities, setting a new standard for mobile task automation. This aligns with the growing need for more responsive and intelligent systems in mobile technology.

Authors (11)
  1. Zichen Zhu (17 papers)
  2. Hao Tang (378 papers)
  3. Yansi Li (4 papers)
  4. Kunyao Lan (7 papers)
  5. Yixuan Jiang (9 papers)
  6. Hao Zhou (351 papers)
  7. Yixiao Wang (25 papers)
  8. Situo Zhang (9 papers)
  9. Liangtai Sun (8 papers)
  10. Lu Chen (244 papers)
  11. Kai Yu (201 papers)