Benchmarking Mobile Device Control Agents across Diverse Configurations

Published 25 Apr 2024 in cs.HC, cs.AI, and cs.LG | (2404.16660v2)

Abstract: Mobile device control agents can largely enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing LLMs or multi-modal LLMs as well as agents trained with imitation learning using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness. Our source code is publicly available at https://b-moca.github.io.

Summary

The paper introduces B-MoCA, a benchmark that assesses mobile device control agents on 131 practical tasks under varied device configurations.
It uses rule-based evaluation to compare LLM-based agents, MLLM-based agents, and agents trained with behavioral cloning, highlighting differences in task performance.
The findings reveal performance gaps in complex, multi-step tasks, emphasizing the need for diverse training and improved agent strategies.

Introducing B-MoCA: A Benchmark for Evaluating Mobile Device Control Agents

Overview of B-MoCA

B-MoCA is a benchmark for evaluating mobile device control agents. It runs on Android emulators and tests agents on 131 practical tasks drawn from everyday mobile use. A notable feature is its ability to randomize device configurations, such as UI layouts and language settings, so that an agent's generalization performance can be measured directly. The benchmark includes baseline agents built on LLMs and multi-modal LLMs (MLLMs), as well as agents trained from scratch via behavioral cloning, all evaluated in diverse, randomized mobile environments.

Experimental Design

B-MoCA assesses agents on tasks such as setting alarms, adjusting brightness, and placing emergency calls across different device setups. Success is determined by a rule-based detector that checks whether the task was completed as intended. To test whether agents can operate across varied interfaces, the randomization feature alters icon placements, wallpapers, and other settings, approximating the variety of real-world devices.
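The two mechanisms described here, randomized device configurations and a rule-based success check, can be sketched roughly as follows. This is an illustration only: `DeviceConfig`, `sample_config`, `alarm_is_set`, the option lists, and the layout of the `state` dictionary are all hypothetical names chosen for this example, not B-MoCA's actual API.

```python
import random
from dataclasses import dataclass

# Hypothetical configuration options; the benchmark's real option set may differ.
LANGUAGES = ["en", "ko", "es", "fr"]
WALLPAPERS = ["default", "dark", "photo"]
ICONS = ["clock", "settings", "phone", "camera"]


@dataclass(frozen=True)
class DeviceConfig:
    language: str
    wallpaper: str
    icon_order: tuple  # order of app icons on the home screen


def sample_config(rng: random.Random) -> DeviceConfig:
    """Draw one randomized device configuration."""
    icons = ICONS[:]
    rng.shuffle(icons)
    return DeviceConfig(
        language=rng.choice(LANGUAGES),
        wallpaper=rng.choice(WALLPAPERS),
        icon_order=tuple(icons),
    )


def alarm_is_set(state: dict, hour: int, minute: int) -> bool:
    """Rule-based check: does the final state contain an enabled alarm at hour:minute?"""
    return any(
        a["hour"] == hour and a["minute"] == minute and a["enabled"]
        for a in state.get("alarms", [])
    )


# Assumed shape of the device state observed after the episode ends.
final_state = {"alarms": [{"hour": 7, "minute": 30, "enabled": True}]}
config = sample_config(random.Random(0))
print(config.language in LANGUAGES)       # True
print(alarm_is_set(final_state, 7, 30))   # True
```

Passing a seeded `random.Random` keeps each sampled configuration reproducible, which matters when the same set of environments has to be reused across agents.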

Agent Performance Insights

Various types of agents were tested, including:

LLM-based agents: These agents, such as those built on GPT-4, generally performed well on simpler tasks but struggled with complex multi-step operations.
MLLM-based agents: These agents take both text and visual inputs. They handled some visually complex tasks better but still showed limitations on task sequences requiring precise actions.
Agents via behavioral cloning (BC): These agents directly interact with the UI and mimic expert behaviors, showing promising results, especially in environments similar to their training data. However, they experienced a drop in performance when faced with unfamiliar device configurations.
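To make the behavioral cloning idea concrete, here is a deliberately simplified, tabular sketch: the policy memorizes the expert's most frequent action for each observed state. This count-based version is a stand-in chosen for brevity, not the architecture used in the paper, and the state and action names are invented for illustration. It also shows in miniature why such agents can falter on unfamiliar configurations: a layout absent from the demonstrations has no learned action.

```python
from collections import Counter, defaultdict


def fit_bc_policy(demos):
    """Fit a tabular policy: for each state, keep the action the expert took most often."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state, fallback="noop"):
    """Return the cloned action, or a fallback for states never seen in the demonstrations."""
    return policy.get(state, fallback)


# Hypothetical demonstrations: (observation, expert action) pairs.
demos = [
    (("home", "clock_icon_top_left"), "tap_top_left"),
    (("home", "clock_icon_top_left"), "tap_top_left"),
    (("clock_app", "alarm_tab"), "tap_add_alarm"),
]
policy = fit_bc_policy(demos)

print(act(policy, ("home", "clock_icon_top_left")))      # tap_top_left
print(act(policy, ("home", "clock_icon_bottom_right")))  # noop (layout not in the demos)
```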

Analysis of Results

Generalization and Robustness: Agents employing LLMs and MLLMs performed robustly across different device configurations, particularly across language settings.
Challenges and Limitations: Both LLM and MLLM agents had difficulty with complex tasks that require precise sequences of actions. MLLM agents also did not always use visual inputs effectively, indicating a potential gap in their current training regimes.
Influence of Training Diversity: Agent performance correlates with the diversity of the training environments. Agents trained across more varied settings showed better performance, underscoring the importance of comprehensive training samples.
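One way to quantify the generalization comparison above is to compute success rates separately on configurations seen during training and on held-out ones, then take the difference. The sketch below assumes a hypothetical episode log of `(agent, split, success)` tuples; the split names and the outcomes are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def success_rates(episodes):
    """Map (agent, split) -> fraction of successful episodes."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for agent, split, success in episodes:
        key = (agent, split)
        totals[key] += 1
        wins[key] += int(success)
    return {key: wins[key] / totals[key] for key in totals}


def generalization_gap(rates, agent):
    """Success rate on seen configurations minus success rate on unseen ones."""
    return rates[(agent, "seen")] - rates[(agent, "unseen")]


# Placeholder outcomes for illustration only.
episodes = [
    ("bc", "seen", True), ("bc", "seen", True),
    ("bc", "unseen", True), ("bc", "unseen", False),
]
rates = success_rates(episodes)
print(generalization_gap(rates, "bc"))  # 0.5
```

A gap near zero suggests the agent transfers well to new configurations; a large positive gap suggests it has fit the training layouts too closely.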

Future Directions

The results point to several areas for future research:

Enhancing Task Complexity Handling: Future work should focus on improving agent strategies for completing multi-step tasks and tasks that involve complex interactions such as text input.
Expanding Training Diversity: Increasing the variety of training environments can potentially boost the generalization capabilities of agents.
Experimenting with Training Approaches: Exploring different training paradigms, such as reinforcement learning or advanced fine-tuning techniques for foundation models, might lead to improvements in task performance.

Conclusion

The introduction of the B-MoCA benchmark provides a robust platform for developing and evaluating agents capable of mobile device control. This work highlights significant opportunities for future research directions that could ultimately lead to the deployment of more capable and reliable assistive technologies for everyday mobile interactions.

By identifying where current approaches fall short, the findings point to targeted improvements in autonomous mobile device interaction. Because B-MoCA tests a wide range of agent capabilities in realistic user scenarios, it is a useful tool for research in mobile automation and agent design. The accompanying open-source release supports reproducibility and further work in the field.
