
Benchmarking Mobile Device Control Agents across Diverse Configurations (2404.16660v2)

Published 25 Apr 2024 in cs.HC, cs.AI, and cs.LG

Abstract: Mobile device control agents can largely enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing LLMs or multi-modal LLMs as well as agents trained with imitation learning using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness. Our source code is publicly available at https://b-moca.github.io.

Introducing B-MoCA: A Benchmark for Evaluating Mobile Device Control Agents

Overview of B-MoCA

B-MoCA is a new benchmark designed specifically for evaluating mobile device control agents. It runs on Android emulators and tests agents across 131 practical tasks drawn from everyday mobile usage. A notable feature of B-MoCA is its ability to randomize device configurations, such as UI layouts and language settings, enabling a systematic assessment of an agent's generalization performance. The benchmark includes baselines built on LLMs and multi-modal LLMs (MLLMs), as well as agents trained from scratch via behavioral cloning, all evaluated in diverse, randomized mobile environments.
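The actual interface lives in the repository at https://b-moca.github.io; as a rough, hedged illustration of how such a benchmark run is typically driven, the sketch below uses a hypothetical `BMoCAEnv` wrapper and invented configuration fields, not the project's real API.

```python
# Hedged sketch of driving a B-MoCA-style benchmark run.
# `BMoCAEnv`, `sample_config`, and all field names are illustrative
# placeholders; the real interface is at https://b-moca.github.io.
import random

class BMoCAEnv:
    """Stand-in for an Android-emulator environment wrapper."""
    def __init__(self, task_id: str, ui_layout: str, language: str):
        self.task_id = task_id
        self.ui_layout = ui_layout
        self.language = language

    def reset(self):
        # Boot the emulator with this configuration and return the
        # initial observation (e.g., a screenshot plus the UI tree).
        ...

    def step(self, action):
        # Execute one touch/type action; return (obs, done, success).
        ...

def sample_config() -> dict:
    # The randomization feature: sample a device configuration per
    # episode so agents face layouts and languages they did not memorize.
    return {
        "ui_layout": random.choice(["default", "dense_grid", "large_icons"]),
        "language": random.choice(["en", "ko", "es"]),
    }

env = BMoCAEnv(task_id="set_alarm_7am", **sample_config())
```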

Experimental Design

The B-MoCA benchmark assesses agents on tasks such as setting alarms, adjusting brightness, and placing emergency calls across different device setups. Task success is determined by a rule-based detector that checks whether the device reaches the intended end state. To test whether agents can operate over varied interface layouts, the randomization feature mimics real-world diversity by altering icon placements, wallpapers, language settings, and more.
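As a minimal sketch of the rule-based detection idea, the snippet below queries emulator state over adb (the `settings get system screen_brightness` command is standard Android tooling; the helper name, device serial, and target value are illustrative, and the paper's actual detectors are task-specific):

```python
# Minimal sketch of a rule-based success detector for a
# "adjust brightness" task: query emulator state over adb and
# compare it against the task's target value.
import subprocess

def brightness_success(serial: str, target: int, tol: int = 0) -> bool:
    # `settings get system screen_brightness` is a standard Android
    # shell command; it returns the current brightness as an integer.
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "settings", "get",
         "system", "screen_brightness"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return abs(int(out) - target) <= tol

# Example: did the agent set brightness to 128 on emulator-5554?
# print(brightness_success("emulator-5554", target=128))
```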

Agent Performance Insights

Various types of agents were tested, including:

  • LLM-based agents: Agents built on text-only LLMs such as GPT-4 generally performed well on simple tasks but struggled with complex multi-step operations (a generic control loop of this kind is sketched after this list).
  • MLLM-based agents: These agents consume both text and visual inputs. They handled certain visually complex tasks better, but still showed limitations on task sequences requiring precise actions.
  • Behavioral cloning (BC) agents: These agents interact directly with the UI and mimic expert demonstrations, showing promising results in environments similar to their training data. However, their performance dropped when faced with unfamiliar device configurations.
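As a rough illustration of how the LLM-based baselines operate (observe the UI, prompt the model, parse an action, execute it), here is a hedged sketch; the prompt format, `query_llm` placeholder, and action grammar are assumptions, not the paper's exact agent implementation:

```python
# Hedged sketch of a generic LLM-driven device-control loop:
# render the UI state as text, ask the model for the next action,
# parse it, and execute it. `query_llm` and the action grammar are
# placeholders, not the benchmark's actual agent code.
import re

def query_llm(prompt: str) -> str:
    """Placeholder for a call to GPT-4, Gemini, etc."""
    raise NotImplementedError

ACTION_RE = re.compile(r"(tap|type|swipe)\((.*)\)")

def run_episode(env, instruction: str, max_steps: int = 20) -> bool:
    obs = env.reset()
    for _ in range(max_steps):
        prompt = (
            f"Task: {instruction}\n"
            f"Current UI elements:\n{obs}\n"
            "Reply with one action: tap(x, y), type(text), or swipe(dir)."
        )
        match = ACTION_RE.search(query_llm(prompt))
        if match is None:
            continue  # unparseable reply; re-prompt with the same state
        obs, done, success = env.step(match.groups())
        if done:
            return success
    return False
```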

Analysis of Results

  • Generalization and Robustness: Agents employing LLMs and MLLMs maintained relatively robust performance across different device configurations, adapting especially well to changed language settings.
  • Challenges and Limitations: Both LLM- and MLLM-based agents struggled with complex tasks requiring long sequences of precise actions. MLLM agents also failed to exploit visual inputs consistently, pointing to a gap in their current training regimes.
  • Influence of Training Diversity: Agent performance correlated with the diversity of the training environments: agents trained across more varied settings generalized better, underscoring the importance of diverse training data.

Future Directions

Based on these findings, several areas are identified for future research:

  • Enhancing Task Complexity Handling: Future work should focus on improving agent strategies for completing multi-step tasks and tasks that involve complex interactions such as text input.
  • Expanding Training Diversity: Increasing the variety of training environments can potentially boost the generalization capabilities of agents.
  • Experimenting with Training Approaches: Exploring different training paradigms, such as reinforcement learning or advanced fine-tuning techniques for foundation models, might lead to improvements in task performance.

Conclusion

The introduction of the B-MoCA benchmark provides a robust platform for developing and evaluating agents capable of mobile device control. This work highlights significant opportunities for future research directions that could ultimately lead to the deployment of more capable and reliable assistive technologies for everyday mobile interactions.

The detailed findings, analysis of agent behaviors, and identified limitations of current approaches pave the way for targeted improvements in autonomous mobile device interaction. Because B-MoCA tests a wide array of agent capabilities across realistic mobile usage scenarios, it stands as a useful tool for advancing research in mobile automation and agent design. The accompanying open-source release supports reproducibility and further innovation in the field.

Authors (7)
  1. Juyong Lee
  2. Taywon Min
  3. Minyong An
  4. Changyeon Kim
  5. Kimin Lee
  6. Dongyoon Hahm
  7. Haeone Lee