Benchmarking Mobile Device Control Agents across Diverse Configurations
Abstract: Mobile device control agents can greatly enhance user interactions and productivity by automating daily tasks. However, despite growing interest in developing practical agents, the absence of a commonly adopted benchmark in this area makes it challenging to quantify scientific progress. In this work, we introduce B-MoCA: a novel benchmark with interactive environments for evaluating and developing mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 131 common daily tasks. Importantly, we incorporate a randomization feature that changes the configurations of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing LLMs or multi-modal LLMs as well as agents trained with imitation learning using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve their effectiveness. Our source code is publicly available at https://b-moca.github.io.