
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

Published 3 Jul 2024 in cs.HC, cs.AI, and cs.MM | arXiv:2407.17490v2

Abstract: AI agents have drawn increasing attention, largely for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents, which complete tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, annotated at multiple levels. Unlike existing GUI-related datasets such as Rico and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model, SPHINX Agent, and illustrate the effectiveness of AMEX. The project is available at https://yxchai.com/AMEX/.
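To make the three annotation levels concrete, here is a minimal Python sketch of what a single AMEX-style annotation record might look like. The field names and structure below are illustrative assumptions, not the dataset's actual schema, which the abstract does not specify.

```python
# Hypothetical sketch of one AMEX-style annotation record. Each record ties a
# screenshot to the paper's three annotation levels: (1) interactive element
# grounding, (2) screen/element functionality descriptions, and (3) a natural
# language instruction with a stepwise GUI-action chain. All field names are
# assumptions for illustration only.
from typing import TypedDict

class ElementGrounding(TypedDict):
    bbox: list[float]       # [x1, y1, x2, y2], normalized screen coordinates
    element_type: str       # e.g. "button", "text_field"

class AnnotationRecord(TypedDict):
    screenshot: str                    # path to a high-resolution screenshot
    elements: list[ElementGrounding]   # level 1: element grounding
    functionality: dict[str, str]      # level 2: screen/element descriptions
    instruction: str                   # level 3: natural language task
    action_chain: list[dict]           # level 3: stepwise GUI actions

example: AnnotationRecord = {
    "screenshot": "screens/booking_0001.png",
    "elements": [{"bbox": [0.12, 0.80, 0.88, 0.88], "element_type": "button"}],
    "functionality": {
        "screen": "Hotel search results page",
        "button": "Confirms the selected check-in date",
    },
    "instruction": "Book a hotel in Paris for next weekend.",
    "action_chain": [
        {"step": 1, "action": "tap", "target_bbox": [0.12, 0.80, 0.88, 0.88]},
    ],
}

print(example["instruction"])
```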

References (35)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Mobile app tasks with iterative feedback (MoTIF): Addressing task feasibility in interactive visual environments. arXiv preprint arXiv:2104.08560, 2021.
  3. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935, 2024.
  4. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  5. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.
  6. Tinghe Ding. MobileAgent: Enhancing mobile control via human-machine interaction and SOP integration. arXiv preprint arXiv:2401.04124, 2024.
  7. SPHINX-X: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  8. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  9. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022.
  10. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914, 2023.
  11. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems, 36, 2024.
  12. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  13. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024.
  14. Mapping natural language instructions to mobile UI action sequences. arXiv preprint arXiv:2005.03776, 2020.
  15. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802, 2018.
  16. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  17. Comprehensive cognitive LLM agent for smartphone GUI automation. arXiv preprint arXiv:2402.11941, 2024.
  18. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  19. Android in the Wild: A large-scale dataset for Android device control, 2023.
  20. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.
  21. InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023.
  22. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  23. UGIF: UI grounded instruction following. arXiv preprint arXiv:2211.07615, 2022.
  24. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  25. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
  26. OS-Copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.
  27. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
  28. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  29. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs. arXiv preprint arXiv:2404.05719, 2024.
  30. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436, 2023.
  31. UFO: A UI-focused agent for Windows OS interaction. arXiv preprint arXiv:2402.07939, 2024.
  32. Android in the Zoo: Chain-of-action-thought for GUI agents, 2024.
  33. Screen recognition: Creating accessibility metadata for mobile applications from pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.
  34. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  35. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.