
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning (2410.18963v1)

Published 24 Oct 2024 in cs.AI and cs.CL

Abstract: Large language models (LLMs) and large multimodal models (LMMs) have shown great potential in automating complex tasks like web browsing and gaming. However, their ability to generalize across diverse applications remains limited, hindering broader utility. To address this challenge, we present OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. OSCAR is a generalist agent designed to autonomously navigate and interact with various desktop and mobile applications through standardized controls, such as mouse and keyboard inputs, while processing screen images to fulfill user commands. OSCAR translates human instructions into executable Python code, enabling precise control over graphical user interfaces (GUIs). To enhance stability and adaptability, OSCAR operates as a state machine, equipped with error-handling mechanisms and dynamic task re-planning, allowing it to efficiently adjust to real-time feedback and exceptions. We demonstrate OSCAR's effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. Our code will be open-sourced upon publication.
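
The abstract describes a concrete control architecture: user instructions are translated into executable Python, and the agent runs as a state machine with verification, error handling, and dynamic re-planning. Since OSCAR's code is not yet released, the following is only a minimal sketch of what such a loop could look like; every function name, the plan format, and the state set are hypothetical stand-ins for illustration, not the authors' actual API.

# Minimal sketch of a state-machine GUI-agent loop in the spirit of the
# abstract. All helpers below are hypothetical placeholders: a real agent
# would use a screenshot/GUI parser, an LLM call, and mouse/keyboard
# automation in their place.

from enum import Enum, auto


class State(Enum):
    OBSERVE = auto()   # read the current screen
    PLAN = auto()      # turn the instruction into executable steps
    EXECUTE = auto()   # run one generated action
    VERIFY = auto()    # check the action had the intended effect
    REPLAN = auto()    # error handling: revise the plan from feedback
    DONE = auto()


def capture_screen() -> str:
    return "desktop with a text editor open"   # placeholder observation


def generate_plan(instruction: str, observation: str, feedback: str = "") -> list[str]:
    # Placeholder for an LLM call that emits executable Python snippets.
    return ["click_menu('File')", "click_item('Save')"]


def exec_action(action: str) -> None:
    print(f"executing: {action}")              # placeholder for running the snippet


def verify(observation: str) -> bool:
    return True                                # placeholder success check


def run_agent(instruction: str, max_transitions: int = 50) -> bool:
    state, plan, obs, feedback = State.OBSERVE, [], "", ""
    for _ in range(max_transitions):           # bound the loop for safety
        if state is State.OBSERVE:
            obs = capture_screen()
            state = State.PLAN if not plan else State.EXECUTE
        elif state is State.PLAN:
            plan = generate_plan(instruction, obs, feedback)
            state = State.EXECUTE
        elif state is State.EXECUTE:
            try:
                exec_action(plan.pop(0))
                state = State.VERIFY
            except Exception as err:            # an exception triggers re-planning
                feedback = str(err)
                state = State.REPLAN
        elif state is State.VERIFY:
            if not verify(capture_screen()):
                feedback = "action had no visible effect"
                state = State.REPLAN
            else:
                state = State.DONE if not plan else State.OBSERVE
        elif state is State.REPLAN:
            plan, state = [], State.OBSERVE     # re-observe, then plan again
        elif state is State.DONE:
            return True
    return False                                # gave up within the step budget


if __name__ == "__main__":
    run_agent("Save the current document")

Making the states explicit keeps failure handling local: an exception or a failed verification routes control to REPLAN rather than aborting the whole task, which is the stability-and-adaptability property the abstract emphasizes.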
