OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning (2410.18963v1)
Abstract: Large language models (LLMs) and large multimodal models (LMMs) have shown great potential in automating complex tasks like web browsing and gaming. However, their ability to generalize across diverse applications remains limited, hindering broader utility. To address this challenge, we present OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. OSCAR is a generalist agent designed to autonomously navigate and interact with various desktop and mobile applications through standardized controls, such as mouse and keyboard inputs, while processing screen images to fulfill user commands. OSCAR translates human instructions into executable Python code, enabling precise control over graphical user interfaces (GUIs). To enhance stability and adaptability, OSCAR operates as a state machine, equipped with error-handling mechanisms and dynamic task re-planning, allowing it to adjust efficiently to real-time feedback and exceptions. We demonstrate OSCAR's effectiveness through extensive experiments on diverse benchmarks across desktop and mobile platforms, where it transforms complex workflows into simple natural language commands, significantly boosting user productivity. Our code will be open-sourced upon publication.
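The abstract's plan-execute-verify-replan loop can be illustrated with a minimal state machine. This is a sketch of the general pattern only, not OSCAR's actual implementation: the state names, callback signatures, and retry budget are all assumptions introduced for illustration.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical agent states; the paper's actual state set may differ."""
    PLAN = auto()
    EXECUTE = auto()
    VERIFY = auto()
    REPLAN = auto()
    DONE = auto()


class StateMachineAgent:
    """Minimal state-machine control loop with error handling and re-planning."""

    def __init__(self, plan_fn, execute_fn, verify_fn, max_replans=3):
        self.plan_fn = plan_fn        # (instruction, feedback) -> list of actions
        self.execute_fn = execute_fn  # action -> (ok, feedback)
        self.verify_fn = verify_fn    # feedback -> bool (task satisfied?)
        self.max_replans = max_replans

    def run(self, instruction):
        state, actions, feedback, replans = State.PLAN, [], None, 0
        trace = []  # visited states, for inspection
        while state != State.DONE:
            trace.append(state.name)
            if state == State.PLAN:
                # Re-plan conditioned on feedback from earlier failures.
                actions = list(self.plan_fn(instruction, feedback))
                state = State.EXECUTE
            elif state == State.EXECUTE:
                ok = True
                for action in actions:
                    ok, feedback = self.execute_fn(action)
                    if not ok:  # exception observed: hand control to re-planning
                        break
                state = State.VERIFY if ok else State.REPLAN
            elif state == State.VERIFY:
                state = State.DONE if self.verify_fn(feedback) else State.REPLAN
            elif state == State.REPLAN:
                replans += 1
                if replans > self.max_replans:
                    raise RuntimeError("re-planning budget exhausted")
                state = State.PLAN
        return trace
```

A toy run: an `execute_fn` that fails on the first click forces one trip through `REPLAN`, after which a revised plan (scroll, then click) succeeds and verification ends the loop.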