OS Agents: Multimodal OS Automation

Updated 7 August 2025
  • OS Agents are multimodal AI systems that process textual, visual, and structured inputs to automate tasks on various operating systems.
  • They integrate techniques like foundation models, chain-of-thought planning, and reactive adjustments to execute precise device actions.
  • They combine robust safety protocols, adaptive learning, and personalized memory modules to ensure effective and secure automation.

An OS Agent is an agent system built on a large language model or multimodal large language model (LLM/MLLM) that leverages the operating system’s environment and interfaces (such as graphical user interfaces, device APIs, and file systems) to automate computer and mobile tasks, with the aim of matching or surpassing expert-level human proficiency in real-world computing contexts. These agents interpret and act upon a combination of textual, visual, and structured inputs—processing everything from HTML/DOM trees and GUI screenshots to natural language commands—then generate and execute sequences of input operations (mouse, keyboard, touch, navigation, API calls) on live OS platforms. This class of systems sits at the intersection of advances in foundation models, symbolic planning, human-computer interaction, and software automation. The following sections provide authoritative coverage of the fundamentals, enabling methodologies, evaluation protocols, practical challenges, and outlook for OS Agents, as synthesized from the recent survey and related literature (Hu et al., 6 Aug 2025).

1. Fundamental Concepts and Core Capabilities

OS Agents operate within the environments provided by general computing devices—spanning desktop operating systems (Windows, macOS, Linux), mobile platforms (Android, iOS), and web-based or hybrid contexts. Their definition hinges on the following compositional elements:

  • Environment: The range of device-level settings supported. These include physical desktops, emulated/virtualized smartphones, browsers, and even simulated or instrumented web applications. The agent must manage device, application, and workflow variability.
  • Observation Space: OS Agents consume rich data modalities:
    • Textual descriptions: e.g., HTML code, DOM trees, system logs.
    • Visual inputs: rendered GUI screenshots, OCR/visual object features.
    • Structured representations: accessibility (a11y) trees, “Set-of-Marks” region annotations.
    • Multimodal fusion: simultaneous processing of heterogeneous inputs (e.g., image + OCR).
  • Action Space: Operations include:
    • Input device actions (mouse click/drag, keyboard typing/tap).
    • Navigation and window management (scroll, switch, tab control).
    • Higher-level system or application commands (API invocation, code execution).
  • Essential Capabilities:
    • Understanding: Parsing and semantic mapping of complex, noisy, and heterogeneous OS/UI states (including small visual elements and compound structures).
    • Planning: Decomposing abstract or open-ended tasks into sequenced actions, implemented via chain-of-thought planning or iterative policy modules (e.g., ReAct, CoAT).
    • Grounding: Mapping intent (textual or visual) to concrete executable actions by identifying the relevant UI elements and parameters (e.g., coordinate extraction, structured path traversals).

These components frame OS Agents as embodied AI systems capable of end-to-end perception, reasoning, and robust enactment in real device settings, undergirded by the capabilities of foundation models adapted to software interaction (Hu et al., 6 Aug 2025).
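The perceive–plan–ground–act cycle described above can be sketched as a minimal control loop. This is an illustrative simplification, not any specific system from the survey; the `planner`, `grounder`, and `executor` callables are hypothetical placeholders for the foundation-model and device-interface components.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Multimodal OS state: screenshot pixels, a11y tree, raw text."""
    screenshot: bytes
    a11y_tree: dict
    text: str

@dataclass
class Action:
    """A grounded device operation, e.g. ('click', {'x': 120, 'y': 48})."""
    kind: str
    params: dict

class OSAgent:
    """Minimal perceive -> plan -> ground -> act loop (illustrative only)."""
    def __init__(self, planner, grounder, executor, max_steps=20):
        self.planner = planner      # (goal, history) -> next sub-goal or "done"
        self.grounder = grounder    # (sub-goal, observation) -> Action
        self.executor = executor    # sends Action to the live OS, returns Observation
        self.max_steps = max_steps

    def run(self, goal: str, obs: Observation) -> list:
        history = []
        for _ in range(self.max_steps):
            step = self.planner(goal, history)
            if step == "done":
                break
            action = self.grounder(step, obs)  # grounding: intent -> concrete action
            history.append(action)
            obs = self.executor(action)        # enactment on the device
        return history
```

Real systems replace each callable with an LLM/MLLM inference step and a device driver, but the separation of planning from grounding mirrors the capability split in the text.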

2. Enabling Methodologies and Architectures

The construction of OS Agents centers on integrated architectural paradigms:

a. Domain-Specific Foundation Models

  • Base Architectures:
    • Text-oriented LLMs directly parse and interpret textual/HTML inputs.
    • MLLMs integrate vision encoders (e.g., ViT, CNN backbones) to process high-resolution GUI images, sometimes augmented with special “cross-attention” modules to improve small object/label resolution (e.g., CogAgent).
    • Hybrids concatenate or fuse multi-scale and cross-modal signals, enabling fine-grained perception of embedded texts, icons, and pixel-level details.
  • Pre-training and Supervision:
    • Foundation models undergo large-scale pre-training using both public (e.g., CommonCrawl) and synthesized datasets to learn GUI grounding, multimodal reasoning, and OCR.
    • Supervised fine-tuning on curated human or trajectory-generated demonstrations, task success labels, and alignment data.
    • Reinforcement learning (e.g., PPO, self-evolving curriculum RL) is applied to optimize agent policies for prolonged task success, error coping, and recovery.
  • Grounding and Planning Mechanisms:
    • Grounding tasks require extracting and predicting (x, y) UI element coordinates, bounding boxes, or control locators.
    • Planning frameworks employ both global (macro-step) chain-of-thought and local (reactive) adjustment—critical for dynamic and interactive desktop/mobile workflows.
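The grounding task of extracting (x, y) coordinates can be made concrete with a small sketch: parsing a model's textual action output into a click point, then checking it against a ground-truth bounding box. The `click(x=..., y=...)` output format is an assumption for illustration, not a standard shared by the models cited.

```python
import re

def parse_click(model_output: str):
    """Extract a predicted (x, y) from an action string like 'click(x=120, y=48)'."""
    m = re.search(r"click\(x=(\d+),\s*y=(\d+)\)", model_output)
    if m is None:
        raise ValueError(f"unparseable action: {model_output!r}")
    return int(m.group(1)), int(m.group(2))

def hits_element(point, bbox):
    """Grounding check: does the predicted point land inside the
    ground-truth element box (left, top, right, bottom)?"""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom
```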

b. Agent Frameworks

  • Modular Design: OS Agent frameworks are typically organized into Perception, Planning, Memory, and Action modules.
    • Perception: Fuses multi-modal input and structured UI representations.
    • Planning: Supports both global decomposition (chain-of-thought) and local correction through iterative feedback.
    • Memory: Combines short-term (prior step histories) and long-term (user profile, past states) storage to enable reversible action and meta-level strategy learning.
    • Action: Converts high-level plans to grounded system commands; may include explicit mapping for rare/extended operations.
  • Notable Implementations: OS-Copilot and AppAgent exemplify these architectures, frequently leveraging open-source and commercial foundation models (Hu et al., 6 Aug 2025).
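The Memory module's split between short-term step histories and long-term user state can be sketched as follows. This is a toy illustration under assumed interfaces, not the memory design of OS-Copilot or AppAgent.

```python
from collections import deque

class AgentMemory:
    """Sketch of a Memory module: a bounded short-term buffer of recent
    (observation, action) steps plus a long-term key-value store for
    user profile and past task outcomes."""
    def __init__(self, short_term_size: int = 8):
        self.short_term = deque(maxlen=short_term_size)  # recent steps only
        self.long_term = {}                              # persists across tasks

    def record_step(self, observation_summary: str, action: str):
        self.short_term.append((observation_summary, action))

    def remember(self, key: str, value: str):
        self.long_term[key] = value

    def context(self) -> str:
        """Flatten both stores into prompt context for the planner."""
        recent = "; ".join(f"{o} -> {a}" for o, a in self.short_term)
        profile = "; ".join(f"{k}={v}" for k, v in self.long_term.items())
        return f"recent steps: {recent} | profile: {profile}"
```

Bounding the short-term buffer keeps the planner's prompt within context limits, while the long-term store carries preferences between sessions.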

3. Evaluation Protocols and Benchmarking

OS Agent assessment is grounded in multi-tiered frameworks that account for both step-wise and holistic task evaluation:

  • Objective Metrics:
    • Element/operation accuracy (exact/matched action evaluation).
    • Grounding accuracy (e.g., predicted bounding boxes within ground-truth).
    • F1, BLEU, ROUGE/BERTScore for text or hybrid output evaluation.
    • End-to-end task success rate, defined via custom environment scripts that validate final-state correctness.
    • Efficiency statistics (step ratio, latency, token/call counts).
  • Subjective Metrics:
    • Human or LLM judge evaluations for qualities such as relevance, coherence, and naturalness.
  • Step-Level vs. Task-Level:
    • Step-level: Each action is scored (for debugging and granularity).
    • Task-level: Aggregate performance reflects real-world usability.
  • Multi-Platform Benchmarks:
    • OSWorld, Windows Agent Arena, OS-MAP, and others encapsulate open-ended, cross-application and cross-domain task suites with hundreds of real-world tasks (Xie et al., 11 Apr 2024, Bonatti et al., 12 Sep 2024, Chen et al., 25 Jul 2025).
    • Tasks span application installation, file management, web form completion, and workflow orchestration, with scoring protocols that consider both atomic operations and multi-step strategies.
    • A diversity of evaluation environments ensures assessment against unseen applications, dynamic UI changes, and varying interaction modalities.
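The step-level vs. task-level distinction reduces to two simple aggregations, sketched below. Exact-match step scoring is one common convention; benchmark-specific matchers may be looser.

```python
def step_accuracy(predicted, gold):
    """Step-level: fraction of predicted actions exactly matching gold actions,
    compared position by position."""
    matched = sum(p == g for p, g in zip(predicted, gold))
    return matched / max(len(gold), 1)

def task_success_rate(results):
    """Task-level: fraction of episodes whose final state passed the
    benchmark's validation script (one True/False per episode)."""
    return sum(results) / max(len(results), 1)
```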

4. Safety, Security, and Robustness

Security and robustness emerge as central concerns in practical deployments of OS Agents.

  • Attack Vectors:
    • Adversarial prompt injection via web pages, documents, or on-screen text.
    • Manipulated visual inputs (image attacks on screenshots or UI elements).
    • Deliberate misuse of agent capabilities by malicious users.
    • Model misbehavior, i.e., unsafe or unintended actions during otherwise benign tasks.
  • Specialized Benchmarks:
    • OS-Harm introduces a structured benchmark with 150 tasks spread across deliberate misuse, prompt injection, and model misbehavior categories (Kuntz et al., 17 Jun 2025).
    • RedTeamCUA employs a hybrid OS-Web adversarial sandbox for comprehensive evaluation of agent safety and resilience, with metrics such as Adversarial Success Rate (ASR) and Attempt Rate (AR) (Liao et al., 28 May 2025).
  • Proposed Defenses:
    • Automated LLM-based semantic judges for post-execution audit of agent traces.
    • Defensive prompt engineering (System Prompt Guardrails).
    • Robustified action verification and increased reliance on memory/context consistency.
    • Runtime guardrails, action confirmation, and restricted API exposure for sensitive operations.
    • Adversarially augmented training and preprocessing pipelines to immunize against specific classes of perturbations.

Both quantitative (F1, ASR) and qualitative (violation categories, error traces) metrics are used to iteratively refine agent safety properties under open-world, adversarial, and uncertain conditions.
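The ASR and AR metrics mentioned above reduce to simple episode-level fractions. The boolean fields below are hypothetical names for illustration; the cited benchmarks define their own precise success criteria.

```python
def adversarial_success_rate(episodes):
    """ASR: fraction of adversarial episodes in which the injected goal
    was actually completed by the agent."""
    return sum(e["attack_succeeded"] for e in episodes) / max(len(episodes), 1)

def attempt_rate(episodes):
    """AR: fraction of episodes in which the agent at least attempted the
    injected goal, regardless of completion."""
    return sum(e["attack_attempted"] for e in episodes) / max(len(episodes), 1)
```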

5. Open Challenges: Personalization, Memory, and Self-Evolution

Current limitations and targeted research directions are as follows:

  • Personalization:
    • Integration of persistent, hierarchical memory to maintain user profiles and execution rules across sessions.
    • Dynamic adaptation to individual preferences, task histories, and evolving device/application contexts.
    • Ensuring privacy and selective memory retention, especially for personal or sensitive data.
  • Long-term and Scalable Memory:
    • Modular architectures such as MemoryOS enable short-term, mid-term, and long-term memory integration, leveraging strategies analogous to OS memory management (dialogue-chain FIFO, heat-based priorities, segmented paging) (Kang et al., 30 May 2025).
    • These hierarchies support more coherent, contextually aware, and efficient agent operation over extended dialogues and across sessions.
  • Self-Evolution:
    • Continuous learning from interaction experience ("self-evolution"), accumulation of action/observation-reward statistics, and task-aware correction in changing environments.
    • Employing curriculum reinforcement learning, self-instructed task synthesis, and reverse task mining (OS-Genesis) to improve data/experience diversity without manual supervision (Sun et al., 27 Dec 2024).
    • Real-time confidence estimation and adaptive human-agent collaboration (e.g., OS-Kairos), allowing agents to act autonomously when confident while requesting human input or abstaining under uncertainty (Cheng et al., 26 Feb 2025).
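The OS-style memory hierarchy described above (FIFO dialogue chain, heat-based promotion) can be sketched in miniature. This is a hypothetical simplification of the MemoryOS idea, not its actual implementation: entries live in a FIFO short-term chain and are promoted to a mid-term store once their access "heat" crosses a threshold.

```python
from collections import deque

class HierarchicalMemory:
    """Toy two-tier memory: FIFO short-term chain with heat-based
    promotion of frequently accessed entries to mid-term storage."""
    def __init__(self, short_capacity=4, heat_threshold=2):
        self.short_term = deque(maxlen=short_capacity)  # FIFO dialogue chain
        self.mid_term = {}                              # promoted hot entries
        self.heat = {}                                  # access counts per key
        self.heat_threshold = heat_threshold

    def add(self, key, value):
        self.short_term.append((key, value))
        self.heat[key] = 0

    def access(self, key):
        """Reading an entry raises its heat; hot entries get promoted
        so they survive FIFO eviction from the short-term chain."""
        self.heat[key] = self.heat.get(key, 0) + 1
        for k, v in list(self.short_term):
            if k == key and self.heat[key] >= self.heat_threshold:
                self.mid_term[k] = v
        for k, v in self.short_term:
            if k == key:
                return v
        return self.mid_term.get(key)
```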

A plausible implication is that fully mature OS Agents will maintain continual learning and safety even in non-stationary, multi-application, and adversarial settings, with autonomously updated memory and personalized interaction models.

6. Resources, Benchmarks, and Community Anchors

The field’s rapid progress is fueled by the proliferation of open resources:

  • Open-source Repositories: Projects such as os-agent-survey.github.io, OS-Copilot, OS-MAP, OSWorld, OS-Atlas, and OS-Harm provide codebases, model checkpoints, curated datasets, and extensible benchmarks (Hu et al., 6 Aug 2025, Wu et al., 12 Feb 2024, Chen et al., 25 Jul 2025, Xie et al., 11 Apr 2024, Wu et al., 30 Oct 2024, Kuntz et al., 17 Jun 2025).
  • Benchmark Tables: Multi-platform, cross-modality benchmarks enable reproducibility, systematic cross-agent evaluation, and precise capability breakdowns regarding planning, grounding, and safety.
  • Open Challenges: The field is moving toward deep integration with both academic inquiry and industry deployment, with open issues in model scaling, adaptive safety, effective personalization, long-term memory, and transparent evaluation methodology.

Further advances are expected through enhancements in agent modularity, compositionality, and hybrid planning/memory architectures, with continuous updates from the community and benchmark repositories.

7. Summary Table: Key Dimensions of OS Agents

| Dimension | Description | Example References |
| --- | --- | --- |
| Environment | Desktop, mobile, web, browser, app | Xie et al., 11 Apr 2024; Chen et al., 25 Jul 2025 |
| Observation Space | Text, screenshots, DOM, a11y tree, SoM | Hu et al., 6 Aug 2025; Wu et al., 30 Oct 2024 |
| Action Space | Mouse, keyboard, tap, navigation, code/API | Chen et al., 25 Jul 2025; Mei et al., 25 Mar 2024 |
| Core Capabilities | Understanding, planning, grounding | Hu et al., 6 Aug 2025 |
| Agent Framework | Perception, planning, memory, action modules | Wu et al., 12 Feb 2024; Hu et al., 6 Aug 2025 |
| Safety | Adversarial prompt, image attacks, misuse, model misbehavior | Kuntz et al., 17 Jun 2025; Liao et al., 28 May 2025 |
| Evaluation | Success, accuracy, robustness, F1, step/task-level | Xie et al., 11 Apr 2024; Chen et al., 25 Jul 2025 |

Conclusion

OS Agents constitute a foundational aspect of current and future digital automation, synthesizing advances in multimodal foundation models, planning, grounding, and human-computer interaction. Their architectures and evaluation methodologies are actively evolving to address challenges in robust perception, adaptive autonomy, reliability under attack, and personalized experience. Persistent open benchmarks, collaborative repositories, and cross-paradigm research are shaping the pace and direction of innovation in this domain (Hu et al., 6 Aug 2025).