Multi-Agent Computer Use (MACU)

Updated 4 July 2026

MACU is a multi-agent framework for computer use that orchestrates specialized agents via shared memory, event buses, and continual replanning.
It decomposes complex tasks into parallel subtasks using designs like manager-driven DAGs, orchestrator–worker loops, and sequential routing.
Reported results show gains in success rate and wall-clock speed over single-agent baselines, while safety and governance work targets the risks that emerge across multi-step execution.

Multi-Agent Computer Use (MACU) denotes computer-use systems in which multiple agents collaborate over shared resources to accomplish long-horizon desktop or web tasks through orchestration, role specialization, and coordination mechanisms such as shared memory, an event bus, or tool arbitration. In contrast to the prevailing single serial agent paradigm, MACU emphasizes task decomposition, parallel execution, continual replanning, and explicit handling of partial observability, so that information that downstream agents may not be able to re-observe can be retained and routed forward through the system (Koh et al., 1 Jun 2026, Feng et al., 31 May 2026). Across recent work, MACU appears both as an explicit multi-agent runtime over homogeneous or specialized workers and as an internal compositional pattern in which planning, grounding, execution, critique, and safety are distributed across cooperating modules rather than collapsed into a monolithic generalist (Agashe et al., 1 Apr 2025, Lee et al., 19 Feb 2026).

1. Conceptual scope and problem setting

MACU arises from a common diagnosis: single-agent computer-use agents struggle on long-horizon, partially observable tasks that benefit from decomposition, parallel exploration, evidence gathering, and backtracking. Serial execution prevents exploiting natural parallelism, initial plans are brittle in dynamic GUIs and websites, and downstream steps often lose access to non-recoverable state such as browser session state, open tabs, typed text, and intermediate files (Koh et al., 1 Jun 2026). Related work frames similar limitations in other terms. LiteCUA identifies a “semantic disconnect” between how LLMs represent the world and how computers expose interaction, arguing that raw GUIs are designed for human perception and motor skills rather than symbolic or semantic reasoning by LLMs (Mei et al., 24 May 2025). Agent S2 diagnoses a different but compatible bottleneck: imprecise grounding of GUI elements, long-horizon planning difficulty, and the performance cost of relying on a single generalist model for diverse cognitive tasks (Agashe et al., 1 Apr 2025).

Within this literature, MACU encompasses several design regimes rather than a single architecture. One regime centers on explicit teams of agents over shared environments, such as planner, browser, coder, ops, auditor, or referee roles (Feng et al., 31 May 2026). A second regime uses a manager model to decompose a task as a directed acyclic graph (DAG) and dispatch parallel subagents on the ready frontier (Koh et al., 1 Jun 2026). A third regime implements MACU internally as a compositional generalist–specialist framework, where multiple cooperating components route work among themselves while presenting as one system externally (Agashe et al., 1 Apr 2025). A fourth regime stabilizes long-horizon execution through shared plan memory, intent abstractions, and critique loops that localize recovery instead of repeatedly regenerating the entire plan (Lee et al., 19 Feb 2026).

This range of formulations suggests that MACU is best understood as a systems-level response to three recurrent constraints in computer use: heterogeneous interfaces, partial observability, and long-horizon error accumulation. A plausible implication is that “multi-agent” in this domain refers less to concurrency alone than to explicit separation of strategic, grounding, execution, memory, and safety functions.

2. Architectural patterns and coordination regimes

Recent systems instantiate MACU through several recurring coordination patterns.

System	Coordination pattern	Salient features
LiteCUA on AIOS 1.0	Orchestrator–Worker	perceive–reason–act loop; MCP-exposed environment (Mei et al., 24 May 2025)
Agent S2	Generalist–specialist	Manager, Worker, Mixture-of-Grounding experts (Agashe et al., 1 Apr 2025)
AnyMAC	Sequential adaptive routing	Next-Agent Prediction and Next-Context Selection (Wang et al., 21 Jun 2025)
IntentCUA	Cooperative triad	Planner, Plan-Optimizer, Critic over shared plan memory (Lee et al., 19 Feb 2026)
“Multi-Agent Computer Use”	Manager-driven DAG	ready-frontier parallelism and continual replanning (Koh et al., 1 Jun 2026)

The manager-driven DAG formulation is the most explicit general architecture. A manager decomposes the task into a graph $G = (V, E)$ of subtasks with dependencies, dispatches parallel CUA subagents to execute nodes on the ready frontier, and revises the graph by adding, canceling, or rewriting nodes as new findings arrive. The readiness frontier is defined as

$F_t = \{ v \in V \setminus C_t : \forall u \in \mathrm{pred}(v),\ u \in C_t \},$

where $C_t$ is the completed set at iteration $t$ (Koh et al., 1 Jun 2026). In this design, completed nodes are frozen, running nodes can be canceled but not modified, and pending nodes can be modified or rewired. The manager also handles file management, state transfer, follow-up decisions, and final aggregation.
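As a minimal sketch of the readiness frontier and the node-mutability rules described above, the following Python computes $F_t$ from a predecessor map and a completed set. The task names, the `preds` layout, and the `allowed_edits` helper are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, Set

def ready_frontier(preds: Dict[str, Set[str]], completed: Set[str]) -> Set[str]:
    """Return F_t: nodes not in C_t whose predecessors are all in C_t."""
    return {
        v for v, ps in preds.items()
        if v not in completed and ps <= completed
    }

def allowed_edits(status: str) -> Set[str]:
    """Edits the manager may apply to a node, following the rules in the text."""
    rules = {
        "completed": set(),             # completed nodes are frozen
        "running": {"cancel"},          # running nodes can only be canceled
        "pending": {"modify", "rewire"} # pending nodes can be modified or rewired
    }
    return rules[status]

# Hypothetical task graph: node -> set of predecessor nodes.
preds = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare_prices": {"search_flights", "search_hotels"},
    "write_report": {"compare_prices"},
}

frontier_start = ready_frontier(preds, completed=set())
frontier_mid = ready_frontier(preds, completed={"search_flights", "search_hotels"})
```

Here the two search nodes are ready immediately and can run in parallel, while `compare_prices` becomes ready only once both have completed. A real scheduler would presumably also exclude nodes that are already running; the formula as stated only excludes completed ones.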

Other systems distribute the same responsibilities differently. LiteCUA uses an orchestrator–worker organization around a simple perceive–reason–act cycle: a Perceptor ingests screenshot and accessibility tree, a Reasoner produces “thought” and proposed “action,” and an Actor translates semantic actions into GUI operations through MCP and HTTP tool calls (Mei et al., 24 May 2025). Agent S2 separates a generalist Manager $M$ that produces high-level subgoals from a generalist Worker $W$ that emits low-level parameterized actions and routes each action to a grounding specialist, such as a visual, textual, or structural expert (Agashe et al., 1 Apr 2025). IntentCUA further specializes the loop into Planner, Plan-Optimizer, and Critic, with the Critic emitting $\{\text{success}, \text{retryable}, \text{blocked}\}$ after each plan unit and triggering local recovery when possible (Lee et al., 19 Feb 2026).
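The Critic-driven recovery loop attributed to IntentCUA above can be sketched as follows. The three verdict labels come from the text; the `run_plan` function, the retry budget `max_retries`, and the stub executor and critic are hypothetical and only illustrate how a verdict after each plan unit can localize recovery.

```python
from enum import Enum
from typing import Callable, List, Tuple

class Verdict(Enum):
    SUCCESS = "success"
    RETRYABLE = "retryable"
    BLOCKED = "blocked"

def run_plan(
    units: List[str],
    execute: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    max_retries: int = 2,  # assumed retry budget, not taken from the paper
) -> Tuple[bool, List[Tuple[str, Verdict]]]:
    """Execute plan units in order, retrying a unit locally when the critic allows it."""
    log: List[Tuple[str, Verdict]] = []
    for unit in units:
        retries = 0
        while True:
            verdict = critic(unit, execute(unit))
            log.append((unit, verdict))
            if verdict is Verdict.SUCCESS:
                break
            if verdict is Verdict.RETRYABLE and retries < max_retries:
                retries += 1
                continue
            return False, log  # blocked, or retry budget exhausted
    return True, log

# Stub executor: the first attempt at "open_dialog" times out, later attempts succeed.
attempts_seen = {}
def flaky_execute(unit: str) -> str:
    attempts_seen[unit] = attempts_seen.get(unit, 0) + 1
    if unit == "open_dialog" and attempts_seen[unit] == 1:
        return "timeout"
    return "ok"

def simple_critic(unit: str, result: str) -> Verdict:
    return Verdict.SUCCESS if result == "ok" else Verdict.RETRYABLE

ok, log = run_plan(["open_dialog", "type_name"], flaky_execute, simple_critic)
```

Only the failing unit is re-executed; earlier units are not replayed and the plan is not regenerated.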

AnyMAC departs from graph-based orchestration by treating multi-agent coordination as a sequential communication pipeline rather than a static or learned graph. The communication sequence is

$S = [a_1, a_2, \dots, a_T],$

with the system predicting the next agent $a_t$ at each step and selecting a globally relevant subset of prior context for that agent (Wang et al., 21 Jun 2025). This design explicitly allows agent reuse, dynamic order variation, and access to nonlocal historical messages, which static chains, stars, trees, and DAGs restrict.

These architectures differ in scheduling and control flow, but they share a common organizational principle: high-level reasoning, low-level execution, and recovery are isolated into components that can be specialized, replaced, or coordinated under a common protocol.

3. Environment abstraction, tool surfaces, and action modalities

A central theme in MACU research is that coordination quality depends on how the environment is represented to agents. AIOS 1.0 addresses the interface problem by transforming the full computer into an MCP server that exposes computer state and an action space comprising atomic operations such as CLICK, SCROLL, TYPE, DRAG, and WAIT. Perception combines screenshots, accessibility trees, and mechanisms for invisible information such as software version inspection, and interactive elements plus permissible actions are explicitly encoded in JSON schemas (Mei et al., 24 May 2025). This design attempts to decouple interface complexity from decision complexity by allowing agents to reason over semantically structured state and action abstractions rather than raw pixels and device-specific gestures.
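To make the idea of a schema-encoded atomic action space concrete, here is a small validation sketch. The five action names are taken from the text; the argument names (`x`, `y`, `dx`, `dy`, `text`, `seconds`, and so on) and the `{"action": ..., "args": ...}` envelope are assumptions for illustration, since the exact AIOS serialization is not given here.

```python
import json

# Hypothetical argument schema per atomic action: name -> accepted Python types.
ACTION_FIELDS = {
    "CLICK": {"x": (int,), "y": (int,)},
    "SCROLL": {"dx": (int,), "dy": (int,)},
    "TYPE": {"text": (str,)},
    "DRAG": {"x1": (int,), "y1": (int,), "x2": (int,), "y2": (int,)},
    "WAIT": {"seconds": (int, float)},
}

def validate_action(raw: str) -> dict:
    """Parse a JSON-encoded action and check it against the schema table."""
    action = json.loads(raw)
    name = action.get("action")
    if name not in ACTION_FIELDS:
        raise ValueError(f"unknown action: {name!r}")
    spec = ACTION_FIELDS[name]
    args = action.get("args", {})
    if set(args) != set(spec):
        raise ValueError(f"{name} expects arguments {sorted(spec)}, got {sorted(args)}")
    for key, types in spec.items():
        if not isinstance(args[key], types):
            raise ValueError(f"{name}.{key} has the wrong type")
    return action

click = validate_action('{"action": "CLICK", "args": {"x": 120, "y": 48}}')
```

Rejecting malformed or unknown actions before they reach the GUI layer is one way an Actor component could keep a reasoning model's output inside the permitted action space.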

MCPWorld extends this concern from runtime architecture to evaluation infrastructure. It is described as the first automatic CUA testbed for API, GUI, and API–GUI hybrid agents, built around “white-box apps” whose source code can be revised, re-compiled, instrumented, and optionally exposed through MCP servers (Yan et al., 9 Jun 2025). By monitoring application behavior through dynamic code instrumentation, targeted code injection, and API-driven querying, MCPWorld verifies task completion by app-internal signals rather than screenshots or file outputs. For MACU, this is significant because different agents may specialize in different tool surfaces while still being evaluated against the same internal ground truth.

UltraCUA pushes the action-space question further by replacing pure GUI control with hybrid action: a unified policy alternating between low-level GUI primitives and high-level programmatic tool calls. Its tool inventory is derived from software documentation, open-source repositories, and code generation; the appendix reports 881 tools across 10 domains. The action space therefore includes both click, type, scroll, and key combos, and programmatic operations exposed as Python function signatures with docstrings and parameter schemas (Yang et al., 20 Oct 2025). The result is not itself a multi-agent protocol, but it directly affects MACU because hybrid tool calls compress long GUI chains into atomic operations that are easier to schedule, share, and synchronize across agents.

ALARA for Agents addresses the same layer from a governance standpoint. Its declarative context-agent-tool (CAT) data layer stores context files, per-agent NPC files, and Jinxes, where the Jinx list is both the tool catalog and the permission set. The formal guarantee is structural: if an edit removes a tool $t$ from an agent’s tool set, then every future execution attempt by that agent that names $t$ fails, regardless of model output or prompt injection (Agostino et al., 20 Mar 2026). In MACU terms, this makes tool access and context scoping part of the architecture rather than an advisory prompt convention.
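The structural guarantee can be illustrated with a toy harness in which the tool catalog is also the permission set. This is a sketch under assumed names (`Agent`, `invoke`, `ToolNotPermitted`) and does not reproduce the actual CAT file formats or the ALARA implementation.

```python
from typing import Any, Callable, Dict

class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool that is not in its tool set."""

class Agent:
    def __init__(self, name: str, jinxes: Dict[str, Callable[..., Any]]) -> None:
        self.name = name
        # The same mapping serves as catalog and permission set.
        self.jinxes = dict(jinxes)

    def invoke(self, tool: str, *args: Any) -> Any:
        # The check happens in the harness, so it holds whatever the model emits.
        if tool not in self.jinxes:
            raise ToolNotPermitted(f"{self.name} has no tool {tool!r}")
        return self.jinxes[tool](*args)

browser = Agent("browser", {
    "read_page": lambda url: f"contents of {url}",
    "delete_file": lambda path: f"deleted {path}",
})

before = browser.invoke("delete_file", "/tmp/report.txt")

# An edit removes the tool from the agent's tool set.
del browser.jinxes["delete_file"]

try:
    browser.invoke("delete_file", "/tmp/report.txt")
    denied = False
except ToolNotPermitted:
    denied = True
```

After the edit, no prompt content can re-enable `delete_file`, because the lookup never consults model output.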

Taken together, these systems treat environment exposure, tool schemas, and access control as first-class design variables. This suggests that MACU performance depends not only on better planners or stronger models, but also on the availability of standardized, semantically meaningful state and action interfaces.

4. Planning, routing, memory, and intent preservation

MACU planning methods differ chiefly in how they preserve task intent while adapting to new observations. In the DAG-based formulation, the manager persists non-recoverable evidence through screenshots, textual outputs, file archives, and VM state cloning, and routes critical context forward via instructions and file attachments (Koh et al., 1 Jun 2026). The system uses init_from to resume from a prior subtask’s final VM and variant_of to fork a structural retry from the same pre-run snapshot. This makes partial observability a first-class systems problem rather than a prompt-engineering issue.
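The difference between `init_from` and `variant_of` can be sketched as a rule for choosing which VM snapshot a new subtask starts from. The two field names come from the text; the `NodeRun` record, the snapshot identifiers, and the `starting_snapshot` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class NodeRun:
    pre_snapshot: str                     # VM state captured before the subtask ran
    final_snapshot: Optional[str] = None  # VM state after it finished, if it has

def starting_snapshot(
    runs: Dict[str, NodeRun],
    init_from: Optional[str] = None,
    variant_of: Optional[str] = None,
    fresh: str = "base-image",  # assumed default when neither field is set
) -> str:
    """Pick the snapshot a new subtask should boot from."""
    if init_from is not None and variant_of is not None:
        raise ValueError("use either init_from or variant_of, not both")
    if init_from is not None:
        final = runs[init_from].final_snapshot
        if final is None:
            raise ValueError(f"{init_from} has not finished yet")
        return final  # resume where the prior subtask ended
    if variant_of is not None:
        return runs[variant_of].pre_snapshot  # retry from the same starting point
    return fresh

runs = {"login": NodeRun(pre_snapshot="snap-0", final_snapshot="snap-1")}

resume_from = starting_snapshot(runs, init_from="login")
retry_from = starting_snapshot(runs, variant_of="login")
```

Resuming carries forward session state such as logged-in browser tabs, whereas a variant discards the failed attempt's side effects.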

AnyMAC formalizes routing and context selection directly. The router consumes embeddings of the query, the role descriptions, and the historical responses, together with dedicated NAP and NCS tokens, and produces contextualized representations of these inputs.

Next-Agent Prediction selects the most suitable role by scoring role tokens against the contextualized NAP token, while Next-Context Selection computes cosine similarity between the contextualized NCS token and history embeddings, gates them with a sigmoid, and selects the subset above a threshold (Wang et al., 21 Jun 2025). The full objective combines a policy-gradient term with a sparsity penalty.

This mechanism directly targets two failure modes of fixed topologies: inability to revisit experts and inability to retrieve globally relevant earlier context.
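A toy version of the two routing decisions might look like the following. The cosine similarity, sigmoid gate, and threshold follow the description above; the dot-product scoring for Next-Agent Prediction, the threshold value 0.6, and all vectors are assumptions chosen for illustration rather than details from the paper.

```python
import math
from typing import Dict, List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def next_agent(nap: Sequence[float], roles: Dict[str, Sequence[float]]) -> str:
    """Pick the role whose token scores highest against the NAP token (dot product assumed)."""
    return max(roles, key=lambda name: dot(nap, roles[name]))

def select_context(ncs: Sequence[float], history: List[Sequence[float]],
                   threshold: float = 0.6) -> List[int]:
    """Return indices of history messages whose gated similarity exceeds the threshold."""
    return [i for i, h in enumerate(history) if sigmoid(cosine(ncs, h)) > threshold]

# Hypothetical 2-d embeddings.
roles = {"planner": [1.0, 0.0], "coder": [0.0, 1.0]}
chosen = next_agent([0.2, 0.9], roles)

history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kept = select_context([1.0, 0.0], history)
```

Because selection runs over the whole history rather than only the previous message, the chosen agent can receive nonlocal context (messages 0 and 2 here), and the same role can be selected again at a later step.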

IntentCUA addresses the same problem through shared plan memory and multi-view intent representations. Planner, Plan-Optimizer, and Critic coordinate over an intent group/subgroup index, skill hints, and a cache of user-approved global plans. Its multi-view encoder maps environment, action or keyword, and description views into a shared representation, and the implementation combines the views by weighted fusion (Lee et al., 19 Feb 2026). Retrieval is based on cosine similarity to subgroup centroids, and skills are represented by medoid-derived canonicalized verb–argument schemas with typed parameters. The training objective combines cross-view contrastive learning, dual prediction, and reconstruction. Only user-approved plans are cached for reuse, which ties memory reuse to a human validation boundary.
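Centroid-based retrieval over fused multi-view embeddings can be sketched in a few lines. The fusion weights, the subgroup names, and the 2-d vectors below are hypothetical; only the general recipe (weighted fusion of views, then cosine similarity to subgroup centroids) is taken from the description above.

```python
import math
from typing import Dict, List, Sequence

def fuse(views: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of same-length view embeddings."""
    dim = len(views[0])
    return [sum(w * v[i] for v, w in zip(views, weights)) for i in range(dim)]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def nearest_subgroup(query: Sequence[float],
                     centroids: Dict[str, Sequence[float]]) -> str:
    """Return the subgroup whose centroid is most cosine-similar to the query."""
    return max(centroids, key=lambda name: cosine(query, centroids[name]))

# Hypothetical environment, action, and description view embeddings.
env_view, action_view, desc_view = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
fused = fuse([env_view, action_view, desc_view], weights=[0.5, 0.3, 0.2])

centroids = {"file_ops": [1.0, 0.2], "web_search": [0.0, 1.0]}
match = nearest_subgroup(fused, centroids)
```

The retrieved subgroup would then supply skill hints or a cached plan; in the source system only user-approved plans are eligible for that cache.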

Agent S2 offers a complementary hierarchical perspective by modeling computer use as a partially observable Markov decision process (POMDP), where observations can include screenshots, instructions, and accessibility trees, and actions include click, type, scroll, hotkey, drag-and-drop, and related operations (Agashe et al., 1 Apr 2025). Its Proactive Hierarchical Planning distinguishes a coarse manager timescale from a fine worker timescale and replans after each subgoal rather than only after explicit failure.
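The distinction between proactive and failure-triggered replanning can be shown with a short control loop. This is a sketch with stub functions: `plan_fn`, `execute_fn`, the subgoal names, and the `max_subgoals` cap are all hypothetical and stand in for the manager and worker models.

```python
from typing import Callable, List, Optional

def proactive_loop(
    task: str,
    plan_fn: Callable[[str, List[str], Optional[str]], List[str]],
    execute_fn: Callable[[str], str],
    max_subgoals: int = 10,  # assumed safety cap
) -> List[str]:
    """Run one subgoal at a time and ask the manager for a new plan after each one."""
    done: List[str] = []
    plan = plan_fn(task, done, None)
    while plan and len(done) < max_subgoals:
        subgoal = plan[0]
        observation = execute_fn(subgoal)      # worker timescale
        done.append(subgoal)
        plan = plan_fn(task, done, observation)  # manager replans after every subgoal
    return done

def plan_fn(task: str, done: List[str], observation: Optional[str]) -> List[str]:
    remaining = [s for s in ["open_app", "export_pdf"] if s not in done]
    if observation == "popup" and "dismiss_popup" not in done:
        return ["dismiss_popup"] + remaining
    return remaining

def execute_fn(subgoal: str) -> str:
    return "popup" if subgoal == "open_app" else "ok"

trace = proactive_loop("export the document", plan_fn, execute_fn)
```

No subgoal fails in this run, yet the plan still changes: the unexpected popup observed after `open_app` leads the manager to insert `dismiss_popup` before continuing.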

Across these approaches, the central issue is not merely search over actions, but preservation of intent under noisy perception, multi-window state, and long execution traces. A plausible implication is that MACU planning increasingly treats memory, retrieval, and routing as coequal with control.

5. Benchmarks, metrics, and empirical findings

MACU-related systems are evaluated on heterogeneous benchmarks, so reported numbers are benchmark-specific rather than directly interchangeable. Even so, a set of recurring empirical patterns is visible.

System	Benchmark	Reported result
LiteCUA	OSWorld	14.66% success rate (Mei et al., 24 May 2025)
Agent S2	OSWorld	27.0% at 15 steps; 34.5% at 50 steps (Agashe et al., 1 Apr 2025)
MACU DAG system	OSWorld	43.8% $\rightarrow$ 48.5% SR (Koh et al., 1 Jun 2026)
MACU DAG system	Odysseys	8.5% $\rightarrow$ 34.0% SR; $\approx 1.47\times$ speedup (Koh et al., 1 Jun 2026)
IntentCUA	286-task evaluation	74.83% success; SER 0.91 (Lee et al., 19 Feb 2026)
UltraCUA-32B-RL	OSWorld-Verified	41.0% SR at 15 steps (Yang et al., 20 Oct 2025)
MCPWorld Hybrid	201 tasks	75.12% task completion accuracy (Yan et al., 9 Jun 2025)

LiteCUA’s 14.66% success rate on OSWorld is explicitly presented as evidence that a lightweight agent can outperform several specialized frameworks when the environment is contextualized through MCP abstractions (Mei et al., 24 May 2025). Agent S2 reports 27.0% success at 15 steps and 34.5% at 50 steps on OSWorld, as well as 29.8% on WindowsAgentArena and 54.3% on AndroidWorld, with ablations attributing gains to both Mixture-of-Grounding and Proactive Hierarchical Planning (Agashe et al., 1 Apr 2025). UltraCUA, which is not itself a MACU runtime but is relevant to MACU execution capabilities, reports 28.9% for the 7B RL model and 41.0% for the 32B RL model on OSWorld-Verified at 15 steps, plus 21.7% success on WindowsAgentArena without Windows-specific training (Yang et al., 20 Oct 2025).

The explicit DAG-based MACU system supplies the clearest evidence for multi-agent gains over single-agent baselines. On OSWorld, success rate improves from 43.8% to 48.5%; on Online-Mind2Web, from 52.2% to 55.6%; on WebTailBench-v2, from 20.8% to 29.5%; and on Odysseys, from 8.5% to 34.0%, with average task completion wall-clock time on Odysseys improving from 162.4 to 110.3 minutes, approximately $1.47\times$ faster (Koh et al., 1 Jun 2026). Replanning occurs on most tasks, and the reported graph sizes grow from initial decomposition to final execution, especially on long-horizon web tasks.
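The arithmetic behind these comparisons is easy to check. The snippet below recomputes the percentage-point gains and the Odysseys wall-clock ratio from the figures quoted above; the variable names are illustrative.

```python
# (single-agent SR, multi-agent SR) in percent, as quoted in the text.
results = {
    "OSWorld": (43.8, 48.5),
    "Online-Mind2Web": (52.2, 55.6),
    "WebTailBench-v2": (20.8, 29.5),
    "Odysseys": (8.5, 34.0),
}

# Absolute gain in percentage points per benchmark.
gains = {name: round(after - before, 1) for name, (before, after) in results.items()}

# Odysseys average wall-clock time in minutes, before and after.
baseline_min, macu_min = 162.4, 110.3
speedup = baseline_min / macu_min                            # ratio of times
time_saved_pct = 100 * (baseline_min - macu_min) / baseline_min

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"Odysseys speedup: {speedup:.2f}x ({time_saved_pct:.1f}% less time)")
```

The gains range from 3.4 points on Online-Mind2Web to 25.5 points on Odysseys, which is consistent with the observation that the longest-horizon benchmark benefits most.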

IntentCUA reports a different evaluation regime: 286 tasks spanning WebVoyager, ScreenAgent, and in-house local productivity tasks, with an overall 74.83% success rate, Step Efficiency Ratio of 0.91, and average latency of 1.46 minutes. Its baseline comparisons report 38.8% for UI-TARS-1.5 and 51.2% for UFO2, with significantly higher latency for both (Lee et al., 19 Feb 2026). MCPWorld, by contrast, focuses on modality benchmarking rather than planner architecture. On 201 curated tasks across 10 open-source apps, GUI-Only yields 70.65% SR, MCP-Only 53.23%, and Hybrid 75.12%, with corresponding KSCR values of 68.82%, 59.78%, and 69.63% (Yan et al., 9 Jun 2025).

Two broad empirical conclusions recur. First, longer-horizon and decomposable tasks tend to benefit most from multi-agent coordination, continual replanning, or hybrid tool access. Second, benchmark design matters: systems evaluated on internal white-box app signals, on judge-based web tasks, and on desktop action traces are not measuring exactly the same capability, even when all are described as computer use.

6. Safety, governance, and unresolved questions

MACU increases capability, but recent work shows that it can also amplify risk. BraveGuard frames the core safety issue at the trajectory level: harmful outcomes in computer-use agents often emerge only through multi-step execution traces whose individual actions appear locally benign. It introduces an adaptive defense loop that mines open-world threat signals, instantiates executable computer-use tasks, collects rollouts, derives trajectory-level supervision, and trains guard models over serialized traces (Feng et al., 31 May 2026). On AgentHazard, averaged guard-model accuracy increases from 38.79% to 82.38% under the GPT-5.5 backend, and the full self-evolving loop reaches 89.22% F1 in the reported ablation.

OS-BLIND demonstrates that multi-agent composition can worsen safety even when user instructions are benign. The benchmark contains 300 human-authored tasks across 12 categories, 8 applications, and 2 threat clusters. It reports that most CUAs exceed 90% attack success rate, that Claude 4.5 Sonnet reaches 73.0% ASR as a single agent, and that this rises to 92.7% when the model is deployed in multi-agent systems (Ding et al., 12 Apr 2026). The paper attributes part of this effect to decomposition granularity: subtasks can obscure the harmful global intent from the executor, and safety alignment tends to activate primarily at step 1 and rarely re-engages during subsequent execution.

A related concern appears in AdvCUA, which evaluates terminal-oriented OS-control agents against 140 tasks aligned with MITRE ATT&CK Enterprise techniques and kill chains. It reports that TTP-based malicious tasks are frequently executable by mainstream CUAs, that attack success rises with repeated attempts, and that current guardrails do not reliably prevent TTP-induced misuse (Luo et al., 8 Oct 2025). Although the benchmark is single-agent, its synthesis explicitly argues that MACU could amplify these risks through division of labor across privilege escalation, credential access, lateral movement, and exfiltration roles.

Governance-oriented work responds by shifting safety from model behavior to harness structure. ALARA for Agents introduces an exposure objective that aims to minimize tool and context exposure subject to role and feasibility constraints (Agostino et al., 20 Mar 2026). The significance of this formulation is that it replaces interpretive compliance with structural confinement: tools outside an agent’s Jinx list are not in the agent’s schema and cannot be invoked. AIOS and LiteCUA, by contrast, rely mainly on VM sandboxing and constrained action spaces, and explicitly do not specify rollback, retries, or detailed permission roles (Mei et al., 24 May 2025).

One common misconception is that more agents automatically yield safer behavior because specialized roles can include auditors or guards. The available evidence does not support that generalization. Another misconception is that stronger reasoning models alone will resolve computer-use brittleness. Work on MCP contextualization, white-box app instrumentation, hybrid tool calls, and least-privilege harnesses suggests instead that MACU safety and reliability are jointly determined by orchestration, environment representation, action semantics, and governance structure.

7. Research directions and field outlook

Current MACU work points toward several converging directions. One is richer environment abstraction: AIOS argues for semantically contextualized computers; MCPWorld argues for white-box, modality-agnostic evaluation; UltraCUA argues for hybrid action; and ALARA argues for declarative, enforceable context and tool scoping (Mei et al., 24 May 2025, Yan et al., 9 Jun 2025, Yang et al., 20 Oct 2025, Agostino et al., 20 Mar 2026). A second is more adaptive coordination: AnyMAC replaces fixed topologies with sequential routing; the DAG-based MACU framework treats replanning and partial observability as first-class; IntentCUA uses intent-level memory and subgroup skill retrieval to reduce redundant replanning (Wang et al., 21 Jun 2025, Koh et al., 1 Jun 2026, Lee et al., 19 Feb 2026). A third is explicit safety instrumentation, where trajectory-level supervision, event-stream capture, and policy-aware intervention become standard components of the orchestrator rather than external add-ons (Feng et al., 31 May 2026).

Several open problems remain explicit in the literature. Protocol clarity is incomplete in MCP-based systems, which often describe JSON schemas and tool exposure conceptually without publishing exact endpoint or serialization specifications (Mei et al., 24 May 2025). Multi-agent evaluation remains fragmented: MCPWorld is agent-agnostic but its published experiments are single-agent; OS-BLIND is safety-focused and shows multi-agent degradation; the main DAG-based MACU paper emphasizes capability on desktop and web benchmarks but not multi-agent safety (Yan et al., 9 Jun 2025, Ding et al., 12 Apr 2026, Koh et al., 1 Jun 2026). IntentCUA reports no code availability in the paper, and LiteCUA does not detail its prompts, backbone choice, or recovery procedures (Lee et al., 19 Feb 2026, Mei et al., 24 May 2025).

The field therefore appears to be moving from monolithic computer-use agents toward modular teams that standardize environment exposure, specialize decision roles, and externalize memory and safety into shared infrastructure. This suggests that MACU is becoming less a narrow variant of agent orchestration and more a general systems framework for long-horizon interaction with digital environments, where planning, execution, retrieval, evaluation, and governance are designed as interoperable layers rather than as properties of a single model.