- The paper introduces a hybrid agentic framework that combines coding actions with GUI manipulation to enhance efficiency and reduce errors in task automation.
- It employs a multi-agent system with an Orchestrator, Programmer, and GUI Operator to dynamically delegate tasks based on their specific requirements.
- Experimental results on the OSWorld benchmark reveal significant improvements in success rates and step efficiency compared to traditional GUI-only agents.
CoAct-1: Computer-using Agents with Coding as Actions
Motivation and Background
The automation of computer tasks by autonomous agents has traditionally relied on GUI-based interaction, leveraging vision-language models (VLMs) to interpret and manipulate graphical interfaces. While this paradigm has enabled progress on open-ended task completion, it is fundamentally limited by the brittleness and inefficiency of long-horizon GUI action sequences. GUI agents are susceptible to visual grounding ambiguity, error accumulation over extended workflows, and the inherent constraints of pixel-level interaction. Recent modular approaches have introduced hierarchical planners to decompose tasks, but execution remains bottlenecked by GUI manipulation, leaving high-level planning disconnected from robust low-level control.
CoAct-1 introduces a hybrid agentic framework that augments GUI-based actions with direct programmatic execution, expanding the agent's action space to include coding (Python/Bash) as a first-class modality. This design enables agents to bypass inefficient GUI workflows for tasks amenable to scripting, while retaining GUI interaction for tasks requiring visual grounding or human-like manipulation.
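To make the idea of coding as a first-class action concrete, the following is a minimal sketch of what such a hybrid action space could look like. The type names are assumptions made for exposition, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Union

# Primitive GUI actions: pixel-level events issued by a GUI agent.
@dataclass
class Click:
    x: int
    y: int

@dataclass
class TypeText:
    text: str

# Coding action: an entire Python or Bash script executed in one step
# via a remote code interpreter.
@dataclass
class RunScript:
    language: str  # "python" or "bash"
    source: str

# The hybrid action space treats a whole script as a single action,
# on equal footing with individual mouse/keyboard events.
Action = Union[Click, TypeText, RunScript]
```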
Figure 1: Multi-agent system design for CoAct-1, featuring an Orchestrator, Programmer, and GUI Operator.
System Architecture
CoAct-1 is instantiated as a multi-agent system comprising three specialized agents:
- Orchestrator: Serves as the central planner, decomposing user goals into subtasks and dynamically delegating execution to either the Programmer or GUI Operator based on subtask characteristics.
- Programmer: Executes backend operations by generating and running Python or Bash scripts, interfacing with the OS via a remote code interpreter. The Programmer operates in multi-round dialogue with the code interpreter, iteratively refining scripts based on execution feedback.
- GUI Operator: A VLM-based agent that performs frontend actions (mouse, keyboard) via GUI manipulation, interacting with the OS through a GUI action interpreter.
The agents maintain isolated conversation histories, with the Orchestrator aggregating summaries and screenshots from completed subtasks to inform subsequent planning. This separation ensures focused reasoning and prevents cross-agent contamination of context.
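The delegation and feedback loop described above can be summarized in sketch form. Every class, method, and field name below (`plan`, `execute`, `generate_script`, and so on) is an illustrative assumption rather than the paper's actual API; the real system is built on AG2 and a remote OS interface, and the default budgets here are placeholders.

```python
def run_task(user_goal, orchestrator, programmer, gui_operator, env, max_subtasks=15):
    history = []  # summaries + screenshots of completed subtasks
    for _ in range(max_subtasks):
        # 1. Plan: decompose the goal and choose the next subtask and executor.
        subtask = orchestrator.plan(user_goal, history, env.screenshot())
        if subtask.done:
            return subtask.final_answer
        # 2. Delegate: coding subtasks go to the Programmer, visual ones to the
        #    GUI Operator; each agent keeps its own isolated conversation history.
        agent = programmer if subtask.modality == "code" else gui_operator
        result = agent.execute(subtask, env)
        # 3. Aggregate: only a summary and a fresh screenshot flow back to the
        #    Orchestrator, keeping its planning context focused.
        history.append({"summary": result.summary, "screenshot": env.screenshot()})
    return None  # subtask budget exhausted


def programmer_execute(subtask, coder_llm, interpreter, max_rounds=20):
    """Multi-round dialogue with the remote code interpreter: generate a
    Python/Bash script, run it, and revise based on execution feedback."""
    feedback, result = None, None
    for _ in range(max_rounds):
        script = coder_llm.generate_script(subtask, feedback)
        result = interpreter.run(script)   # remote execution on the OS
        if result.success:
            break
        feedback = result.stderr           # feed errors into the next attempt
    return result
```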
Figure 2: CoAct-1 workflow: Orchestrator delegates subtasks to Programmer (coding) or GUI Operator (GUI actions) based on task requirements.
Implementation Details
CoAct-1 is implemented atop the AG2 multi-agent orchestration framework, with backbone models selected for each agent to optimize performance:
- Orchestrator: OpenAI o3 or o4-mini, selected for strong planning and reasoning capabilities.
- Programmer: o4-mini, chosen for code generation and iterative refinement.
- GUI Operator: OpenAI CUA 4o, a VLM finetuned for computer use.
The system interfaces with the OSWorld benchmark via an extended RESTful server, supporting remote code execution and GUI action interpretation. Task budgets are enforced via step limits (max rounds for each agent), with the overall system capped at 375 interactions per task.
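The exact configuration interface is not described here, so the snippet below is only a sketch of how the backbone choices and step budgets above might be encoded. The per-agent round limits are placeholder values; only the overall 375-interaction cap and the backbone model names come from the text.

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    backbone: str    # model used by this agent (names as reported above)
    max_rounds: int  # per-agent step limit (placeholder values)

# Illustrative configuration mirroring the setup described above.
COACT1_CONFIG = {
    "orchestrator": AgentConfig(backbone="o3",      max_rounds=15),
    "programmer":   AgentConfig(backbone="o4-mini", max_rounds=25),
    "gui_operator": AgentConfig(backbone="cua-4o",  max_rounds=25),
}

OVERALL_INTERACTION_CAP = 375  # system-wide limit per OSWorld task
```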
Experimental Evaluation
Benchmark and Baselines
CoAct-1 is evaluated on the OSWorld benchmark, which comprises 369 tasks spanning productivity tools, IDEs, browsers, file managers, and multi-app workflows. Baselines include OpenAI CUA 4o, GTA-1, UI-TARS, and other state-of-the-art agentic frameworks. Evaluation relies on three metrics (a small computation sketch follows the list):
- Success Rate: Boolean task completion as determined by rule-based evaluators.
- Efficiency: Average number of steps required for successful task completion.
- Error Rate: Distribution of failures as a function of total actions.
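The sketch below shows how these metrics could be computed from per-task records; the `TaskRecord` fields are assumptions, and the boolean outcome itself comes from OSWorld's rule-based evaluators.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    success: bool  # outcome of OSWorld's rule-based evaluator
    steps: int     # total actions the agent issued on this task

def success_rate(records):
    return sum(r.success for r in records) / len(records)

def avg_steps_successful(records):
    done = [r.steps for r in records if r.success]
    return sum(done) / len(done) if done else float("nan")

def failure_rate_by_step_bucket(records, bucket=10):
    """Error rate as a function of total actions: fraction of failed
    tasks within each bucket of action counts."""
    buckets = {}
    for r in records:
        key = (r.steps // bucket) * bucket
        total, failed = buckets.get(key, (0, 0))
        buckets[key] = (total + 1, failed + (not r.success))
    return {k: f / t for k, (t, f) in sorted(buckets.items())}
```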
Results
CoAct-1 establishes a new state-of-the-art on OSWorld, achieving a success rate of 60.76% under the 100-step budget, outperforming GTA-1 (53.10%) and OpenAI CUA 4o (31.40%). The gains are most pronounced in domains where programmatic control is advantageous, such as LibreOffice Calc (70.21%), VS Code (78.26%), and multi-app workflows (47.88%).
Figure 3: CoAct-1 achieves lower average steps per successful task compared to previous SOTA agentic frameworks, with higher accuracy than OpenAI CUA 4o.
Efficiency analysis reveals that CoAct-1 solves tasks in an average of 10.15 steps, compared to 15.22 for GTA-1 and 14.90 for UI-TARS. While OpenAI CUA 4o averages fewer steps (6.14), its success rate is substantially lower, indicating that CoAct-1's efficiency is coupled with greater robustness.
Ablation studies further show that backbone selection for each agent significantly impacts performance, with the best results obtained by pairing a strong vision-centric GUI Operator (CUA 4o) with an upgraded Programmer backbone (o4-mini).
Analysis and Discussion
Action Modality and Error Reduction
The hybrid action space enables CoAct-1 to strategically select between coding and GUI actions, reducing the total number of steps and minimizing error propagation. Coding actions are particularly effective in domains requiring batch operations, file management, and data processing, where a single script can replace a lengthy sequence of GUI manipulations. Empirical analysis shows a positive correlation between total actions and error rate, underscoring the importance of step minimization for reliability.
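As an illustrative example (not one taken from the paper), a file-management chore that would require one click-heavy GUI pass per file collapses into a single coding action; the path and prefix below are purely hypothetical.

```python
from pathlib import Path

# One coding action replacing dozens of GUI steps: prefix every .txt
# report in a folder with "archived_". Path and prefix are illustrative.
target = Path("/home/user/reports")
for f in sorted(target.glob("*.txt")):
    f.rename(f.with_name("archived_" + f.name))
```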
Failure Modes
Case studies highlight two primary sources of failure: high-level queries requiring conceptual inference beyond explicit instructions, and ambiguous queries lacking sufficient detail for disambiguation. These failures point to limitations in current LLM reasoning and context interpretation, suggesting avenues for future research in intent inference and context augmentation.
Practical and Theoretical Implications
CoAct-1 demonstrates that expanding the agentic action space to include coding yields substantial gains in both efficiency and reliability for computer automation. The multi-agent architecture facilitates modularity, enabling targeted improvements to individual agents (e.g., upgrading the Programmer for better code synthesis). The approach is scalable to heterogeneous environments and can be extended to support additional modalities (e.g., API calls, system-level hooks).
From a theoretical perspective, CoAct-1 provides evidence that hybrid action spaces mitigate the brittleness of pure GUI-based agents and enable more generalized, robust automation. The dynamic delegation mechanism implemented by the Orchestrator is a promising direction for adaptive agentic systems capable of reasoning about action modality selection.
Future Directions
Potential future developments include:
- Integration of more advanced LLMs for improved intent inference and ambiguity resolution.
- Extension to additional programming languages and system interfaces.
- Incorporation of self-evolutionary reinforcement learning for agent specialization.
- Application to broader domains beyond OS-level automation, such as web-based workflows and cloud environments.
Conclusion
CoAct-1 advances the state-of-the-art in computer-using agents by introducing coding as a core action modality, synergistically combining programmatic execution with GUI manipulation. The multi-agent system achieves superior performance and efficiency on the OSWorld benchmark, validating the efficacy of hybrid action spaces for generalized computer automation. The results suggest that future agentic frameworks should prioritize flexible action spaces and dynamic delegation to maximize robustness and scalability.