CoAct-1: Hybrid Coding & GUI Automation

Updated 7 August 2025

The paper presents a hybrid agent system that integrates coding as an action with GUI control, achieving 60.76% task success and a 32% reduction in operation steps on the OSWorld benchmark.
CoAct-1 is a multi-agent system consisting of an Orchestrator, Programmer, and GUI Operator, each handling specific subtasks in complex computer automation.
The approach improves task reliability and efficiency by dynamically selecting between programmatic code execution and visual GUI manipulation for diverse automation scenarios.

CoAct-1 is a multi-agent autonomous computer-using system that introduces “coding as an action” alongside traditional GUI-based operation. Developed to overcome the limitations of agents that interact with computers purely via GUIs, CoAct-1 leverages a hybrid control paradigm in which an Orchestrator intelligently delegates user tasks to either a GUI Operator agent—capable of vision-language control of graphical environments—or a Programmer agent that can generate and execute Python or Bash scripts. This architectural innovation enables more efficient, reliable, and robust operation on complex, long-horizon tasks, as demonstrated by state-of-the-art results on the OSWorld benchmark (Song et al., 5 Aug 2025).

1. System Architecture

CoAct-1 consists of three principal components, each fulfilling a specialized role within the multi-agent system:

Component	Role	Output/Feedback
Orchestrator	Task decomposition, agent assignment, memory management	Delegation decision, memory updates
Programmer	Program synthesis and execution (Python/Bash)	Code outputs (file paths, summaries), screenshot
GUI Operator	Vision-language GUI manipulation	GUI operation summary, screenshot

The Orchestrator parses a user’s high-level instruction into well-defined subtasks, using a broad persistent memory of conversation history with both agents. For each subtask, it dynamically selects between code execution and GUI manipulation based on task type, efficiency, and observed system feedback.
The Programmer Agent is invoked when a task is best executed through direct programmatic interaction (typically file management, batch data processing, or system configuration). It synthesizes Python or Bash code, executes it through an interpreter connected to the operating system, and exchanges multiple rounds with the Orchestrator to confirm results via execution output and screenshots.
The GUI Operator acts as a traditional vision-language agent performing direct screen interaction through mouse, keyboard, and visual context. It executes detailed visual operations and returns summaries and context-specific screenshots.

The control flow is highly modular, with the Orchestrator selecting the most reliable and efficient execution method for each step. Diagrams in the original paper (Figures 1 and 2) represent this workflow explicitly as a branching decision tree routed by the Orchestrator.
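The roles and feedback channels in the table above can be pictured as a few small data types. This is a minimal sketch only: the names `Route`, `Feedback`, and `OrchestratorMemory` are hypothetical and do not come from the paper, and the screenshot is a placeholder rather than a real capture.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Route(Enum):
    """Which agent the Orchestrator hands a subtask to (hypothetical names)."""
    PROGRAMMER = "programmer"      # Python/Bash script execution
    GUI_OPERATOR = "gui_operator"  # vision-language mouse/keyboard control


@dataclass
class Feedback:
    """What either agent returns: a text summary plus a screenshot."""
    summary: str
    screenshot: Optional[bytes] = None  # placeholder; nothing is captured here


@dataclass
class OrchestratorMemory:
    """Persistent history of (subtask, route, feedback) entries."""
    history: List[Tuple[str, Route, Feedback]] = field(default_factory=list)

    def record(self, subtask: str, route: Route, feedback: Feedback) -> None:
        self.history.append((subtask, route, feedback))

    def last_summary(self) -> Optional[str]:
        # Most recent agent summary, or None if nothing has run yet.
        return self.history[-1][2].summary if self.history else None


memory = OrchestratorMemory()
memory.record("rename report files", Route.PROGRAMMER, Feedback("renamed 3 files"))
print(memory.last_summary())  # renamed 3 files
```

Keeping both agents on the same `Feedback` shape mirrors the table, where each agent returns a summary and a screenshot regardless of how the subtask was carried out.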

2. Coding as a Core Action

CoAct-1’s defining innovation is the explicit inclusion of “coding as an action” in the agent’s action set, so that a short script can replace a long sequence of GUI steps:

If the Orchestrator identifies that a subtask involves systematic data transformation, bulk file operations, or configuration changes, it dispatches the request to the Programmer.
The Programmer analyzes the current context, generates Python or Bash scripts, and submits them to a system interpreter. It then collects and forwards outputs (such as file paths, error logs, or screenshots) back to the Orchestrator.
This mechanism supplants long, repetitive, and often error-prone GUI operations with single- or few-step deterministic programmatic actions, dramatically reducing total action steps and error propagation.

In effect, the agent alternates between GUI and code, exploiting the relative strengths of each approach: coding for systematic, routine, or bulk operations; GUI for highly visual, context-dependent, or ambiguous interaction.
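To make the Programmer's "generate, execute, report back" cycle concrete, the sketch below runs a piece of generated Python in a fresh interpreter and collects its output and error log. It is an illustration under assumptions, not the paper's implementation: `run_python_script` and `ExecutionResult` are hypothetical names, and the real system also returns screenshots, which are omitted here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Outputs a Programmer-style agent could forward to the Orchestrator."""
    returncode: int
    stdout: str
    stderr: str


def run_python_script(code: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Run a generated Python snippet in a separate interpreter process."""
    completed = subprocess.run(
        [sys.executable, "-c", code],  # same interpreter that runs this sketch
        capture_output=True,           # collect stdout and stderr
        text=True,                     # decode bytes to str
        timeout=timeout_s,             # raises TimeoutExpired if exceeded
    )
    return ExecutionResult(completed.returncode, completed.stdout, completed.stderr)


ok = run_python_script("print(6 * 7)")
failed = run_python_script("raise SystemExit(3)")
print(ok.stdout.strip(), failed.returncode)  # 42 3
```

A non-zero return code or non-empty `stderr` is the kind of signal the Orchestrator could use to retry the subtask or reroute it to the GUI Operator.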

3. Performance on OSWorld and Efficiency Gains

Evaluation on OSWorld—a complex, long-horizon computer-use benchmark—demonstrates the substantial real-world impact of coding as an action:

Success Rates: CoAct-1 reaches a state-of-the-art success rate of 60.76% for tasks with a 100+ step budget and 59.93% under a 100-step budget. It consistently outperforms competing systems such as GTA-1 and Agent S2.5.
Step Efficiency: CoAct-1 completes tasks with an average of only 10.15 steps, compared to the 15-step average of leading GUI-based agents—a reduction of approximately 32%.
Task Categories:
- OS-level tasks: 75% success, reflecting the effectiveness of programmatic file and configuration management.
- Multi-application workflows: 47.88% (versus 38.34% in GTA-1).
- Other marked improvements in email (Thunderbird), media control (VLC), and multi-app contexts.
Error Analysis: Reductions in cumulative error rates are correlated with minimized action counts, attributable to script-based atomicity and the avoidance of visually ambiguous GUI steps.

This suggests that hybridizing code execution with GUI control is key for both reliability (lower error rates) and scale (fewer total steps) in generalized automation.
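The reported step reduction follows directly from the two averages quoted above (10.15 steps versus 15 steps). A short check, with `relative_reduction` as a helper name introduced here for illustration:

```python
def relative_reduction(baseline: float, observed: float) -> float:
    """Fraction by which `observed` is lower than `baseline`."""
    return (baseline - observed) / baseline


# Average steps per task, as quoted in the text above.
reduction = relative_reduction(baseline=15.0, observed=10.15)
print(f"{reduction:.1%}")  # 32.3%
```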

4. Task Delegation Logic and Technical Workflow

CoAct-1’s agent selection and execution process can be formalized through iterative delegation and memory-based reasoning. The paper details the following execution logic in pseudocode:

steps = 0
task_complete = False
while not task_complete and steps < MAX_STEPS:
    orchestrator.analyze_state_and_history()
    if orchestrator.deems_coding_suitable():
        # Programmer route: generate a Python/Bash script and run it
        programmer.generate_and_execute_code()
        summary, screenshot = programmer.get_feedback()
    else:
        # GUI route: mouse and keyboard actions on the screen
        gui_operator.perform_visual_actions()
        summary, screenshot = gui_operator.get_feedback()
    orchestrator.update_memory(summary, screenshot)
    task_complete = orchestrator.detects_task_complete()
    steps += 1

At each subtask, feedback loops through summarization and screenshots ensure that action outcomes update the Orchestrator’s state. The task continues until the completion signal (which may be detected via environmental cues, successful code output, or GUI feedback), or until a maximum step limit is reached.
The system’s performance can be formally described via an optimization criterion:

$\min_{\text{action sequence}} \{ f(\text{steps}, \text{error}) \}$

subject to task-success constraints, prioritizing minimal steps and errors.
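The delegation loop in this section can be exercised end to end with stub agents. Everything below is a toy under stated assumptions: in CoAct-1 the routing decision is made by the Orchestrator's reasoning over its memory, whereas this sketch substitutes a hypothetical keyword rule, and the agents return canned strings instead of acting on a real system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Feedback = Tuple[str, str]  # (summary, screenshot placeholder)


@dataclass
class StubOrchestrator:
    subtasks: List[str]
    memory: List[Feedback] = field(default_factory=list)

    def next_subtask(self) -> Optional[str]:
        # One memory entry is stored per finished subtask.
        done = len(self.memory)
        return self.subtasks[done] if done < len(self.subtasks) else None

    def coding_suitable(self, subtask: str) -> bool:
        # Hypothetical keyword rule standing in for the real reasoning step.
        return any(word in subtask for word in ("rename", "convert", "configure"))


def programmer(subtask: str) -> Feedback:
    return (f"script handled: {subtask}", "screenshot-after-code")


def gui_operator(subtask: str) -> Feedback:
    return (f"GUI handled: {subtask}", "screenshot-after-gui")


def run_task(orchestrator: StubOrchestrator, max_steps: int = 15) -> List[str]:
    """Route each subtask to one agent until done or the step budget runs out."""
    routes: List[str] = []
    steps = 0
    while steps < max_steps:
        subtask = orchestrator.next_subtask()
        if subtask is None:  # completion signal: nothing left to do
            break
        if orchestrator.coding_suitable(subtask):
            name, feedback = "programmer", programmer(subtask)
        else:
            name, feedback = "gui_operator", gui_operator(subtask)
        orchestrator.memory.append(feedback)  # feedback updates Orchestrator state
        routes.append(name)
        steps += 1
    return routes


routes = run_task(StubOrchestrator(["rename all .txt files", "click the Save dialog"]))
print(routes)  # ['programmer', 'gui_operator']
```

The step counter and the completion check are the two exit conditions named in the text: the loop stops when no subtask remains or when the step budget is exhausted.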

5. Practical Applications and Scenarios

CoAct-1’s hybrid paradigm is suitable for a broad array of automation contexts:

Office Productivity: Automating spreadsheet operations, file compression, and document conversion via code, supplemented by GUI-based drag-and-drop or menu navigation when visual guidance is needed.
Operating System Administration: Managing directories, searching, batch renaming, or configuration changes using Bash or Python scripts.
Multi-application Workflows: Exporting data from one app, processing with a script, and importing or sending it with another app, without context loss or brittle GUI sequences.
Integrated Development Environments: Modifying settings or manipulating internal consoles programmatically for increased robustness.
General Automation: Tasks requiring a mix of systematic backend control and visually-grounded manipulation, e.g., in customer support, content management, or engineering applications.

The system’s generalized delegation policy and multi-modal feedback make it extensible for diverse use cases and responsive to evolving requirements.
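As an illustration of the batch-renaming scenario listed under operating system administration, the following sketch shows the kind of short script that can replace a long series of GUI rename actions. It is an example written for this article, not output from CoAct-1; `batch_rename` is a hypothetical helper, and the demo runs inside a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path
from typing import List


def batch_rename(directory: Path, old_suffix: str, new_suffix: str) -> List[Path]:
    """Rename every file ending in old_suffix to use new_suffix instead."""
    renamed = []
    for path in sorted(directory.glob(f"*{old_suffix}")):
        target = path.with_suffix(new_suffix)
        if not path.is_file() or target.exists():
            continue  # skip directories and never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    for name in ("a.txt", "b.txt", "notes.md"):
        (workdir / name).write_text("demo")
    renamed_names = [p.name for p in batch_rename(workdir, ".txt", ".md")]
    remaining = sorted(p.name for p in workdir.iterdir())

print(renamed_names)  # ['a.md', 'b.md']
```

One call covers any number of files, whereas a GUI-only agent would need several actions per file, which is the source of the step savings described earlier.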

6. Design Extensions and Research Directions

Identified avenues for further enhancement include:

Improving the Orchestrator’s reasoning capabilities to better disambiguate high-level or indirect instructions (e.g., mapping “debug console” to internal context switches in IDEs).
Integrating richer visual context via advanced computer vision techniques (e.g., higher-resolution screenshot analysis, real-time OCR).
Incorporating adaptive learning mechanisms whereby agent delegation policies update online, learning from error signals or user feedback.
Broadening the action space to include more scripting languages or extensible APIs, allowing for a wider class of tasks to utilize the efficiency of “coding as an action.”
Developing advanced error recovery and robustness strategies for ambiguous or multi-application workflows where failure recovery and state consistency are major challenges.

These extensions are poised to further increase the reliability, efficiency, and applicability of CoAct-1 in large-scale, real-world computer automation deployments.

7. Summary and Impact

CoAct-1 establishes a new paradigm for computer-using agents by embedding coding directly into the agent action space and orchestrating dynamic task delegation between programmatic and GUI control modes. The approach surpasses previous architectures in both efficiency (marked step reduction) and reliability (significantly higher task success rates) on a challenging, multi-application, long-horizon benchmark. Its versatile architecture, modular agent composition, and robust empirical results outline a scalable blueprint for next-generation autonomous computer automation systems. Future enhancements centering on reasoning, multimodal context, and adaptive learning are plausible drivers for continued advancement in this domain (Song et al., 5 Aug 2025).

References (1)

CoAct-1: Computer-using Agents with Coding as Actions (2025)
