
CoAct-1: Multi-Agent Automation Framework

Updated 11 August 2025
  • CoAct-1 is a multi-agent framework that integrates GUI actions with code execution to streamline complex computer tasks.
  • It employs an Orchestrator, GUI Operator, and Programmer agent to dynamically decompose and execute subtasks with optimal efficiency.
  • Evaluated on the OSWorld benchmark, it achieves a 60.76% success rate and reduces average task steps, outperforming GUI-only methods.

CoAct-1 is a multi-agent framework for computer automation in which autonomous agents synergistically combine GUI-based actions with direct programmatic execution. The central innovation is to treat coding—specifically, the generation and execution of Python or Bash scripts—as an internalized, native agent action, thereby bypassing many of the inefficiencies and brittleness characteristic of GUI-only approaches. The CoAct-1 system incorporates an Orchestrator for high-level planning, a GUI Operator for conventional visual interaction, and a Programmer agent for code-based task completion. Evaluated on the OSWorld benchmark, CoAct-1 achieves state-of-the-art results, indicating substantial empirical gains in both task success and efficiency relative to prior GUI-centric baselines (Song et al., 5 Aug 2025).

1. System Architecture and Core Components

CoAct-1 is structured as a modular multi-agent system in which each component specializes in a distinct interaction modality. The three primary components are:

  • Orchestrator: The planning component. Receives the user's natural-language goal and state observations (such as screenshots and logs), decomposes the main task into a sequence of subtasks, and delegates each subtask to either the GUI Operator or the Programmer agent based on which route is most suitable for solving it efficiently.
  • GUI Operator: A vision-LLM responsible for carrying out actions via the GUI (mouse clicks, keystrokes, drag-and-drop). This agent operates by interpreting images and contextual text before generating natural language instructions that are then translated into actionable system operations by a GUI interpreter.
  • Programmer Agent: Specialized in generating and executing scripts (Python or Bash) directly on the system. This bypasses lengthy or error-prone GUI manipulations—especially for file management, data processing, and multi-app workflows. The Programmer interacts in a multi-turn conversation loop with a code interpreter, ensuring correct translation of system-level goals into a sequence of code executions.

Each agent maintains a local conversation memory for the duration of its assigned subtask. All interactions are centrally coordinated by the Orchestrator, which also manages step budgets and system context across subtasks.
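This division of responsibilities can be made concrete with a short structural sketch. The class and field names below are illustrative assumptions chosen for this article; the paper specifies the roles, not this interface.

```python
# Hypothetical structural sketch of CoAct-1's agent roles; names and
# interfaces are assumptions for illustration, not the authors' code.
from dataclasses import dataclass, field
from typing import Literal

Route = Literal["gui", "code"]  # the two execution channels

@dataclass
class Subtask:
    description: str  # natural-language goal handed to a sub-agent
    route: Route      # channel chosen by the Orchestrator

@dataclass
class SubAgent:
    """Shared shape of the GUI Operator and Programmer agents."""
    name: str
    memory: list[str] = field(default_factory=list)  # local conversation memory

    def record(self, turn: str) -> None:
        self.memory.append(turn)

@dataclass
class Orchestrator:
    """Central planner: decomposes the goal and delegates routed subtasks."""
    step_budget: int
    plan: list[Subtask] = field(default_factory=list)

    def decompose(self, goal: str) -> list[Subtask]:
        # A real system would call a planning LLM here; this stub only
        # illustrates that the output is an ordered, routed plan.
        self.plan = [
            Subtask("open the report in LibreOffice Calc", route="gui"),
            Subtask("deduplicate rows with a Python script", route="code"),
        ]
        return self.plan

orchestrator = Orchestrator(step_budget=100)
for st in orchestrator.decompose("clean up the quarterly report"):
    print(f"[{st.route}] {st.description}")
```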

2. Technical Workflow and Operational Boundaries

At each decision point, the Orchestrator evaluates the user input and current system state to select the optimal action channel:

  • When a subtask is more efficiently solved by coding (e.g., batch file copying, text parsing), it delegates to the Programmer agent, which iteratively proposes code, receives execution feedback, and revises as needed until the subtask completes or a preset round/step limit is reached (see the loop sketched after this list).
  • When GUI interaction is required (e.g., navigating application menus), the Orchestrator routes the subtask to the GUI Operator, which then generates and issues the required GUI actions.
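A minimal sketch of the Programmer agent's propose-execute-revise loop follows. Here `propose_code` stands in for the underlying LLM call, and the subprocess-based sandbox is likewise an assumed interface; the paper describes the shape of the loop, not these exact signatures.

```python
# Minimal sketch of the Programmer agent's propose-execute-revise loop.
# `propose_code` (the LLM call) and the subprocess sandbox are assumed
# interfaces, not the paper's implementation.
import subprocess
from typing import Callable, Optional

MAX_ROUNDS = 20  # the paper's per-subtask round limit I

def execute(script: str) -> tuple[bool, str]:
    """Run a proposed Python script and capture success plus output."""
    proc = subprocess.run(
        ["python3", "-c", script],
        capture_output=True, text=True, timeout=60,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve_with_code(
    subtask: str,
    propose_code: Callable[[str, str], str],
) -> Optional[str]:
    feedback = ""
    for _ in range(MAX_ROUNDS):
        script = propose_code(subtask, feedback)  # revise using feedback
        ok, output = execute(script)
        if ok:
            return output   # subtask completed
        feedback = output   # feed the error back for the next round
    return None             # round budget exhausted
```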

To ensure focus and efficiency, CoAct-1 introduces upper bounds on task execution:

  • Programmer agent: maximum rounds I = 20
  • GUI Operator: maximum steps K = 25

Total system steps per episode are thus bounded by I × K, and both agents can request additional context from the Orchestrator within these limits.

This configuration supports robust planning: the Orchestrator leverages feedback from both agents (execution results, error messages, system screenshots), updates its internal state representation, and iteratively refines the task decomposition and agent assignments.
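Under the stated limits, the control loop might compose as follows. This is a toy sketch of one plausible reading of how the budgets and feedback interact; the replanning policy and all stub functions are assumptions, not the published control flow.

```python
# Sketch of a budget-bounded orchestration loop, using the paper's
# per-subtask limits (I = 20 programmer rounds, K = 25 GUI steps).
# The dispatch/replan stubs are assumptions for illustration.
from dataclasses import dataclass
import random

I_MAX_ROUNDS = 20  # Programmer agent: rounds per subtask
K_MAX_STEPS = 25   # GUI Operator: steps per subtask

@dataclass
class Subtask:
    description: str
    route: str  # "code" or "gui"

def dispatch(subtask: Subtask) -> bool:
    """Stub for one Programmer round or one GUI step; returns success."""
    return random.random() < 0.3

def replan(failed: Subtask) -> list[Subtask]:
    """Stub replanner: a real Orchestrator would consult execution logs,
    error messages, and fresh screenshots before re-decomposing."""
    return []  # give up in this toy version

def run_episode(plan: list[Subtask]) -> None:
    while plan:
        subtask = plan.pop(0)
        budget = I_MAX_ROUNDS if subtask.route == "code" else K_MAX_STEPS
        if not any(dispatch(subtask) for _ in range(budget)):
            plan = replan(subtask)  # feedback-driven re-decomposition

run_episode([Subtask("merge monthly CSVs", "code"),
             Subtask("export chart as PNG", "gui")])
```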

3. Empirical Performance and Benchmarking

CoAct-1 is evaluated on the OSWorld benchmark, a rigorous test suite for computer-based agents involving multi-application workflows, OS-level tasks, and real-world productivity scenarios. Key empirical results are:

Metric                              | CoAct-1 | GUI-only SOTA
Success rate (100+ step budget)     | 60.76%  | 59.93%
Average steps per successful task   | 10.15   | 15
  • State-of-the-art robustness: At a 100+ step interaction budget, CoAct-1 outperforms specialized GUI-only methods, establishing a new SOTA success rate of 60.76%.
  • Efficiency: The mean number of steps per completed task drops from 15 (GUI-only) to 10.15, reflecting the impact of code-based "macro" actions in streamlining execution (see the example after this list).
  • Subdomain performance: In scenarios such as "LibreOffice Calc" operations and multi-application pipelines, the Programmer agent demonstrates marked superiority over GUI-based alternatives due to its ability to encode complex actions succinctly in code.
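The efficiency gap is easy to see in a concrete case: a bulk file operation that would cost many GUI steps per file (open file manager, select, copy, paste, rename, repeat) collapses into one scripted action. The snippet below is an illustrative example of such a "macro" with hypothetical paths, not a task from the benchmark.

```python
# Illustrative "macro" action: one script replaces a long sequence of
# GUI steps for a batch copy-and-rename task. Paths are hypothetical.
import shutil
from pathlib import Path

src = Path("reports/2025")
dst = Path("archive/2025")
dst.mkdir(parents=True, exist_ok=True)

# Copy every CSV, prefixing the filename with its parent folder name;
# doing this through a GUI would cost several steps per file.
for csv in sorted(src.glob("*.csv")):
    shutil.copy2(csv, dst / f"{src.name}_{csv.name}")
```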

4. Applications and System Implications

By natively incorporating code execution as an agent action, CoAct-1 generalizes across a broad set of real-world computer automation domains:

  • End-user automation: Accelerates and stabilizes desktop operations that are tedious or error-prone with pure GUI automation—such as bulk data processing, CSV file manipulation, and batch system administration.
  • Multi-application orchestration: Facilitates transitions and data flow between software (e.g., from spreadsheets to browsers), enabling coherent workflows that would otherwise require unreliable GUI scripting.
  • Scalability and system integration: The multi-agent modularity and decision-theoretic delegation protocol allow for scaling to diverse operating environments, making the framework viable in both home-user and enterprise/IT contexts.

A plausible implication is that this approach, by leveraging both the universality of GUIs and the expressivity of scripting, is well suited for future human-in-the-loop or hybrid-automation scenarios, including assistive agents for system administrators, power users, and technical end-users.

5. Design Rationale and Methodological Significance

CoAct-1's primary departure from previous frameworks is its treatment of coding as a first-class agent action, rather than as a separate code-generation pipeline. Its agentic structure includes:

  • Dynamic multi-step orchestration: The Orchestrator's planning loop is stateful, memory-augmented, and able to re-evaluate plans mid-execution based on agent feedback, mitigating error propagation.
  • Conversational inter-agent protocol: Each agent (Operator or Programmer) holds a dialogue state and can request clarification or signal impasses, supporting recovery from partial failures and enabling fine-grained error handling (one possible message shape is sketched after this list).
  • Operational efficiency: Through bounded agent interactions, the system maintains tractability even in complex, long-horizon tasks.
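One plausible shape for the conversational inter-agent protocol mentioned above is sketched here. The message types and handler are assumptions chosen to illustrate clarification requests and impasse signals; the paper does not publish this API.

```python
# Hypothetical message protocol between Orchestrator and sub-agents;
# the enum values and dataclass are illustrative, not the paper's API.
from dataclasses import dataclass
from enum import Enum, auto

class Signal(Enum):
    PROGRESS = auto()       # normal step report
    DONE = auto()           # subtask finished
    NEED_CONTEXT = auto()   # agent requests clarification or context
    IMPASSE = auto()        # agent cannot proceed; Orchestrator replans

@dataclass
class AgentMessage:
    sender: str     # "gui_operator" or "programmer"
    signal: Signal
    payload: str    # log excerpt, error text, or screenshot reference

def handle(msg: AgentMessage) -> str:
    """Toy Orchestrator-side handler for fine-grained error recovery."""
    if msg.signal is Signal.NEED_CONTEXT:
        return f"share context with {msg.sender}"
    if msg.signal is Signal.IMPASSE:
        return f"re-decompose subtask reported by {msg.sender}"
    return "continue"

print(handle(AgentMessage("programmer", Signal.IMPASSE, "KeyError: 'date'")))
```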

This suggests a shift in the design paradigm for agentic automation: integrating code generation into the agent's action repertoire can transform the class of solvable tasks, reduce interaction cost, and improve system robustness against surface-level GUI changes.

6. Limitations and Future Research Directions

The authors identify several avenues for further development:

  • Orchestrator reasoning: Improvement is needed in high-level intent inference, particularly for ambiguous queries or goals requiring layered context.
  • Programming augmentation: Extension to additional scripting languages and domain-specific program synthesis may increase coverage for professional workflows.
  • Dialogue and memory enhancements: Continued optimization of agent interaction history and context sharing could enable more effective handling of tasks with high cross-subtask dependency.
  • Advanced perception: Integration with stronger vision-LLMs could reduce GUI misinterpretation errors, particularly in visually dynamic or unstructured interfaces.
  • Dynamic modality balancing: Further research could adaptively adjust the use of coding versus GUI actions, leveraging user feedback or domain characteristics to optimally allocate computational effort.

The authors propose that addressing these limitations could lead to higher overall system success rates and broader applicability across additional automation domains.

7. Relationship to Broader Research Context

CoAct-1 represents the convergence of several trends in autonomous agent design:

  • Internalizing programmatic capabilities: Elevates code as a native action space (cf. recent work on agentic LLMs with integrated tool use).
  • Conversation-centered modularity: Echoes multi-agent, planner-executor models for long-horizon task solving, but with explicit code-generation alongside visual interaction.
  • Benchmark-driven evaluation: Establishes clear quantifiable metrics for agent efficiency and reliability on a realistic, multi-app testbed (OSWorld).

In summary, CoAct-1 systematically advances computer automation by integrating code execution alongside GUI actions within a multi-agent orchestrated workflow, yielding substantial gains in both efficiency and success rates on complex real-world automation tasks (Song et al., 5 Aug 2025).

References

  • Song et al., 5 Aug 2025.
