- The paper introduces a dual-brain planner–executor architecture that decouples high-level planning from precise low-level execution, trained through successive specialization and generalization stages.
- It employs Group Relative Policy Optimization for decoupled RL, keeping the Executor fixed and aggregating specialist trajectories to enhance cross-domain generalization.
- Empirical results on scientific software benchmarks demonstrate significant improvements in planning accuracy, execution precision, and overall robustness.
CODA: A Dual-Brain Planner–Executor Framework for GUI Agents with Decoupled Reinforcement Learning
Motivation and Problem Setting
The automation of complex GUI-based workflows, particularly in scientific computing and engineering domains, presents a dual challenge: agents must perform long-horizon, high-level planning while also executing precise, low-level actions in visually and functionally intricate environments. Existing monolithic or compositional agent architectures exhibit a trade-off: generalist models (e.g., Qwen2.5-VL) are effective at planning but lack execution precision, while specialist models (e.g., UI-TARS) excel at grounding but are limited in generalization and strategic reasoning. Prior compositional frameworks that decouple planning and execution are typically static and non-trainable, precluding adaptation from experience and limiting their applicability in data-scarce, specialized domains.
Architecture: Planner–Executor Decoupling
CODA introduces a trainable, compositional agent architecture inspired by the functional separation of the human cerebrum (planning) and cerebellum (execution). The framework consists of a high-level Planner (Qwen2.5-VL) and a low-level Executor (UI-TARS-1.5), with a strict decoupling: only the Planner is updated during training, while the Executor remains fixed, providing stable and generalizable action grounding.
Figure 1: The Planner–Executor architecture, with the Planner generating high-level thoughts and the Executor translating them into concrete GUI actions.
The Planner receives the full interaction history together with the current and previous screenshots, and generates explicit, structured plans that specify immediate objectives and target UI elements. The Executor, conditioned on the Planner’s output and the same visual context, produces parameterized pyautogui commands (e.g., click(x, y)), ensuring precise execution.
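A minimal sketch of one interaction step under this decoupling is shown below. The wrapper methods (`plan_step`, `ground_plan`) and the observation format are illustrative assumptions, not the paper's actual interfaces; only the separation of responsibilities follows the description above.

```python
# Illustrative sketch of one Planner–Executor step (interfaces are hypothetical).
from dataclasses import dataclass, field

@dataclass
class Observation:
    screenshot: bytes                    # current screen capture
    prev_screenshot: bytes | None = None # previous screen capture, if any

@dataclass
class AgentState:
    history: list = field(default_factory=list)  # full interaction history

def agent_step(planner, executor, state: AgentState, obs: Observation) -> str:
    """One step: the Planner reasons over history + screenshots, the fixed Executor grounds the plan."""
    # 1. High-level plan: immediate objective plus the target UI element.
    plan = planner.plan_step(history=state.history, obs=obs)   # e.g. "Open the 'File' menu"

    # 2. Low-level grounding: the Executor turns the plan into a pyautogui command.
    action = executor.ground_plan(plan=plan, obs=obs)          # e.g. "pyautogui.click(132, 48)"

    state.history.append({"plan": plan, "action": action})
    return action
```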
Training Pipeline: Specialization and Generalization
CODA employs a two-stage training pipeline:
- Specialization (Stage 1): For each target software application, the Planner is trained via decoupled reinforcement learning using Group Relative Policy Optimization (GRPO) while the Executor is held fixed. The Planner generates multiple candidate plans per step; each is executed by the Executor, and a reward is computed from action-type correctness and parameter precision (e.g., L1 distance for coordinates). The GRPO loss is applied only to the Planner’s reasoning tokens, enabling targeted improvement in strategic planning (a reward and advantage sketch appears at the end of this subsection).
Figure 2: The training process, with the Planner generating plans, the Executor executing actions, and rewards computed for Planner optimization.
- Generalization (Stage 2): Following the specialist-to-generalist paradigm, the specialist Planners (one per application) generate high-quality trajectories, which are aggregated to fine-tune a single generalist Planner via supervised learning on the diverse data distilled from the specialists (a data-aggregation sketch also appears at the end of this subsection).

Figure 3: The specialist-to-generalist strategy, aggregating specialist knowledge into a generalist Planner.
This curriculum enables the Planner to acquire both domain-specific expertise and cross-domain generalization, while maintaining execution reliability via the fixed Executor.
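To make the Stage 1 signal concrete, the sketch below scores each candidate plan by action-type correctness and coordinate precision (L1 distance) and normalizes rewards within the candidate group, in the spirit of GRPO. The reward weights, distance scaling, and reference-action format are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

def plan_reward(pred_action: dict, ref_action: dict, coord_scale: float = 100.0) -> float:
    """Reward one executed plan: action-type correctness + coordinate precision.
    (Weights and scaling are illustrative assumptions, not the paper's exact values.)"""
    if pred_action["type"] != ref_action["type"]:
        return 0.0
    reward = 0.5  # correct action type
    if "xy" in ref_action:  # parameterized actions, e.g. click(x, y)
        l1 = abs(pred_action["xy"][0] - ref_action["xy"][0]) + \
             abs(pred_action["xy"][1] - ref_action["xy"][1])
        reward += 0.5 * max(0.0, 1.0 - l1 / coord_scale)  # closer coordinates -> higher reward
    else:
        reward += 0.5
    return reward

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each candidate's reward within its group."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# These advantages would then weight the policy-gradient loss on the Planner's
# reasoning tokens only, with the Executor held fixed throughout.
```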
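The Stage 2 aggregation is essentially standard supervised fine-tuning over pooled specialist trajectories. A minimal sketch of how such a dataset might be assembled follows; the trajectory fields and the reward-based filtering criterion are assumptions.

```python
def build_generalist_sft_data(specialist_trajectories: dict[str, list[dict]],
                              min_reward: float = 1.0) -> list[dict]:
    """Pool successful trajectories from per-software specialist Planners into one SFT set.
    Each record pairs the Planner's inputs with its plan text as the training target."""
    sft_examples = []
    for software, trajectories in specialist_trajectories.items():
        for traj in trajectories:
            if traj["reward"] < min_reward:   # keep only high-quality rollouts (assumed filter)
                continue
            for step in traj["steps"]:
                sft_examples.append({
                    "software": software,
                    "inputs": step["observation"],  # history + screenshots
                    "target": step["plan"],         # the specialist's plan text
                })
    return sft_examples
```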
Data Generation and Reward Signal
CODA’s training pipeline is supported by an automated data generation and evaluation system:
- Task Generation: High-level tasks are generated by Qwen2.5-72B, seeded with a small set of human-written example instructions and detailed prompts.
- Judge System: A fine-tuned, ensemble-based judge model (Qwen2.5-VL-72B) provides dense, high-precision reward signals by evaluating agent trajectories for correctness, redundancy, and error localization. Voting and multi-resolution input strategies are used to maximize precision and recall (a voting sketch follows this list).
- Distributed VM System: Task execution is parallelized across a distributed cluster of virtual machines, enabling efficient large-scale trajectory collection.
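To illustrate the ensemble judging mentioned above, the sketch below queries a judge model at several input resolutions and majority-votes the verdicts. The judge call, resolution set, and voting rule are illustrative assumptions rather than the paper's exact configuration.

```python
from collections import Counter
from PIL import Image

def resize_to_height(img: Image.Image, height: int) -> Image.Image:
    """Rescale a screenshot to the given height, keeping its aspect ratio."""
    w, h = img.size
    return img.resize((int(w * height / h), height))

def judge_trajectory(judge, trajectory, heights=(720, 1080)) -> bool:
    """Ensemble verdict: evaluate one trajectory at several input resolutions and
    take a majority vote (the judge API and resolution set are assumptions)."""
    votes = []
    for h in heights:
        frames = [resize_to_height(img, h) for img in trajectory["screenshots"]]
        verdict = judge.evaluate(task=trajectory["task"], frames=frames,
                                 actions=trajectory["actions"])  # hypothetical judge call
        votes.append(bool(verdict["success"]))
    counts = Counter(votes)
    return counts[True] > counts[False]
```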
Empirical Results
CODA is evaluated on four scientific software applications from the ScienceBoard benchmark, where it outperforms prior open-source GUI agents and establishes a new state of the art in this setting. In addition, the judge system achieves high precision and recall on both in-domain and out-of-domain benchmarks, ensuring reliable reward signals for RL.
Implementation Considerations
- Computational Requirements: Training leverages distributed RL and SFT, with the most intensive component being large-scale trajectory collection and judge model inference. The fixed Executor reduces retraining costs and data requirements.
- Scalability: The distributed VM system supports hundreds of concurrent environments, enabling efficient scaling to new domains and larger datasets.
- Modularity: The decoupled architecture allows for independent upgrades of the Planner and Executor, facilitating continual learning and adaptation.
- Limitations: The approach assumes the availability of a strong, generalizable Executor. In domains where grounding is fundamentally different, Executor retraining or adaptation may be required.
Theoretical and Practical Implications
CODA’s results provide strong evidence for the efficacy of decoupled, trainable Planner–Executor architectures in complex, real-world GUI automation. The specialist-to-generalist curriculum demonstrates that domain-specific RL, followed by cross-domain SFT, yields robust generalization without sacrificing execution precision. The reliance on automated, high-precision reward signals and distributed data generation addresses the data scarcity and evaluation bottlenecks prevalent in specialized domains.
Future Directions
Potential extensions include:
- Richer Multi-Modal Feedback: Incorporating additional sensory modalities (e.g., audio, haptic) for more comprehensive environment understanding.
- Broader Domain Coverage: Scaling to professional and industrial software with highly idiosyncratic interfaces.
- Continual and Lifelong Learning: Enabling the Planner to adapt online to novel tasks and environments, leveraging continual RL and meta-learning.
- Executor Adaptation: Investigating methods for dynamic Executor adaptation in domains with non-standard action spaces or grounding requirements.
Conclusion
CODA establishes a new state-of-the-art among open-source GUI agents for scientific computing, demonstrating that a trainable, decoupled Planner–Executor framework, supported by robust reward modeling and scalable data generation, can effectively address the dual challenges of long-horizon planning and precise execution. The architecture and training methodology provide a principled foundation for future research in adaptive, generalist computer use agents.