Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents (2504.00906v1)

Published 1 Apr 2025 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.

Summary

  • The paper proposes a compositional framework that distributes cognition between generalist and specialist models, using Mixture-of-Grounding for precise GUI element detection.
  • It employs Proactive Hierarchical Planning to break down long-horizon tasks into manageable sub-goals with dynamic, contextual refinements.
  • Empirical results demonstrate relative improvements of up to 52.8% over prior state-of-the-art methods, showcasing robust cross-platform GUI automation.

Agent S2 introduces a compositional framework designed to enhance the capabilities of agents interacting with graphical user interfaces (GUIs) for automating digital tasks (2504.00906). The framework addresses identified limitations in existing computer use agents, namely imprecise grounding of GUI elements, challenges in long-horizon task planning, and performance limitations associated with relying on single, monolithic generalist models. Agent S2 proposes a novel architecture that distributes cognitive responsibilities among specialized and generalized models.

Framework Architecture and Rationale

The core concept of Agent S2 is the distribution of cognitive load across a heterogeneous set of models. Instead of a single large model handling perception, planning, and action generation, Agent S2 employs a compositional structure. This involves:

  1. Generalist Models: Likely large multimodal models (LMMs) responsible for high-level understanding, task decomposition, and potentially dialogue or interaction with the user. They provide broad contextual awareness.
  2. Specialist Models: Smaller, fine-tuned models optimized for specific sub-tasks. Examples include models dedicated to precise GUI element detection and localization (grounding), low-level action prediction (e.g., click coordinates, keyboard input), or state assessment.

This decomposition allows each component to be optimized for its specific function, potentially leading to higher accuracy and efficiency. The generalist provides the overarching strategy, while specialists execute specific perceptual or motor functions with greater precision. The interaction between these models is orchestrated within the framework, enabling a cohesive agent behavior.
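To make the division of labor concrete, a minimal orchestration loop might look like the sketch below. The class names, method signatures, and the environment interface (observe, click) are hypothetical placeholders for illustration, not the actual Agent S2 API.

```python
# Hypothetical generalist-specialist loop; names and interfaces are illustrative only.
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass
class Observation:
    screenshot: bytes          # raw pixels of the current screen
    accessibility_tree: dict   # structured UI metadata, when the platform exposes it


class Generalist(Protocol):
    def next_subgoal(self, task: str, obs: Observation, history: list) -> str:
        """High-level reasoning: decide what to do next, or return 'DONE'."""


class Grounder(Protocol):
    def locate(self, subgoal: str, obs: Observation) -> Tuple[int, int]:
        """Precise perception: map a subgoal description to screen coordinates."""


def run_agent(task: str, env, planner: Generalist, grounder: Grounder, max_steps: int = 50) -> list:
    """Alternate between broad planning (generalist) and precise grounding (specialist)."""
    history: list = []
    for _ in range(max_steps):
        obs = env.observe()                                 # capture the current GUI state
        subgoal = planner.next_subgoal(task, obs, history)  # generalist chooses the next step
        if subgoal == "DONE":
            break
        x, y = grounder.locate(subgoal, obs)                # specialist resolves the target
        env.click(x, y)                                     # execute the low-level action
        history.append((subgoal, (x, y)))
    return history
```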

Mixture-of-Grounding (MoG)

To address the challenge of imprecise GUI element localization, Agent S2 incorporates a technique termed Mixture-of-Grounding (MoG). Precise grounding is critical for accurately identifying and interacting with specific UI elements (buttons, text fields, icons) mentioned in natural language instructions or inferred during planning. MoG likely combines outputs from multiple grounding approaches or models. This could involve:

  • Visual Grounding Models: Utilizing object detection models or visual LLMs trained to map textual descriptions to coordinates or bounding boxes on the screen capture.
  • Accessibility Tree Information: Leveraging underlying accessibility data (when available) which provides structured information about UI elements, their properties, and relationships.
  • OCR Integration: Employing Optical Character Recognition to locate text elements directly.

The "mixture" aspect suggests an ensemble or fusion mechanism. This could involve confidence scoring, weighted averaging, or a meta-model deciding which grounding source is most reliable for a given element or context. This approach aims to improve robustness and accuracy compared to relying on a single grounding method, which might fail under specific UI layouts or element types.

Proactive Hierarchical Planning (PHP)

For long-horizon tasks, Agent S2 utilizes Proactive Hierarchical Planning (PHP). This mechanism contrasts with purely reactive or single-level planning methods. PHP operates by:

  1. Hierarchical Decomposition: Breaking down high-level goals (e.g., "book a flight") into sequences of sub-goals or intermediate steps (e.g., navigate to website, enter departure city, select dates, etc.).
  2. Multi-Temporal Scales: Planning occurs at different levels of abstraction and temporal granularity. High-level plans guide the overall strategy, while low-level plans dictate specific sequences of actions (clicks, types).
  3. Proactive Refinement: The "proactive" nature implies that the agent anticipates future states and adjusts plans based on incoming observations (changes in the GUI state). It doesn't just react to the current screen but actively refines upcoming steps in the hierarchy based on the evolving interaction context. This allows the agent to correct course, handle unexpected UI changes, or optimize action sequences dynamically.

PHP likely involves maintaining a plan representation (e.g., a task graph or sequence) that is continuously updated by the generalist model based on feedback from the environment and specialist model outputs (e.g., grounding results).
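A minimal way to represent such a continuously revised hierarchical plan is sketched below. The Plan structure, the validity check, and the replan callable are illustrative assumptions; in practice the validity judgment and replanning would come from the generalist model.

```python
# Minimal sketch of proactive hierarchical plan refinement; data structures are assumed.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    goal: str              # high-level objective, e.g. "book a flight"
    subgoals: List[str]    # coarse steps at the upper temporal scale
    current: int = 0       # index of the subgoal currently being executed


def refine_plan(
    plan: Plan,
    observation: str,
    plan_still_valid: Callable[[str, str], bool],   # stand-in for a generalist judgment
    replan: Callable[[str, str], List[str]],        # stand-in for generalist replanning
) -> Plan:
    """Revise the *remaining* subgoals when the observed GUI state invalidates the plan."""
    if not plan_still_valid(plan.subgoals[plan.current], observation):
        # Proactive step: regenerate the rest of the hierarchy instead of
        # continuing to execute a stale sequence of actions.
        plan.subgoals[plan.current:] = replan(plan.goal, observation)
    return plan


# Example: an unexpected login dialog appears, so the remaining steps are regenerated.
plan = Plan(goal="book a flight",
            subgoals=["open airline site", "enter departure city", "select dates"],
            current=1)
plan = refine_plan(
    plan,
    observation="a login dialog is blocking the page",
    plan_still_valid=lambda subgoal, obs: "blocking" not in obs,   # toy heuristic
    replan=lambda goal, obs: ["dismiss login dialog", "enter departure city", "select dates"],
)
print(plan.subgoals)
# ['open airline site', 'dismiss login dialog', 'enter departure city', 'select dates']
```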

Implementation and Evaluation

Agent S2's implementation involves integrating multiple distinct AI models. The generalist could be a state-of-the-art LMM (like GPT-4, Claude, or Gemini variants), while specialists might be smaller vision models (e.g., variants of DETR, YOLO, or custom VLM heads) or fine-tuned LLMs. The interaction likely relies on structured input/output formats (e.g., JSON) passed between components.
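For example, the hand-off between the planner and a grounding specialist could be serialized as a small structured record like the one below; this schema is an assumption for illustration and is not taken from the released code.

```python
# Hypothetical planner-to-grounder hand-off record; the schema is illustrative only.
import json

message = {
    "subgoal": "click the 'Compose' button",
    "grounding": {
        "source": "accessibility_tree",   # which grounding route produced the result
        "bbox": [24, 180, 140, 220],      # element bounding box in screen pixels
        "confidence": 0.93,
    },
    "action": {"type": "click", "x": 82, "y": 200},
}
print(json.dumps(message, indent=2))
```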

Evaluation results demonstrate significant performance improvements over existing baselines on multiple benchmarks:

  • OSWorld: Agent S2 achieved relative improvements of 18.9% on the 15-step evaluation and 32.7% on the 50-step evaluation compared to leading agents like Claude Computer Use and UI-TARS. This highlights its effectiveness in handling longer, more complex tasks.
  • WindowsAgentArena: A relative improvement of 52.8% over previous SOTA methods was reported, showcasing strong generalization to the Windows operating system environment.
  • AndroidWorld: Similarly, a 16.52% relative improvement was achieved on this mobile benchmark, indicating cross-platform applicability.

These results support the claim that the compositional generalist-specialist approach, combined with MoG and PHP, yields substantial gains in task success rates and robustness for GUI-based automation across different operating systems and applications. The code release associated with the paper facilitates replication and further development based on this framework.

Conclusion

Agent S2 presents a compositional framework for computer use agents that leverages the strengths of both generalist and specialist models to overcome limitations in grounding, planning, and overall task execution performance. The introduction of Mixture-of-Grounding and Proactive Hierarchical Planning provides specific mechanisms to enhance GUI interaction precision and long-horizon task capabilities. The strong empirical results across diverse benchmarks establish Agent S2 as a state-of-the-art approach in GUI automation, offering a promising direction for developing more capable and reliable digital agents.
