Agentic Crawling Framework: Agent S
- The paper demonstrates that experience-augmented hierarchical planning, combined with the fusion of external knowledge and internal narrative memory, significantly boosts task success rates.
- An Agentic Crawling Framework is defined as a system of autonomous LLM-powered agents that interact with GUIs using dual-modal perception and constrained, language-driven actions.
- Rigorous evaluation on the OSWorld and WindowsAgentArena benchmarks underscores the framework’s adaptability, safety mechanisms, and cross-platform applicability.
An Agentic Crawling Framework is defined as a system in which autonomous agents, powered by LLMs and multimodal capabilities, interact with computers and their graphical user interfaces (GUIs) much as a human would—by observing, planning, acting, learning from both external knowledge and internal experience, and adapting to dynamic, heterogeneous environments. The Agent S system exemplifies this paradigm, enabling automated execution of diverse, multi-step computer tasks through a combination of experience-augmented hierarchical planning, dual-modal perception, constrained language-driven action spaces, and continuously updated episodic and narrative memory modules.
1. System Architecture and Component Design
Agent S consists of three primary architectural components, each addressing major challenges in agentic automation:
- Experience-Augmented Hierarchical Planning: The planning module decomposes a complex user task $T_u$ into a sequence of subtasks by leveraging both external web knowledge and accumulated internal experience. Specifically, the system forms an observation-aware query $Q$ from $T_u$ and the current observation $o$ (where $o$ includes a screenshot and accessibility tree) and fuses it with external and internal sources:

  $$P = \mathrm{MLLM}\big(Q,\ M_n(Q),\ K_{web}\big)$$

  Here, $Q$ queries the narrative memory $M_n$, $K_{web}$ denotes retrieved web content, and MLLM is a multimodal LLM fusion engine. The resulting plan $P$ grounds the subtask queue $\langle s_1, \dots, s_n \rangle$, each subtask $s_i$ accompanied by contextual data $C_{s_i}$.
- Continuous Memory Update: Agent S maintains two parallel memory systems:
- Narrative Memory ($M_n$) stores summaries of completed tasks as high-level narratives.
- Episodic Memory ($M_e$) records detailed traces of subtask-execution trajectories.
- Both memories are updated through self-supervised exploration and reflective self-evaluation, ensuring that planning and subtask policies improve over time and across tasks.
- Agent–Computer Interface (ACI): The interface provides a dual input consisting of:
- A screenshot for global perception.
- An accessibility tree—augmented with OCR-based text extraction—for precise, language-granular grounding of GUI elements (each with a unique identifier).
- Actions are constrained to a small, discrete set (e.g., `click(element_id)`, `type(text, element_id)`, `hotkey(combo)`). This bounded command set allows LLMs to reason about, and execute, semantically meaningful yet verifiable GUI operations, mitigating the risks of ungrounded code synthesis or unsafe API invocation. A minimal sketch of such an action space follows this list.
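To make the bounded action space concrete, the following is a minimal illustrative sketch, not the repository’s actual API: the class names, the `UIElement` fields, and the use of `pyautogui` as an execution backend are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union

import pyautogui  # one plausible desktop-GUI execution backend (assumed here)


@dataclass
class UIElement:
    element_id: str   # unique id assigned when parsing the accessibility tree
    name: str         # accessible name or OCR-extracted text
    center: tuple     # (x, y) screen coordinates of the element's center


# The three bounded action primitives described above.
@dataclass
class Click:
    element_id: str


@dataclass
class Type:
    text: str
    element_id: str


@dataclass
class Hotkey:
    combo: str        # e.g. "ctrl+s"


Action = Union[Click, Type, Hotkey]


def execute(action: Action, tree: Dict[str, UIElement]) -> None:
    """Map a bounded language primitive onto concrete GUI operations."""
    if isinstance(action, Click):
        x, y = tree[action.element_id].center
        pyautogui.click(x, y)
    elif isinstance(action, Type):
        x, y = tree[action.element_id].center
        pyautogui.click(x, y)          # focus the target element first
        pyautogui.write(action.text)
    elif isinstance(action, Hotkey):
        pyautogui.hotkey(*action.combo.split("+"))
```

Because every primitive names an element from the accessibility tree rather than raw coordinates or arbitrary code, the agent’s outputs remain verifiable against the current GUI state.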
2. Planning, Memory, and Reasoning Workflow
The workflow supporting autonomous task execution in Agent S unfolds hierarchically:
- Task Decomposition: Upon receiving a new user task, the Manager invokes an LLM (informed by current GUI state) to generate an explicit query. From this, the system simultaneously retrieves:
- External domain knowledge (via search engines such as Perplexica).
- Internal narrative memory summaries relevant to the task context.
- Knowledge Fusion and Subtask Planning: Extracted knowledge contexts are fused into a topologically sorted subtask queue with contextual augmentations.
- Subtask Execution and Memory Retrieval: For each subtask $s_i$ with context $C_{s_i}$, the Worker retrieves relevant experience from episodic memory via:

  $$E_{s_i} = M_e(s_i,\ C_{s_i})$$

  This supports in-context retrieval of prior subtask trajectories. A "trajectory reflector" component structures the agent’s reasoning (akin to chain-of-thought), producing grounded, step-wise action decisions.
- Replanning and Error Handling: If execution fails at any step, a `FAIL` signal triggers replanning, which incorporates the newly encountered experience into the memory update modules.
This hierarchy—continuously refined through real and self-supervised experience—enables the framework to address the long-horizon dependencies, dynamic interface variations, and domain gaps inherent in realistic desktop automation.
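To make this workflow concrete, here is a minimal sketch of the hierarchical loop under assumed interfaces; `mllm`, `web_search`, `narrative_memory`, `episodic_memory`, and `execute` are placeholder objects, not the paper’s or repository’s actual components.

```python
from typing import List, Tuple


def run_task(task: str, observe, mllm, web_search,
             narrative_memory, episodic_memory, execute,
             max_replans: int = 3) -> bool:
    """Illustrative Manager/Worker loop: plan subtasks from fused knowledge,
    execute each with episodic-memory context, and replan on failure."""
    for _ in range(max_replans):
        obs = observe()                              # screenshot + accessibility tree
        query = mllm.formulate_query(task, obs)      # observation-aware query

        # Fuse external web knowledge with internal narrative experience.
        k_web = web_search(query)
        e_narrative = narrative_memory.retrieve(query)
        subtasks: List[Tuple[str, str]] = mllm.plan(task, obs, k_web, e_narrative)

        failed = False
        for subtask, context in subtasks:            # ordered (subtask, context) queue
            e_episodic = episodic_memory.retrieve(subtask, context)
            status = execute(subtask, context, e_episodic)   # Worker-level actions
            episodic_memory.update(subtask, context, status)
            if status == "FAIL":                     # trigger replanning
                failed = True
                break

        narrative_memory.update(task, "FAIL" if failed else "DONE")
        if not failed:
            return True
    return False
```

The key design point this sketch tries to capture is that both memories are written on every attempt, so failed trajectories still contribute experience to future planning.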
3. Perception, Action Representation, and Safety Mechanisms
Effective grounding and control in agentic crawling is realized via the following ACI features:
- Perception: The accessibility tree and screenshot inputs address both visual context and structural GUI detail. Each GUI element is tagged and accessible at high granularity, aided by OCR when needed.
- Action Representation: Actions are issued as constrained, interpretable language primitives such as `click(element_id)`, `type(text, element_id)`, and `hotkey(combo)`.
- Each action is grounded and immediately validated within the GUI, supporting incremental feedback for both learning and safety.
- Safety: Restricting the action space prevents code-injection vulnerabilities and limits the effect of any erroneous reasoning from the MLLM. This design, coupled with real-time action validation, supports robust and reliable interaction with GUI-based systems.
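A hedged sketch of the kind of pre-execution check this design implies is shown below: a whitelist over the primitive set plus grounding of element identifiers against the current accessibility tree. The function and argument names are illustrative, not the implementation’s.

```python
ALLOWED_ACTIONS = {"click", "type", "hotkey"}


def validate(action_name: str, args: dict, accessibility_tree: dict) -> None:
    """Reject anything outside the bounded primitive set, and any action
    that references a GUI element absent from the current tree."""
    if action_name not in ALLOWED_ACTIONS:
        raise ValueError(f"Action '{action_name}' is outside the allowed set")
    if action_name in ("click", "type"):
        element_id = args.get("element_id")
        if element_id not in accessibility_tree:
            raise ValueError(f"Unknown or stale element_id: {element_id}")
```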
4. Empirical Evaluation and Performance
Agent S was systematically evaluated on two comprehensive benchmarks:
- OSWorld Benchmark: Contains 369 real computer tasks encompassing operating system manipulation, office work, professional software, and workflow automation. Using GPT-4o, Agent S achieved a 20.58% overall success rate, representing an 83.6% relative improvement over a baseline (success rate 11.21%). Gains were especially evident in knowledge-intensive and multi-step domains (e.g., "Daily" and "Professional" categories).
Ablation studies demonstrated that removing components (web knowledge, narrative memory, or episodic memory) produced significant drops in success rate, confirming the synergy between retrieval and experience augmentation.
- WindowsAgentArena Benchmark: Without any system-specific adaptation, Agent S generalized to a new Windows OS evaluation suite, improving the baseline agent’s performance from 13.3% (NAVI agent) to 18.2% (Agent S with GPT-4o).
These results demonstrate not only the efficacy but also the cross-platform robustness of the experience-augmented, memory-centric approach to agentic crawling.
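As a sanity check on the reported figure, the relative improvement on OSWorld follows directly from the two success rates:

$$\frac{20.58 - 11.21}{11.21} \approx 0.836 \quad\Rightarrow\quad \approx 83.6\%\ \text{relative improvement over the baseline.}$$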
5. Generalizability, Limitations, and Future Directions
Agent S’s architecture generalizes to diverse operating systems by abstracting away GUI specifics through accessibility-based perception and restricting the action repertoire to universal language primitives. However, several open challenges remain:
- Efficiency: While Agent S achieves strong success rates relative to prior methods, both the number of executed steps and the wall clock time for complex tasks remain high. The authors propose future research into shortest-path and Pareto-optimal navigation strategies, balancing completion time and execution accuracy.
- Model Adaptation: Extending functionality to smaller, open-source LLMs requires task-specific fine-tuning for GUI domains, providing opportunity for broader accessibility and deployment.
- Grounding Errors: Error analysis highlights the need for refinement in the ACI grounding mechanics, particularly in scenarios involving dense or rapidly changing GUI layouts.
Future research is directed toward optimizing plan efficiency, enhancing support for smaller LLMs, and minimizing planning and grounding errors in the perception-action loop.
6. Implementation, Code, and Practical Usage
Agent S is fully open-source and available at https://github.com/simular-ai/Agent-S. The reference implementation provides:
- Full experience-augmented hierarchical planning and memory update modules.
- Agent–Computer Interface components supporting screenshot and accessibility tree integration, along with precise, bounded action primitives.
- Task execution logic handling both high-level planning and atomic GUI manipulation.
- Evaluation scripts for the OSWorld and WindowsAgentArena benchmarks.
Researchers and practitioners can directly utilize, extend, or benchmark Agent S, adapting it for further research into automated GUI interaction, desktop control, and related agentic crawling applications.
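As a rough orientation only, a benchmark-style driver loop for such an agent typically has the shape sketched below; the `agent` and `env` objects and their methods are placeholders, not the repository’s documented interface.

```python
def evaluate(agent, env, tasks, max_steps: int = 30) -> float:
    """Hypothetical evaluation harness: run each task for a bounded number of
    steps and report the overall success rate."""
    successes = 0
    for task in tasks:
        obs = env.reset(task)                  # screenshot + accessibility tree
        for _ in range(max_steps):
            action = agent.predict(task, obs)  # constrained language primitive
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / len(tasks)
```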
In sum, the Agentic Crawling Framework demonstrated by Agent S is characterized by hierarchical, experience-driven task decomposition, dual-modal and element-grounded perception, memory-augmented planning, and tightly constrained, interpretable language-based GUI control. This results in state-of-the-art automation capabilities for complex, real-world computer tasks, with empirical validation on diverse and challenging benchmarks and a flexible blueprint for further research in generalized agentic automation.