AgentNet: Multi-OS Desktop Use Dataset
- AgentNet is a multi-OS desktop use dataset comprising 22,625 trajectories across 100+ applications and 200+ websites, capturing diverse real-world tasks.
- It features a robust annotation infrastructure and a scalable pipeline that transforms raw interactions into concise state-action pairs with multi-level chain-of-thought reasoning.
- The dataset underpins benchmark evaluations, enabling state-of-the-art computer-use agents to generalize effectively across various operating systems and digital environments.
AgentNet is the first large-scale, multi-operating-system desktop computer-use dataset, designed to advance the development and evaluation of general-purpose computer-use agents (CUAs). AgentNet provides high-fidelity, human-annotated demonstrations of diverse real-world computer tasks, along with a robust annotation infrastructure and a scalable transformation pipeline that integrates multi-level Chain-of-Thought reasoning. As an open-source resource, AgentNet underpins the training, benchmarking, and analysis of state-of-the-art vision-language agents across a broad spectrum of digital environments, surpassing all previously open datasets in scale, modality coverage, and operating system diversity (Wang et al., 12 Aug 2025).
1. Composition and Multi-Platform Scope
AgentNet comprises 22,625 trajectories of humans demonstrating the execution of computer-use tasks across more than 100 desktop applications and over 200 distinct websites. Crucially, the dataset spans three major operating systems—Windows, macOS, and Ubuntu—capturing a rich variety of application modalities, system UI conventions, and interaction paradigms. This breadth ensures that agents trained on AgentNet are exposed to heterogeneous GUI layouts, input mechanisms, and application-specific workflows, enabling cross-domain generalization and robustness.
OS Platforms | Applications | Websites | Task Trajectories |
---|---|---|---|
Windows, macOS, Ubuntu | 100+ | 200+ | 22,625 |
The data’s coverage includes productivity tools, browsers, file managers, code editors, communication platforms, system utilities, and a broad array of websites—not limited to simple or synthetic environments.
2. Annotation Infrastructure and Data Collection
Data was collected via the AgentNet Tool, a user-facing application operating natively on each of the supported platforms. It records:
- Screen-capture videos for visual context.
- Low-level machine interaction traces such as precise mouse and keyboard events, utilizing frameworks like DuckTrack.
- Accessibility trees (Axtree), capturing structured metadata for on-screen UI elements.
The AgentNet Tool is designed to operate with minimal user disruption, simultaneously enabling real-world task capture and post-hoc annotation. Annotators can review and optionally edit early captures, and correctness checks can be performed by humans or by LLMs at configurable strictness. The annotation workflow implements multi-level privacy protection, combining anonymization, human oversight, and GPT-based checks to minimize leakage of personally identifiable information.
3. Data Transformation and Reflective Chain-of-Thought Reasoning
AgentNet transitions raw demonstration streams into a structured task representation, suitable for model training:
- Each trajectory is decomposed into a sequence of compact state-action pairs (s_i, a_i), where s_i is a keyframe (the screenshot immediately before the action) and a_i is a compressed, semantically meaningful abstraction of the human action.
- Low-level event traces are algorithmically compressed: sequences of fine-grained mouse/key events are merged into semantically coherent action primitives (e.g., mouse movements and clicks → "Click Submit"; consecutive keystrokes → "Type 'hello'").
- State-action alignment is precise: for each action a_i, the state s_i is selected by backtracking to a frame that strictly precedes the action, thereby eliminating future information leakage.
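The event-compression step above can be sketched as a simple folding pass over the raw input trace. The event schema and action format here are illustrative assumptions, not the dataset's actual representation:

```python
def compress_events(events):
    """Merge fine-grained input events into semantic action primitives.

    Illustrative sketch: consecutive keystrokes fold into one "type" action,
    and mouse movements preceding a click collapse into a single "click".
    """
    actions = []
    i = 0
    while i < len(events):
        ev = events[i]
        if ev["type"] == "key":
            # Fold consecutive keystrokes into a single "type" action.
            text = ""
            while i < len(events) and events[i]["type"] == "key":
                text += events[i]["char"]
                i += 1
            actions.append({"action": "type", "text": text})
        elif ev["type"] == "mouse_move":
            # Bare movement carries no semantic intent in this sketch.
            i += 1
        elif ev["type"] == "click":
            actions.append({"action": "click", "x": ev["x"], "y": ev["y"]})
            i += 1
        else:
            i += 1
    return actions

trace = [
    {"type": "mouse_move", "x": 10, "y": 10},
    {"type": "mouse_move", "x": 200, "y": 80},
    {"type": "click", "x": 200, "y": 80},
    {"type": "key", "char": "h"},
    {"type": "key", "char": "i"},
]
print(compress_events(trace))
```

Five raw events reduce to two semantic actions, a "click" at (200, 80) followed by typing "hi", mirroring the "mouse movements and clicks → Click", "consecutive keystrokes → Type" merging described above.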
A key innovation is the integration of reflective, long Chain-of-Thought (CoT) reasoning in the form of a three-level reasoning trace:
- Level 3 (L3): Contextual observations derived from the screenshot.
- Level 2 (L2): Reflective reasoning over state transitions, preceding actions, and possible errors.
- Level 1 (L1): Succinct final action decision.
The automated CoT synthesis pipeline composes these layers using "generator", "reflector", and "summarizer" modules, demonstrably improving learning as the dataset scales.
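The three-stage composition might be sketched as follows; the string-based interfaces are hypothetical stand-ins for what would in practice be LLM calls conditioned on the screenshot and action history:

```python
def generator(screenshot_description):
    # L3: contextual observations derived from the screenshot.
    return f"Observation: {screenshot_description}"

def reflector(observation, prev_action, state_change):
    # L2: reflective reasoning over state transitions and possible errors.
    return (f"{observation}. After '{prev_action}' the state changed to "
            f"'{state_change}', so the next step should build on it.")

def summarizer(reflection, action):
    # L1: succinct final action decision.
    return f"{reflection} Decision: {action}"

cot = summarizer(
    reflector(generator("login form visible"), "click Email field", "field focused"),
    "type user@example.com",
)
print(cot)
```

The key design point is the ordering: observations (L3) feed reflection (L2), which feeds the final decision (L1), so the trace reads from perception toward action.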
4. Benchmarks and Performance Metrics
The OpenCUA-32B agent model, trained on AgentNet, was evaluated on the OSWorld-Verified benchmark (online) and the AgentNetBench (offline proxy):
- Success Rate: 34.8% (100-step budget) on OSWorld-Verified, setting a new SOTA among open-source CUAs and surpassing the proprietary OpenAI CUA (GPT-4o).
- Effect of Test-time Computation: Higher Pass@k budgets (multiple parallel candidate rollouts) further boost success rates, demonstrating effective utilization of data scale and reasoning at inference.
- Generalization: Trained agents generalize across daily use, professional, and system tasks, as well as across operating system boundaries.
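Pass@k is commonly estimated with an unbiased combinatorial estimator; the sketch below assumes that standard formulation (the paper's exact evaluation protocol may differ):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n rollouts, of which c
    succeeded, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 rollouts of which 3 succeed, a larger k budget raises the
# expected success rate:
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3
print(round(pass_at_k(10, 3, 5), 3))  # → 0.917
```

This illustrates the test-time computation effect noted above: holding per-rollout quality fixed, allocating more parallel candidates monotonically increases the chance that one succeeds.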
Empirical ablation confirms that reflective CoT reasoning and data scale are both critical factors for high performance and robustness to long-horizon, error-prone workflows.
5. Data Accessibility and Open Source Ecosystem
AgentNet, along with the complete OpenCUA infrastructure, is open-sourced. Released components include:
- The full AgentNet dataset.
- The AgentNet Tool for multi-platform scalable data collection.
- The processing pipeline and AgentNetBench evaluation suite.
- Pretrained models and all supporting code.
This facilitates transparent, reproducible research and allows the community to extend, benchmark, and analyze general-purpose CUAs—the first time such a resource has matched commercial counterparts in breadth and depth.
6. Technical Detailing and Notation
The core dataset formation process reduces each raw demonstration to a trajectory of state-action pairs:

τ = {(s_1, a_1), (s_2, a_2), …, (s_T, a_T)}

where s_i is the pre-action keyframe and a_i the compressed action. The pipeline further augments each pair (s_i, a_i) with L3 → L2 → L1 Chain-of-Thought context.
Element | Format | Purpose |
---|---|---|
s_i | High-resolution frame | Pre-action context; input to VLM |
a_i | Reduced action label | Semantically precise intent |
Reasoning trace | L3, L2, L1 text | Multi-level reflection to support VLM generation |
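The layout in the table above might be mirrored by a simple per-step record; all field names here are illustrative assumptions rather than the released schema:

```python
from dataclasses import dataclass

@dataclass
class AgentNetStep:
    """One state-action training example (field names are illustrative)."""
    screenshot: bytes  # s_i: high-resolution pre-action frame
    action: str        # a_i: reduced, semantically precise action label
    cot_l3: str        # contextual observations from the screenshot
    cot_l2: str        # reflective reasoning over the state transition
    cot_l1: str        # succinct final action decision

step = AgentNetStep(
    screenshot=b"\x89PNG...",
    action="Click Submit",
    cot_l3="The form is fully filled in",
    cot_l2="Typing finished without errors, so submission can proceed",
    cot_l1="Click Submit",
)
```

A VLM trained on such records consumes the screenshot plus prior context and is supervised to emit the reasoning trace followed by the action.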
Collection utilities integrate tools such as DuckTrack for event recording, OBS Studio for frame capture, OpenAdapt for streamlined data handling, and Axtree introspection for UI elements.
7. Significance and Future Directions
AgentNet provides a scalable, high-fidelity foundation for training and evaluating generalist computer agents, opening the study of agentic reasoning, robustness, and capability in real-world use cases at a level of detail and coverage that previously only proprietary efforts possessed. The combination of diverse, realistic interactions, multi-modal annotation, and reflective reasoning positions it as an essential resource for benchmarking, safety research, and the extension of agentic models to new domains and applications. As the dataset continues to grow and additional annotations (e.g., error states, user corrections, and complex workflows) are integrated, AgentNet is expected to underpin ongoing advances in open, safe, and general-use agentic intelligence (Wang et al., 12 Aug 2025).