OpenHands CodeAct + Browsing

Updated 15 June 2026
  • OpenHands CodeAct + Browsing is a paradigm that merges code-level automation and browser interaction to facilitate seamless digital workflows.
  • It employs a multi-agent coordinator with sandboxed runtimes and unified event streams to ensure secure, reliable, and adaptive task execution.
  • Empirical evaluations show that integrating code, browsing, and persistent organizational memory leads to superior performance on complex, multi-step tasks.

OpenHands CodeAct + Browsing is a paradigm for constructing AI agents capable of both code-level automation (code writing, file I/O, command-line execution) and robust web interaction (navigation, extraction, manipulation) within unified, extensible frameworks. Its core technical lineage derives from the OpenHands platform, operationalized through the CodeAct loop and multimodal browsing agents, and is now influenced by architectural advances such as cotomi Act's adaptive scaffolding, Recon-Act’s tool evolution, and the integration protocols defined by the Agent Data Protocol (ADP). These systems form the foundation for current research in generalist digital workers and autonomous agents that natively bridge software development and real-world browser-based workflows.

1. Architectural Components and Agent Interfaces

OpenHands CodeAct + Browsing incorporates a multilayered architecture comprising:

  • CodeAct Engine: An LLM-driven agent that generates code and issues shell (CmdRunAction) and Python execution (IPythonRunCellAction) actions, maintaining state across the event stream. Each action is taken from the LLM completion, parsed, executed in a sandbox (SSH container or Docker), and fed back into the agent context (Wang et al., 2024).
  • Browser Agent: Invoked via BrowserInteractiveAction, this agent executes discrete browser primitives—goto, click, fill, extract, accessibility_tree, screenshot—exposing Playwright/BrowserGym APIs to the LLM (Wang et al., 2024, Soni et al., 3 Jun 2025).
  • Multi-Agent Coordinator: Routes subtasks (e.g., deep navigation) from the main CodeAct agent to specialized BrowsingAgent instances, merging observation results into the main execution history (Wang et al., 2024).
  • Unified Event-Stream State: All observed actions and environment feedback are maintained as a shared history, supporting both code and browser trajectories (Soni et al., 3 Jun 2025).
  • Sandboxed Runtime: Secure containers ensure isolation, resource constraints, and controlled network access for code execution, preventing file system or network escape (Wang et al., 2024).

This design maps naturally onto the data-centric Agent Data Protocol (ADP), which specifies an interleaved sequence of Action and Observation objects, allowing seamless integration of code and browsing modalities for multi-domain fine-tuning (Song et al., 28 Oct 2025).
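The shared history of interleaved actions and observations can be illustrated with a small sketch. The action kind names are the ones mentioned above, but the `Action`, `Observation`, and `EventStream` classes here are hypothetical stand-ins written for illustration, not the OpenHands classes themselves.

```python
from dataclasses import dataclass, field
from typing import Literal, Union


@dataclass
class Action:
    # Which tool the agent invoked; names mirror the action types in the text.
    kind: Literal["CmdRunAction", "IPythonRunCellAction", "BrowserInteractiveAction"]
    payload: str  # shell command, Python source, or browser primitive call


@dataclass
class Observation:
    source: str   # kind of the action that produced this observation
    content: str  # stdout, cell output, accessibility-tree text, ...


Event = Union[Action, Observation]


@dataclass
class EventStream:
    """Hypothetical shared history for both code and browser steps."""
    events: list[Event] = field(default_factory=list)

    def record(self, action: Action, observation: Observation) -> None:
        self.events.append(action)
        self.events.append(observation)

    def alternates(self) -> bool:
        """True if events strictly alternate Action, Observation, Action, ..."""
        return all(
            isinstance(e, Action if i % 2 == 0 else Observation)
            for i, e in enumerate(self.events)
        )


stream = EventStream()
stream.record(Action("CmdRunAction", "ls"), Observation("CmdRunAction", "README.md"))
stream.record(
    Action("BrowserInteractiveAction", "goto('https://example.com')"),
    Observation("BrowserInteractiveAction", "RootWebArea 'Example Domain'"),
)
```

Keeping code and browser steps in one list is what makes the later conversion to an alternating action/observation format straightforward.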

2. Algorithmic Execution and Interaction Loops

The CodeAct + Browsing loop is characterized by a generate–execute–evaluate pattern:

  • Prompt Construction: The system synthesizes a prompt from prior conversation, the system instruction, and the most recent observations (Wang et al., 2024). For “CodeAct” tasks, this includes available files, test scripts, and prior stdout/stderr; for browsing, the prompt contains the accessibility tree, possible element IDs, and context snippets (Soni et al., 3 Jun 2025).
  • LLM-Driven Action Selection: The backbone LLM emits either natural language (reflective planning) or a JSON-encoded tool action. Browser actions and code execution are handled uniformly as tool calls (Soni et al., 3 Jun 2025).
  • Execution: Each action is executed in the appropriate sandbox—shell, Python REPL, or browser—returning structured observations (text outputs, DOM trees, screenshots) back to the agent (Soni et al., 3 Jun 2025, Wang et al., 2024).
  • Iterative Refinement: On test or observation failure, error messages are used as additional context for subsequent LLM completions, driving test-driven code and robust browser interactions (Soni et al., 3 Jun 2025).
  • Orchestration and Decomposition: Multi-step, high-level goals are naturally decomposed either through chain-of-thought and ReAct prompt engineering, or via explicit best-of-N action selection and majority voting on candidates, leading to ∼4 pp higher task success in browser domains (Oyamada et al., 4 May 2026).
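The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions: `stub_policy` and `stub_execute` are hypothetical stand-ins for the LLM and the sandbox, the JSON action shape (`tool`, `code`) is invented for the example, and the use of `eval` is only a placeholder for real sandboxed execution.

```python
import json
from collections import Counter
from typing import Callable


def majority_vote(candidates: list[str]) -> str:
    """Return the most frequent candidate; ties go to the earliest seen."""
    return Counter(candidates).most_common(1)[0][0]


def run_episode(
    policy: Callable[[str], str],
    execute: Callable[[dict], str],
    task: str,
    n_candidates: int = 3,
    max_steps: int = 10,
) -> list[str]:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(history)                       # prompt construction
        candidates = [policy(prompt) for _ in range(n_candidates)]
        action = json.loads(majority_vote(candidates))    # best-of-N selection
        if action["tool"] == "finish":
            break
        observation = execute(action)                     # execution
        history.append(f"ACTION: {json.dumps(action)}")
        history.append(f"OBSERVATION: {observation}")     # fed back for refinement
    return history


def stub_policy(prompt: str) -> str:
    """Hypothetical policy: retries with corrected code after seeing an error."""
    if "OBSERVATION: 4" in prompt:
        return json.dumps({"tool": "finish"})
    if "OBSERVATION: error" in prompt:
        return json.dumps({"tool": "python", "code": "2 + 2"})
    return json.dumps({"tool": "python", "code": "2 +"})


def stub_execute(action: dict) -> str:
    """Placeholder executor; a real system would run this in a sandbox."""
    try:
        return str(eval(action["code"], {"__builtins__": {}}))
    except SyntaxError as exc:
        return f"error: {exc.msg}"


history = run_episode(stub_policy, stub_execute, "compute 2 + 2")
```

In this toy run the first action fails, the error text enters the history, and the next completion corrects it, which is the iterative-refinement behavior described in the list.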

Coarse-grained browser actions, such as scrollInto(element), fill_input, and extract_table, collapse multi-step navigation or data collection into single semantically meaningful steps, leveraging recent advances in abstraction level and context compression (Oyamada et al., 4 May 2026).
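As a rough illustration of how a coarse-grained action collapses several primitive operations into one agent step, consider the sketch below. `FakePage` and its methods are hypothetical and only count calls; the name `extract_table` is borrowed from the text, but this body is an assumption, not the cited system's implementation.

```python
class FakePage:
    """Hypothetical stand-in for a browser page; counts primitive calls."""

    def __init__(self, table: list[list[str]]) -> None:
        self._table = table
        self.primitive_calls = 0

    def row_count(self) -> int:
        self.primitive_calls += 1
        return len(self._table)

    def read_row(self, index: int) -> list[str]:
        self.primitive_calls += 1
        return list(self._table[index])


def extract_table(page: FakePage) -> list[list[str]]:
    """Coarse-grained action: one agent step wrapping many primitive calls."""
    return [page.read_row(i) for i in range(page.row_count())]


page = FakePage([["name", "price"], ["widget", "9.99"]])
rows = extract_table(page)  # a single step from the agent's point of view
```

Here the agent spends one step and receives one observation, while three primitive calls happen underneath, so the history it must carry stays shorter.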

3. Persistent Organizational Memory and Behavior-to-Knowledge Pipelines

A distinguishing capability of the current generation of agents is the automatic extraction of persistent organizational knowledge from passive behavior logs (Oyamada et al., 4 May 2026):

  • Passive Data Capture: The system collects click logs, DOM snapshots, tab switches, and screenshots, utilizing diff-compression to minimize storage/recall overhead.
  • Task Segmentation: Activity is segmented into episodes based on inactivity thresholds, which are then summarized into structured records containing title, step sequence, outcome, and category (e.g., “to-do,” “info-lookup”) via LLM-driven ETL.
  • Artifact Aggregation: Aggregated records populate three artifact types within a bidirectionally editable workspace:
    • Activity timeline (calendar, Gantt-style)
    • Task board (kanban view of tasks with id, status, priority, last_updated)
    • Wiki pages (free-form, LLM-distilled, and editable)
  • Approval and Synchronization: Knowledge artifacts feature a lightweight workflow where agent-proposed and user-edited changes appear as “Pending Changes,” requiring explicit acceptance before becoming queryable by the execution scaffold.
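The task segmentation step can be sketched as splitting a log wherever the gap between consecutive events exceeds an inactivity threshold. The five-minute default and the `(timestamp, label)` event shape are assumptions made for the example; the source does not specify either.

```python
from datetime import datetime, timedelta

LogEvent = tuple[datetime, str]


def segment_episodes(
    events: list[LogEvent], gap: timedelta = timedelta(minutes=5)
) -> list[list[LogEvent]]:
    """Group events into episodes, starting a new one after `gap` of inactivity."""
    episodes: list[list[LogEvent]] = []
    for ts, label in sorted(events):
        last_ts = episodes[-1][-1][0] if episodes else None
        if last_ts is not None and ts - last_ts <= gap:
            episodes[-1].append((ts, label))
        else:
            episodes.append([(ts, label)])
    return episodes


log = [
    (datetime(2026, 1, 5, 9, 0), "click: open ticket"),
    (datetime(2026, 1, 5, 9, 2), "tab switch: docs"),
    (datetime(2026, 1, 5, 9, 30), "click: search"),
    (datetime(2026, 1, 5, 9, 31), "click: result"),
]
episodes = segment_episodes(log)
```

Each resulting episode would then be handed to the summarization stage that produces the structured record (title, step sequence, outcome, category).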

This organizational memory increases long-horizon agent performance: task success on procedural domains rises by up to 10 pp as organizational knowledge coverage reaches 100%, and approaches 99% when site-specific scripts are available (Oyamada et al., 4 May 2026).

4. Evaluation Benchmarks and Empirical Results

OpenHands CodeAct + Browsing and its variants have participated in major evaluations, demonstrating leading or competitive performance across coding and web domains:

| Benchmark | Task Type | OpenHands/Variant Results | Human / Best Prior Agent |
| --- | --- | --- | --- |
| WebArena@179 (subset) | Browsing tasks | cotomi Act: 80.4% | 78.2% / OpAgent: 74.9% |
| WebArena@812 | Browsing tasks | cotomi Act: 75.7% | 71.6% |
| SWE-Bench Lite | Code tasks | OpenHands v1.8 (Claude-3.5): 26.0% | Aider: 26.3% |
| TheAgentCompany | Hybrid (code+web) | Sonnet-4: 33.14% | |

Controlled ablations demonstrate that bridging code and browsing tools in a single agent increases coverage of realistic workplace tasks, but sophisticated planning (factoring in organizational artifacts and best-of-N selection) is essential for high performance on long-horizon or multistep workflows. On TheAgentCompany, full-success rates remain low even for state-of-the-art agents: 30.3% with Gemini-2.5-Pro and 33.14% for OpenHands’ CodeAct agents with Sonnet-4, which indicates that integrating code and browsing pipelines for deep task automation remains a significant challenge (Xu et al., 2024, Soni et al., 3 Jun 2025).

5. Advances in Perception, Cognition, and Action: Recent Integrations

Recent systems such as SuperBrowser (Mostafa et al., 8 Jun 2026), Recon-Act (He et al., 25 Sep 2025), and cotomi Act (Oyamada et al., 4 May 2026) have advanced the field along several axes:

  • Vision-First Perception: SuperBrowser introduces an asynchronous vision bounding-box pipeline (object detection for candidate UI elements), DOM enrichment, and prefetching for efficient selection and reduced LLM context size, mirroring human search/scan patterns.
  • Cognitive Modules: The “three-role brain” (Orchestrator, Planner, Worker) separates high-level strategic planning from per-step action selection. The Planner generates nextSteps subgoals every N steps, leveraging a “Ledger” memory abstraction that retains only salient context (goal, subgoal queue, recent actions, key facts), running a six-phase eviction loop for prompt boundedness (Mostafa et al., 8 Jun 2026).
  • Action Execution: A three-tier click cascade (Chrome DevTools Protocol, Puppeteer, DOM.click()), with humanized Bezier mouse motion and chevron-aware snapping, affords robust interaction with complex interfaces and bot-detection avoidance.
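The three-tier click cascade can be read as an ordered fallback: try the first mechanism, and move to the next only if it fails. The sketch below assumes that reading. The tier names come from the text, while `ClickFailed`, `click_with_cascade`, and the stub click functions are hypothetical and do not call any real browser API.

```python
from typing import Callable, Sequence


class ClickFailed(Exception):
    """Raised by a tier when it cannot complete the click."""


def click_with_cascade(
    target: str, tiers: Sequence[tuple[str, Callable[[str], None]]]
) -> str:
    """Try each tier in order; return the name of the first one that succeeds."""
    errors = []
    for name, click in tiers:
        try:
            click(target)
            return name
        except ClickFailed as exc:
            errors.append(f"{name}: {exc}")
    raise ClickFailed("all tiers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for real browser drivers.
def cdp_click(target: str) -> None:
    raise ClickFailed("element covered by overlay")


def puppeteer_click(target: str) -> None:
    return None  # pretend the click landed


def dom_click(target: str) -> None:
    return None


TIERS = [
    ("Chrome DevTools Protocol", cdp_click),
    ("Puppeteer", puppeteer_click),
    ("DOM.click()", dom_click),
]
used = click_with_cascade("#submit", TIERS)
```

Returning the tier name lets the caller log which mechanism actually worked, which is useful when diagnosing flaky interfaces.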

These modules can be “dropped into” a CodeAct framework by adding an asynchronous vision API, a memory ledger preprocessor, and a scheduled planner loop tightly coupled to existing LLM-based tool-calling interfaces.
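A memory ledger preprocessor of this kind might look like the following sketch. It keeps the four categories of context named above and bounds the prompt with a simple fixed-size window on recent actions; this is a deliberately simplified assumption and does not reproduce the six-phase eviction loop, whose details are not given here. The `Ledger` class and its methods are hypothetical.

```python
from collections import deque


class Ledger:
    """Hypothetical bounded memory: goal, subgoal queue, recent actions, key facts."""

    def __init__(self, goal: str, max_recent: int = 3) -> None:
        self.goal = goal
        self.subgoals: deque[str] = deque()
        self.key_facts: list[str] = []
        # A deque with maxlen silently evicts the oldest entry when full.
        self.recent_actions: deque[str] = deque(maxlen=max_recent)

    def log_action(self, action: str) -> None:
        self.recent_actions.append(action)

    def render(self) -> str:
        """Build the bounded context block that would be prepended to a prompt."""
        lines = [f"GOAL: {self.goal}"]
        lines += [f"SUBGOAL: {s}" for s in self.subgoals]
        lines += [f"FACT: {f}" for f in self.key_facts]
        lines += [f"RECENT: {a}" for a in self.recent_actions]
        return "\n".join(lines)


ledger = Ledger(goal="export the monthly report")
ledger.subgoals.append("open the reports page")
ledger.key_facts.append("report id is 42")
for step in ["a1", "a2", "a3", "a4", "a5"]:
    ledger.log_action(step)
context = ledger.render()
```

Because the rendered block has a fixed upper size regardless of episode length, it can be placed in front of the existing tool-calling prompt without changing the rest of the loop.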

6. Data Integration and Model Training via Agent Data Protocol (ADP)

CodeAct + Browsing agents benefit from unified supervised training using ADP, which abstracts heterogeneous code and browsing interactions into a simple, three-field JSON schema: trajectory ID, alternating content steps (Actions and Observations), and free-form details (Song et al., 28 Oct 2025). This enables:

  • Universal Conversion: Both CodeAct (code editing/execution) and Browsing traces can be converted to ADP via minimal Python scripts, supporting schema validation and balanced sampling strategies.
  • End-to-End Fine-Tuning: Qwen2.5-7B and similar models fine-tuned on ADP-formatted, merged code-and-browsing datasets achieve absolute gains of roughly 15–25 pp on SWE-Bench, WebArena, and general agent benchmarks. For CodeAct, this yields results such as 20.4% vs. 2.8% (base) with the OpenHands CodeActAgent and 21.0% vs. 4.5% (base) on WebArena (Song et al., 28 Oct 2025).
  • Best Practices: ADP-driven pipelines facilitate modularity for new tools, QA for action/observation alternation, and scalability for 7–32B model training, with direct support for SFT/evaluation on all major agentic harnesses.

7. Synthesis and Prospects

OpenHands CodeAct + Browsing brings together generalist tool action spaces, dense integration of persistent knowledge, scalable multi-agent orchestration, and data-centric fine-tuning pipelines. Empirical findings indicate that augmenting agents with adaptive observation, best-of-N selection, compressed histories, and persistent organizational knowledge leads to superhuman or near-human performance on web automation tasks and improves transfer to realistic code–browsing hybrid domains. Persistent limitations relate to long-horizon cross-tool dependencies, dynamic or idiosyncratic UIs, and incomplete organizational memory. However, compositional additions such as the SuperBrowser memory ledger or Recon-Act’s closed-loop tool synthesis suggest a pathway to increasingly autonomous, context-persistent digital workers that natively bridge code, browsing, and long-term task management (Oyamada et al., 4 May 2026, Soni et al., 3 Jun 2025, Song et al., 28 Oct 2025, He et al., 25 Sep 2025, Mostafa et al., 8 Jun 2026, Wang et al., 2024, Xu et al., 2024).
