
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents (2511.03690v1)

Published 5 Nov 2025 in cs.SE and cs.AI

Abstract: Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability and integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.

Summary

  • The paper introduces a modular, composable SDK that redefines agent architecture with optional isolation, event-sourced state management, and two-layer composability.
  • It details robust methodologies including multi-LLM routing, integrated REST/WebSocket services, and secure secret management for production deployment.
  • Benchmark results validate the SDK’s competitive performance and scalability, establishing it as a viable framework for both research and industrial applications.

The OpenHands Software Agent SDK: Architecture, Implementation, and Evaluation

Introduction

The OpenHands Software Agent SDK presents a composable, extensible, and production-oriented foundation for building software engineering agents. The SDK is a complete architectural redesign of the original OpenHands framework, motivated by the need for flexibility, reliability, and scalable deployment in real-world software development environments. The design addresses critical limitations of monolithic agent architectures, introducing modularity, statelessness, strict separation of concerns, and two-layer composability. The SDK supports seamless local-to-remote execution, integrated REST/WebSocket services, and direct connectivity to diverse user interfaces, including VS Code, VNC, and browser-based workspaces.

Architectural Evolution and Design Principles

The transition from OpenHands V0 to V1 is characterized by a shift from a tightly coupled, sandbox-centric monolith to a modular SDK with clear boundaries and opt-in sandboxing. Figure 1 illustrates the V0 starting point.

Figure 1: OpenHands V0 architecture: monolithic, tightly coupled components with mandatory sandboxing, leading to duplicated implementations and brittle local execution workflows.

V0’s universal sandboxing model, while ensuring safety, introduced significant friction for local workflows and required duplicated logic for CLI runtimes. V1 refactors this into four decoupled packages: SDK, Tools, Workspace, and Agent Server. This modularity enables independent development, testing, and deployment, supporting both rapid prototyping and robust production scenarios.

The four guiding principles of V1 are:

  • Optional isolation: Agents run locally by default, with transparent opt-in sandboxing for safety.
  • Stateless by default: All components are immutable and validated at construction; mutable context is isolated in a single conversation state object.
  • Strict separation of concerns: The agent core is decoupled from applications, enabling shared library usage and preventing logic duplication.
  • Two-layer composability: Developers can compose deployment packages and extend the SDK by adding or replacing typed components.

Core Components and Implementation

Modular Four-Package Design

The SDK is organized into four Python packages:

  • openhands.sdk: Core abstractions (Agent, Conversation, LLM, Tool, MCP, event system).
  • openhands.tools: Concrete tool implementations.
  • openhands.workspace: Execution environments (local, Docker, hosted API).
  • openhands.agent_server: Web server exposing REST/WebSocket APIs.

This separation enables lightweight integration, isolated testing, and incremental release cycles, critical for production deployments.

Event-Sourced State Management

The SDK employs an event-sourcing pattern, treating all interactions as immutable events appended to a log. The ConversationState class is the sole mutable component, maintaining metadata and an append-only event log. This design enables deterministic replay, strong consistency, and efficient incremental persistence. Conversations can be resumed by loading the base state and replaying events, supporting robust fault recovery.
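The pattern described above can be sketched in a few lines of library-free Python. The class and field names below are simplified stand-ins for illustration, not the SDK's actual `ConversationState` API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "message", "action", "observation"
    payload: str

@dataclass
class ConversationState:
    metadata: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # append-only log

    def append(self, event: Event) -> None:
        self.events.append(event)

def replay(base: dict, events: list) -> ConversationState:
    """Rebuild a conversation deterministically from base state plus log."""
    state = ConversationState(metadata=dict(base))
    for ev in events:
        state.append(ev)
    return state

log = [Event("message", "fix the failing test"), Event("action", "run pytest")]
resumed = replay({"id": "conv-1"}, log)
assert resumed.events == log  # replay reproduces the original state
```

Because events are immutable and the log is append-only, persisting a session reduces to writing the base state once and each new event incrementally, which is what makes resumption and fault recovery cheap.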

LLM Abstraction and Multi-LLM Routing

The LLM class provides a unified interface to 100+ LLMs via LiteLLM, supporting both standard and advanced reasoning APIs. The SDK captures native reasoning fields (e.g., Anthropic’s ThinkingBlock, OpenAI’s ReasoningItemModel) and implements fallback mechanisms for non-function-calling models using prompt-based tool invocation and regex extraction.
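The fallback path for non-function-calling models can be illustrated with a small sketch: the prompt asks the model to emit a tagged JSON tool call, and a regex recovers it from the raw completion. The `<tool_call>` tag format below is an illustrative assumption, not the SDK's actual wire format.

```python
import json
import re

# Illustrative tag format; a real system would pin this format in the prompt.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_call(completion: str):
    """Return (tool_name, arguments) if the completion embeds a tool call."""
    match = TOOL_CALL_RE.search(completion)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return call["name"], call.get("arguments", {})

raw = ('Sure, running it now. '
       '<tool_call>{"name": "bash", "arguments": {"cmd": "pytest"}}</tool_call>')
assert extract_tool_call(raw) == ("bash", {"cmd": "pytest"})
```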

Multi-LLM routing is supported via the RouterLLM class, enabling dynamic model selection based on input content (e.g., routing text to a cheaper model, images to a multimodal model).
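The routing decision itself can be as simple as inspecting message content. The sketch below assumes a hypothetical message shape and model names; the SDK's `RouterLLM` interface is richer.

```python
def route(message: dict) -> str:
    """Send multimodal content to a vision-capable model,
    plain text to a cheaper text-only model."""
    has_image = any(part.get("type") == "image" for part in message["content"])
    return "vision-model" if has_image else "cheap-text-model"

text_msg = {"content": [{"type": "text", "text": "summarize this diff"}]}
image_msg = {"content": [{"type": "text", "text": "what is in this screenshot?"},
                         {"type": "image", "url": "screenshot.png"}]}
assert route(text_msg) == "cheap-text-model"
assert route(image_msg) == "vision-model"
```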


Figure 2: Multi-LLM routing example: RouterLLM delegates requests to selected models based on message content.

Tool System and MCP Integration

The tool system is grounded in an Action–Execution–Observation pattern, with type-safe input validation, structured execution, and LLM-compatible output formatting. MCP tools are treated as first-class SDK tools, with automatic schema translation and structured observation. The registry-based resolution mechanism supports distributed execution and lazy instantiation, enabling tool specs to cross process or network boundaries as pure JSON.
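A minimal sketch of the Action–Execution–Observation pattern, using simplified stand-in classes (the SDK's real tool interfaces carry schemas, metadata, and registry hooks on top of this):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    command: str

    def __post_init__(self):
        # Input validation happens before anything executes.
        if not self.command.strip():
            raise ValueError("command must be non-empty")

@dataclass(frozen=True)
class Observation:
    output: str

    def to_llm_text(self) -> str:
        # LLM-compatible output formatting.
        return f"[tool output]\n{self.output}"

class EchoExecutor:
    """Toy executor; real executors run a shell, editor, or browser."""
    def __call__(self, action: Action) -> Observation:
        return Observation(output=f"echo: {action.command}")

executor = EchoExecutor()
obs = executor(Action(command="ls"))
assert obs.to_llm_text() == "[tool output]\necho: ls"
```

Because actions and observations are plain validated data, a tool spec can be serialized as JSON and resolved to an executor on the other side of a process or network boundary, which is the property the registry-based resolution mechanism exploits.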


Figure 3: Tool system structure: Actions validate inputs, executors run logic, and observations format outputs for LLMs.

Agent Abstraction and Event-Driven Execution

Agents are stateless, immutable specifications, with event-driven execution loops that emit structured events via callbacks. This enables security interleaving, incremental execution, and event streaming for real-time UI updates. Agent context is customizable via skills and prompts, supporting rich behavioral augmentation. Sub-agent delegation is implemented as a standard tool, enabling hierarchical coordination without modifying the core SDK.

Context Window Management

The Condenser system manages context window limits by summarizing and replacing events when history grows too large. The default LLMSummarizingCondenser reduces API costs by up to 2× with no degradation in agent performance.
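The core mechanism can be sketched without the library: once the event log exceeds a threshold, older events are replaced by a single summary event while recent events are kept verbatim. The `summarize()` stub below stands in for an LLM call; thresholds and names are illustrative.

```python
def summarize(events: list) -> str:
    # Stand-in for an LLM summarization call.
    return f"<summary of {len(events)} earlier events>"

def condense(events: list, max_events: int = 4, keep_recent: int = 2) -> list:
    """Replace old events with one summary once the log exceeds max_events."""
    if len(events) <= max_events:
        return events
    old, recent = events[:-keep_recent], events[-keep_recent:]
    return [summarize(old)] + recent

history = ["e1", "e2", "e3", "e4", "e5", "e6"]
assert condense(history) == ["<summary of 4 earlier events>", "e5", "e6"]
```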

Local and Remote Execution

The Conversation class abstracts over local and remote execution, enabling seamless migration from prototyping to production. Local conversations run in-process; remote conversations delegate execution to an agent server via HTTP/WebSocket, supporting containerized multi-user deployments.


Figure 4: Local-to-remote transition: swapping workspace type enables seamless migration without code changes.
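The "swap the workspace, keep the agent code" idea reduces to a shared interface. The sketch below is a library-free illustration; the class names echo the SDK's local/Docker workspace split, but the `run()` signature is an assumption.

```python
class LocalWorkspace:
    def run(self, command: str) -> str:
        return f"ran locally: {command}"

class DockerWorkspace:
    def __init__(self, image: str):
        self.image = image

    def run(self, command: str) -> str:
        # A real implementation would delegate over HTTP to an agent server.
        return f"ran in {self.image}: {command}"

def do_task(workspace) -> str:
    # Agent code is identical for both environments.
    return workspace.run("pytest -q")

assert do_task(LocalWorkspace()) == "ran locally: pytest -q"
assert do_task(DockerWorkspace("sandbox:latest")) == "ran in sandbox:latest: pytest -q"
```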

The agent server implements REST endpoints and WebSocket streaming, supporting scalable, isolated execution via official Docker images with dedicated file systems and resources.


Figure 5: Agent server architecture: client serializes agent configuration via HTTP; server executes using SDK components inside the container and streams events via WebSocket.

Workspace Abstraction

The BaseWorkspace class abstracts the agent's execution environment: the local implementation forwards operations to the host, while the remote implementation delegates them over HTTP. The factory pattern ensures agent code remains unchanged across environments.


Figure 6: Workspace interface: implementations handle environment details for local and remote execution.

Security and Secrets Management

Security is a first-class concern, with the SecurityAnalyzer rating tool calls and the ConfirmationPolicy determining user approval requirements. The SDK includes built-in risk assessment and confirmation workflows, supporting adaptive trust and custom policies. The SecretRegistry provides secure, session-isolated credential management, with automatic masking and support for live rotation.
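The flow can be sketched as: rate each tool call, gate it through a policy, and mask credentials in anything the agent emits. Risk heuristics, policy names, and the registry shape below are illustrative assumptions, not the SDK's actual SecurityAnalyzer/SecretRegistry implementation.

```python
# Toy heuristic; a real analyzer may use an LLM or static rules.
RISKY_PATTERNS = ("rm -rf", "curl | sh", "chmod 777")

def rate_risk(command: str) -> str:
    return "high" if any(p in command for p in RISKY_PATTERNS) else "low"

def requires_confirmation(command: str, policy: str = "confirm_risky") -> bool:
    """Confirmation policy: 'confirm_all' gates everything,
    'confirm_risky' only gates high-risk calls."""
    if policy == "confirm_all":
        return True
    return rate_risk(command) == "high"

def mask_secrets(text: str, secrets: dict) -> str:
    """Redact registered secret values from agent-visible output."""
    for value in secrets.values():
        text = text.replace(value, "***")
    return text

assert requires_confirmation("ls -la") is False
assert requires_confirmation("rm -rf /tmp/build") is True
assert mask_secrets("token=abc123 used", {"API_KEY": "abc123"}) == "token=*** used"
```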

Reliability, Evaluation, and Benchmarking

Continuous Quality Assurance

The SDK employs a three-tier testing strategy:

  • Programmatic tests: Mock LLM calls for fast feedback.
  • LLM-based tests: Integration and example tests with real models, costing $0.50–$3 per run.
  • Benchmark evaluation: High-cost, comprehensive evaluations on academic datasets.


Figure 7: Integration test framework: scenario-based workflows for agent reliability.

Benchmark Results

The SDK demonstrates competitive performance on SWE-Bench Verified and GAIA benchmarks:

Benchmark            Model                     Performance
SWE-Bench Verified   Claude Sonnet 4.5         72.8%
SWE-Bench Verified   GPT-5 (reasoning=high)    68.8%
GAIA                 Claude Sonnet 4.5         67.9%
GAIA                 GPT-5 (reasoning=high)    62.4%
GAIA                 Qwen3 Coder 480B A35B     41.2%

These results validate the SDK’s architecture as competitive with research-focused systems, with no loss in agentic capability.

A systematic comparison with OpenAI Agents SDK, Claude Agent SDK, and Google ADK reveals that OpenHands uniquely combines native remote execution, production server with sandboxing, model-agnostic multi-LLM routing, and built-in security analysis. The SDK also provides features absent in other frameworks, such as context condensation, secrets management, stuck detection, and interactive workspace access.

Practical and Theoretical Implications

The OpenHands Software Agent SDK establishes a robust foundation for both research and industrial-scale deployment of software engineering agents. Its composable architecture, stateless event sourcing, and modular design enable reproducible, deterministic execution and seamless transition from local prototyping to production. The integration of security, secrets management, and interactive workspace access addresses critical concerns for real-world deployment. The SDK’s empirical performance on standardized benchmarks demonstrates its generality and reliability across diverse model backends.

Theoretically, the event-sourced, stateless design provides a blueprint for scalable agent architectures, supporting fault recovery, reproducibility, and extensibility. The separation of concerns and composability principles facilitate rapid experimentation and safe extension, enabling the development of new classes of custom applications.

Future Directions

Potential future developments include:

  • Enhanced support for asynchronous and distributed agent coordination.
  • Integration with advanced memory systems for long-term context retention.
  • Expansion of security analysis to cover adversarial scenarios and prompt injection.
  • Automated benchmarking and continuous evaluation pipelines for large-scale deployments.
  • Further abstraction of workspace environments to support hybrid cloud/local execution.

Conclusion

The OpenHands Software Agent SDK delivers a composable, stateless, and production-ready foundation for software engineering agents. Its modular architecture, event-sourced state management, and robust security features enable reliable operation across heterogeneous environments. Empirical evaluation confirms strong performance and consistency, validating the SDK as a practical and extensible platform for both research and industrial applications. The design principles and implementation strategies outlined in this work provide a template for future agent frameworks seeking to balance flexibility, reliability, and scalability.


Explain it Like I'm 14

Explaining “The OpenHands Software Agent SDK” in Simple Terms

What is this paper about?

This paper introduces a new toolkit (called an SDK) that helps people build smart software helpers—“AI agents”—that can write code, run programs, and fix bugs. The goal is to make these agents easy to build, safe to run, and reliable both on your laptop and on big servers.

What questions is the paper trying to answer?

Here are the main things the authors wanted to improve:

  • How can we make AI coding agents flexible so they work both locally (on your computer) and remotely (in a secure server) without changing lots of code?
  • How can we keep these agents safe and reliable, especially when they run commands or access files?
  • How can we give people good ways to interact with the agents (like in VS Code, a browser, or the command line)?
  • How can we design the system so it’s modular (built from parts) and easy to extend and test?
  • Do these design choices still perform well on real tasks and benchmarks?

How does the system work? (Methods and approach, with simple explanations)

Think of the SDK as a well-organized workshop with clear stations and safety rules, designed to help a smart assistant get work done.

  • What’s an SDK? An SDK (Software Development Kit) is a set of tools, code, and guidelines you use to build software. Here, it helps you build AI agents for software development.
  • What’s an AI agent? An agent is like a smart assistant that can read instructions, think, use tools, and act in a computer environment (like editing files or running tests).
  • What’s a sandbox? A sandbox is a safe, controlled room where the agent can run commands without risking the rest of your computer—like giving it a safe play area.
  • What’s event-sourcing? Imagine the agent keeps a diary. Every action it takes (a message, a tool it used, a result it saw) gets written as a new diary entry (an “event”). This makes it easy to replay what happened, recover after crashes, and understand the agent’s history.
  • What’s an LLM? LLM stands for Large Language Model, like Claude or GPT. It’s the brain that reads and writes text, decides which tools to use, and plans steps.
  • What’s MCP? MCP (Model Context Protocol) is a way to define and use external tools consistently across different models, so the agent can plug into tools like a web browser or file manager easily.
  • What’s REST/WebSocket? These are standard ways computers talk over the internet. REST is like sending letters; WebSocket is like a live phone call for streaming updates.

The SDK is built around four big design principles (think of these as the workshop’s rules):

  1. Optional isolation (sandboxing): Run locally by default for fast testing, and switch to a sandbox only when needed for safety.
  2. One source of truth for state: The agent’s configuration doesn’t change as it runs; only the conversation diary (event log) changes. That makes replay and recovery predictable.
  3. Strict separation of concerns: The core agent is separate from apps (like CLI, web UI, or GitHub apps), so you don’t mix the agent’s brain with the user interface.
  4. Two-layer composability: You can mix and match packages (SDK, Tools, Workspace, Server), and safely add or replace typed parts like tools and agents.

To make this work in real life, the SDK is split into four packages (like different workstations in the workshop):

  • sdk: The core brains and rules (Agent, Conversation, LLM, Tool, events).
  • tools: Actual tools the agent can use (e.g., run commands, edit files, browse).
  • workspace: Where the agent runs (local machine vs. remote container).
  • agent_server: A web server so agents can run remotely and stream events live.

Two particularly important ideas:

  • Tools follow an “Action → Execution → Observation” pattern. The LLM asks to use a tool (Action), the tool runs (Execution), and the results are captured (Observation). This makes tools safe and predictable.
  • Local to remote with minimal changes. You can start with a local workspace while testing. When you’re ready to deploy securely, you swap in a remote workspace (like Docker) with almost the same code.

The SDK also includes:

  • Context window management (condensing long histories into summaries to keep model costs low).
  • A Secret Registry (for safely handling API keys—masking them so they don’t leak).
  • A Security Analyzer and Confirmation Policy (the agent can pause and ask for approval if something looks risky—like deleting files).

What did they find? (Main results and why they matter)

The authors compared their SDK to other major ones (from OpenAI, Claude, and Google) and highlighted features that make OpenHands stand out, including:

  • Built-in secure remote execution with a production server.
  • Model-agnostic support for 100+ LLM providers, plus smart routing across models (choose cheaper or more capable models depending on the task).
  • Native sandboxing and lifecycle controls (pause/resume, restore history, delegate tasks to sub-agents).
  • Integrated security checks and secret masking.
  • Support for models that don’t have “function calling” by teaching them how to use tools through prompts.

Performance-wise, they tested on two recognized benchmarks:

  • SWE-Bench Verified (coding and bug-fixing tasks): Strong resolution rates (up to about 72% with Claude Sonnet 4.5).
  • GAIA (general computer tasks): Competitive scores (up to about 68% with Claude Sonnet 4.5).

These results show that the SDK is not just flexible—it also works well on real tasks.

Why is this important? (Implications and impact)

  • Faster prototyping and safer deployment: Developers can try ideas locally and then deploy to secure servers without rewriting their code.
  • Greater reliability: Event logs and deterministic state make it easier to debug, replay, and recover long-running agent sessions.
  • Better security: Built-in risk checks and secret handling help prevent unsafe actions and leaks.
  • More flexibility: You can use many different LLMs, add custom tools, and build complex behaviors without touching the core system.
  • Community-friendly: It’s open-source (MIT License), so teams, researchers, and companies can adopt, extend, and collaborate.

In short, this SDK gives people a solid foundation to build powerful, safe, and scalable AI software agents—making it easier to move from cool demos to real, production-ready systems.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues and missing evidence that future work could address.

  • Sandbox vs. local execution trade-offs are not quantified: no measurements of latency, throughput, crash isolation, resource contention, or failure rate differences between LocalWorkspace and containerized RemoteWorkspace across representative workloads.
  • Absence of a formal security threat model and end-to-end security evaluation: the paper does not detail attack surfaces (privilege escalation, filesystem exfiltration, tool supply-chain compromise, MCP server trust boundaries, API server attacks) or demonstrate effective mitigations via penetration testing or red-team exercises.
  • Default “local-first” execution implies elevated risk on user machines: the safety implications of opting out of sandboxing by default are not analyzed, and policy guidance for when to enforce isolation is missing.
  • LLMSecurityAnalyzer reliability and calibration are unproven: there is no empirical assessment of false negative/positive rates in risk ratings, susceptibility to prompt injection/adversarial examples, or alignment with a standardized risk taxonomy.
  • ConfirmationPolicy UX and efficacy are not evaluated: no data on user burden, time-to-approval, rates of unnecessary blocks, or overall impact on task success; UI/UX design for confirmations is not studied.
  • SecretRegistry lacks concrete operational guarantees: the paper omits cipher choices, key storage/rotation strategy, HSM/KMS integration, secure lifecycle of secrets in memory, and evidence that masking prevents leakage via side channels (logs, file writes, screenshots).
  • Event-sourcing semantics need rigor: exactly-once processing, idempotency of tool actions, event ordering under concurrency, schema versioning/migration strategies, and cross-version replay guarantees are not specified or validated.
  • “Deterministic replay” is underspecified given LLM nondeterminism: how replay interacts with variable LLM outputs (temperature, provider changes), and whether seeds or cached responses are used to ensure reproducibility, is unclear.
  • RemoteConversation reliability under network faults is not discussed: retry policies, backoff strategies, idempotent REST endpoints, WebSocket reconnection behavior, and consistency after partial failures are not described or measured.
  • MCP integration robustness is untested: handling of MCP server version drift, schema evolution, timeouts, authentication/authorization, and the security model for executing untrusted MCP tools are not evaluated.
  • Non-native function calling via prompt+regex lacks reliability analysis: no error-rate quantification (mis-parsing, tool-call hallucinations), injection-safety assessment, or fallback mechanisms are provided.
  • Multi-LLM routing strategies are minimal: no learning-based or feedback-driven routing, cost–quality optimization, or empirical evidence that routing improves performance/cost; selection criteria beyond a toy multimodal example are unspecified.
  • Context condensation risks and tuning are underexplored: the impact of summarization errors on task outcomes, parameter sensitivity (condense thresholds, summary granularity), and comparisons of condenser algorithms are not empirically characterized.
  • Sub-agent delegation is limited to blocking execution: no support for asynchronous coordination, dynamic scheduling, inter-agent communication, failure recovery of sub-agents, or resource management across delegated tasks.
  • Long-term memory across sessions is not supported: the table indicates this feature is missing; design, safety implications (privacy), and empirical benefits or risks are unresolved.
  • Benchmark scope is narrow and lacks ablations: results are limited to SWE-Bench Verified and GAIA with no ablation isolating architectural contributions (e.g., event-sourcing, condensation, routing), no sensitivity to workspace types, and no cost/latency benchmarks.
  • Reproducibility is uncertain: reliance on frontier proprietary models (e.g., Claude Sonnet, GPT-5) lacks details on exact versions, temperatures, prompts, provider configurations, and mitigations for provider-side nondeterminism.
  • Production scaling characteristics are not reported: throughput per agent/server, resource footprints, multi-tenant isolation and fairness, autoscaling policies, and queueing under load are not measured.
  • Observability and governance are incomplete: logging/tracing/metrics design, PII handling in event logs, retention policies, auditability, and compliance (e.g., GDPR/CCPA) are not specified.
  • API server security controls are not detailed: authentication/authorization (RBAC), per-tenant isolation boundaries, secure transport, rate limiting/DoS protections, and secret-handling over the wire are not described.
  • Tool registry and dynamic executor resolution pose trust risks: the security model for binding tool definitions at runtime, code injection mitigation, and provenance validation of tool executors are not addressed.
  • Human-in-the-loop interaction is underexamined: conflict resolution when users edit files concurrently with the agent, UX for interactive terminals/IDEs, and the cognitive load and acceptance of intervention workflows are not studied.
  • Failure handling and compensating actions are unspecified: policies for partial tool effects (e.g., partially applied edits), transactional semantics across tools, and recovery strategies after mid-step failures are not described.
  • Testing strategy may be brittle to model drift: LLM-based CI lacks coverage metrics, flakiness analysis, and controls for provider changes; how tests gate releases or detect subtle regressions is unclear.
  • Compatibility and migration guidance is missing: how third-party tools/MCP integrations upgrade across SDK versions, event schema changes, and backward compatibility guarantees are not documented.
  • Ethical and compliance considerations are unexplored: safeguards for agents acting on external systems (e.g., web, code repos), guardrails for destructive actions, and organizational policy integration are not discussed.

Practical Applications

Immediate Applications

Below are actionable use cases that can be deployed now, supported by the SDK’s modular architecture, event-sourced reliability, opt-in sandboxing, REST/WebSocket server, MCP integration, multi-LLM routing, security/confirmation controls, and strong benchmark performance.

  • Automated bug triage and patching in CI/CD
    • Sectors: software, DevOps
    • Tools/workflows: Conversation with local repo, DockerWorkspace for isolated runs, get_default_agent, built-in test instrumentation; SWE-Bench-like workflows to reproduce, edit, run tests, and open PRs
    • Assumptions/dependencies: Accessible test suites and CI runners; LLM coding proficiency and API keys; repo permissions; guardrails via SecurityAnalyzer/ConfirmationPolicy
  • Agent-powered remote sandboxes for developers and support teams
    • Sectors: software, customer support, IT operations
    • Tools/workflows: agent_server + DockerWorkspace with VS Code Web/VNC/Chromium for reproducing issues, inspecting environments, and guided fixes
    • Assumptions/dependencies: Container orchestration (Docker/Kubernetes), network access to agent server, role-based access control (RBAC)
  • Secure, policy-gated local automation for enterprise desktops
    • Sectors: finance, healthcare, gov/public sector
    • Tools/workflows: SecretRegistry with auto-masking, SecurityAnalyzer + ConfirmRisky policies, opt-in sandboxing; controlled file edits, git operations, browsing
    • Assumptions/dependencies: Enterprise secrets store integration; policy configuration; audit needs met via event logs and deterministic replay
  • Cost-optimized multi-LLM routing for agent tasks
    • Sectors: software, cloud ops, platform engineering
    • Tools/workflows: RouterLLM routes text-only steps to cheaper models and multimodal steps to stronger models; LiteLLM across 100+ providers
    • Assumptions/dependencies: Model availability via LiteLLM; routing policy tuned to task complexity; cost/latency monitoring
  • Unifying internal tools via MCP for a single agent interface
    • Sectors: enterprise software, data platforms
    • Tools/workflows: MCP tools treated as first-class via MCPToolDefinition and MCPToolExecutor; standardize tool schemas and results
    • Assumptions/dependencies: MCP servers for internal tools; tool schema hygiene; secure transport and credentials
  • Reproducible, auditable agent sessions for compliance and debugging
    • Sectors: compliance, risk, SRE/ops
    • Tools/workflows: Event-sourced ConversationState, deterministic replay, pause/resume; build “agent observability” dashboards over REST/WS streams
    • Assumptions/dependencies: Persistent storage for base_state.json and event JSONs; log retention policies; PII/secret redaction configured
  • Automated QA and test generation in development workflows
    • Sectors: software, QA, platform engineering
    • Tools/workflows: Built-in programmatic and LLM-based tests; example/integration test harness (BaseIntegrationTest); generate unit/integration tests and run them
    • Assumptions/dependencies: Access to test runners and environments; cost budgets for LLM test passes; CI integration
  • Education: agent-assisted programming labs and remote IDEs
    • Sectors: education, bootcamps
    • Tools/workflows: LocalConversation for quick iteration in notebooks; DockerWorkspace for class assignments in isolated containers; skills loaded from .openhands/skills/
    • Assumptions/dependencies: Course infrastructure providing per-student containers; API keys; instructor policies for confirmation/risk
  • Research: reproducible agent experiments and benchmarking
    • Sectors: academia, industrial research
    • Tools/workflows: Same agent spec across local/remote; built-in support for SWE-Bench Verified and GAIA; event logs for rigorous ablation/replay
    • Assumptions/dependencies: Benchmark datasets and harnesses; standardized tool suites; model access (e.g., Claude Sonnet 4.5, GPT-5 variants)
  • Browser automation for ops tasks and knowledge work
    • Sectors: operations, customer support, marketing
    • Tools/workflows: Persistent Chromium in remote workspace; structured tools for browsing, scraping, and screenshot capture with secret redaction
    • Assumptions/dependencies: Site policies/legal compliance for scraping; resource quotas to prevent container crashes; human-in-the-loop confirmation for risky actions
  • Personal project maintenance and local coding assistant
    • Sectors: daily life
    • Tools/workflows: LocalConversation to create/edit files, run commands, manage small repos; conditional skills and tool prompts; user confirmation on sensitive actions
    • Assumptions/dependencies: Local dev environment; API keys; basic policy settings to prevent destructive commands

Long-Term Applications

The following applications require further research, scaling, or development—especially around orchestration, safety, compliance, and organizational adoption.

  • Autonomous software maintenance at org scale
    • Sectors: software, platform engineering
    • Tools/workflows: Hierarchical agents with advanced (async) delegation, dynamic scheduling, fault-tolerant recovery; standardized change management and PR pipelines
    • Assumptions/dependencies: Robust sub-agent orchestration beyond current blocking tools; enterprise-grade guardrails; team trust and governance
  • Agent-as-a-Service platforms and marketplaces
    • Sectors: cloud platforms, developer tools
    • Tools/workflows: Multi-tenant agent_server with tool registry; per-tenant isolation; usage metering and billing; curated MCP tool catalogs
    • Assumptions/dependencies: SaaS ops maturity (quotas, isolation, billing), curated tool quality and security vetting, SLAs
  • Enterprise compliance and audit suites for AI agents
    • Sectors: finance, healthcare, gov/public sector
    • Tools/workflows: Policy engines over event logs, standardized risk taxonomies; audit/forensics dashboards; continuous DLP monitoring
    • Assumptions/dependencies: Regulatory acceptance of agent logs as audit artifacts; integration with GRC systems; formalized policies and attestations
  • Formal methods and static analysis integrated with the security analyzer
    • Sectors: software, safety-critical systems
    • Tools/workflows: Combine SecurityAnalyzer with program analysis, sandbox escape detection, and formal verification of changes/tests
    • Assumptions/dependencies: Mature static/dynamic analyzers; mappings from tool outputs to formal guarantees; model + tool reliability
  • Cross-application desktop automation with robust UI agents
    • Sectors: productivity software, RPA/robotics
    • Tools/workflows: General desktop agents coordinating terminal, browser, editors, and enterprise apps with MCP tools; consistent UI state tracking
    • Assumptions/dependencies: Reliable computer-use (GAIA-like) capabilities; better multimodal perception; tight OS/app permissions
  • Persistent, organization-wide agent memory and knowledge graphs
    • Sectors: knowledge management, internal tooling
    • Tools/workflows: Long-term memory across sessions (a capability the paper notes is not yet supported); knowledge consolidation via condensers and embeddings; cross-project recall
    • Assumptions/dependencies: Secure and compliant memory stores; indexing/search over event logs; privacy-preserving recall policies
  • Auto-adaptive multi-LLM routing based on real-time performance and cost
    • Sectors: platform engineering, cost optimization
    • Tools/workflows: Feedback-driven RouterLLM selecting models by task type, latency, accuracy, and budget; continuous evaluation loops
    • Assumptions/dependencies: Reliable telemetry on per-step outcomes; automated routing policy training; provider diversity and stability
  • Sector-specific agents with regulatory-grade workflows
    • Sectors: healthcare (clinical coding, claim validation), energy (config updates for grid software), finance (policy-driven code changes)
    • Tools/workflows: Domain-specific tools via MCP, pre-approved change templates, confirmation gates by risk; comprehensive audit trails
    • Assumptions/dependencies: Deep domain tooling; regulator-aligned processes; organizational buy-in for agent-mediated changes
  • Human-in-the-loop governance dashboards for production agents
    • Sectors: platform engineering, security/compliance
    • Tools/workflows: Real-time event streaming (WebSocket) to monitor thoughts/actions/observations, pause/resume, approve/reject, rollbacks via deterministic replay
    • Assumptions/dependencies: Usability for non-experts; policy-compliant redaction; integration with incident and change management systems
  • Standardization efforts for agent interoperability (MCP-first ecosystems)
    • Sectors: standards bodies, open-source communities
    • Tools/workflows: Shared schemas for tools/events/security annotations; certification processes for tool servers; reference implementations
    • Assumptions/dependencies: Broad MCP adoption; community governance; vendor cooperation across model and tool providers
  • Large-scale ops automation (data pipelines, infra-as-code changes)
    • Sectors: data engineering, cloud ops
    • Tools/workflows: Agents coordinating changes across repos/services with confirmation policies; scheduling windows; post-change validation
    • Assumptions/dependencies: Robust multi-repo orchestration and rollback; strong safeguards; organizational change control practices
  • Public-sector agent governance pilots
    • Sectors: public sector, digital services
    • Tools/workflows: Limited-scope deployments with opt-in sandboxing, event-sourced accountability, risk gating; independent audits of agent behavior
    • Assumptions/dependencies: Policy frameworks for agent operations; procurement and security reviews; public trust and transparency practices
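Several of the applications above lean on history condensation. The shape of that mechanism can be sketched with stdlib Python; note that the `Event` class, `condense` function, and thresholds are illustrative assumptions, and the SDK's `LLMSummarizingCondenser` produces the summary with an LLM rather than by concatenation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str       # e.g. "action", "observation", "condensation"
    content: str

def condense(history: list[Event], max_events: int = 8,
             keep_recent: int = 4) -> list[Event]:
    """When history exceeds max_events, replace the oldest events with a
    single summary event and keep the most recent ones verbatim.
    (Purely illustrative: the SDK's LLMSummarizingCondenser summarizes
    with an LLM and records the result as a CondensationEvent.)"""
    if len(history) <= max_events:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = Event("condensation", f"summary of {len(old)} earlier events")
    return [summary] + recent
```

Running `condense` on a 12-event history yields 5 events: one summary plus the 4 most recent, which is how the growing log is kept inside the LLM's context window.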

Glossary

  • Action–Execution–Observation pattern: A tooling abstraction where inputs (actions) are validated, executed, and returned as structured observations. "The V1 tool system provides a type-safe and extensible framework grounded in an Action–Execution–Observation pattern."
  • Agent Server: The server component that executes agents remotely and streams events over APIs. "At the deployment level, its four modular packages—SDK, Tools, Workspace, and Agent Server—combine flexibly to support local, hosted, or containerized execution."
  • AgentContext: A configuration object that shapes LLM behavior via prompts, skills, and optional tools. "AgentContext centralizes all inputs that shape LLM behavior, including prefixes/suffixes for system/user messages and user-defined Skill objects."
  • API-managed runtime: A remote execution environment managed via an API for running workspaces or agents. "…a containerized server (DockerWorkspace) or an API-managed runtime (APIRemoteWorkspace)."
  • append-only EventLog: An immutable log of events capturing all agent interactions in order. "…all changing variables live in ConversationState, making it the only stateful component. This class maintains two types of state… and (2) an append-only EventLog recording all agent interactions."
  • Chat Completions API: A standard LLM API interface for chat-based completions used across providers. "Through LiteLLM, it supports 100+ providers with two APIs: the standard Chat Completions API for broad compatibility…"
  • CondensationEvent: An event recording the result of history condensation and what was summarized. "The results of any given condensation are stored in the event log as a CondensationEvent."
  • Condenser system: A mechanism to summarize and drop history so it fits within the LLM context window. "To ensure the ever-growing history fits inside the LLM's context, the Condenser system drops events and replaces them with summaries whenever the history grows too large."
  • ConfirmRisky: A built-in confirmation policy that blocks execution above a specified risk threshold. "…LLMSecurityAnalyzer, which appends a security_risk field to tool calls, and ConfirmRisky policy, which blocks actions exceeding a configurable risk threshold (default: high)."
  • ConfirmationPolicy: A policy that determines when user approval is required before executing actions. "…the ConfirmationPolicy, which determines whether user approval is required before execution based on the action’s details and assessed risk."
  • ConversationState: The single source of mutable state for an agent’s execution, with metadata and event log. "…all changing variables live in ConversationState, making it the only stateful component."
  • Discriminated unions: A type system pattern enabling safe serialization/deserialization of variant event types. "…with type-safe serialization via discriminated unions."
  • DockerWorkspace: A workspace implementation that runs agent workloads inside a Docker container. "from openhands.workspace import DockerWorkspace"
  • Event-Driven Execution: An execution model where agents advance by emitting and processing structured events. "Event-Driven Execution."
  • Event-Sourced State Management: A state model where all changes are captured as immutable events for replay and recovery. "Event-Sourced State Management"
  • EventStore: The persistence layer that writes individual event JSON files for incremental durability. "…while EventStore persists events as individual JSON files to the corresponding directory."
  • FastMCP: An implementation used to connect to MCP servers and manage transport. "MCPToolExecutor delegates execution to FastMCP’s MCPClient, which manages server communication and transport details."
  • FIFO lock: A lock ensuring first-in-first-out ordering for thread-safe state updates. "A FIFO lock ensures thread-safe updates through a two-path pattern…"
  • GAIA: A benchmark suite for evaluating general agentic task-solving capability. "…strong results on SWE-Bench Verified and GAIA benchmarks…"
  • Kubernetes: A container orchestration system used for production deployments. "…local Docker, Kubernetes in production…"
  • LiteLLM: A compatibility layer that routes requests to many LLM providers via unified APIs. "Through LiteLLM, it supports 100+ providers…"
  • LLMConvertibleEvent: An event type that can be converted into messages consumable by an LLM. "LLMConvertibleEvent adds to_llm_message() for converting events into LLM format."
  • LLMSecurityAnalyzer: An analyzer that uses LLMs to assess the risk of proposed actions. "The SDK includes a built-in pair: LLMSecurityAnalyzer, which appends a security_risk field to tool calls…"
  • LLMSummarizingCondenser: The default condenser that uses an LLM to summarize history. "LLMSummarizingCondenser (the default condenser) has been shown to reduce API costs by up to 2× with no degradation in agent performance…"
  • LocalConversation: An in-process execution mode for rapid iteration without network/container overhead. "LocalConversation provides the simplest and most direct execution mode of the SDK…"
  • LocalWorkspace: A workspace implementation that runs directly on the host filesystem and shell. "Local Workspace executes in-process against the host filesystem and shell…"
  • Model Context Protocol (MCP): A protocol for agents and tools to communicate with shared context and capabilities. "…the Model Context Protocol (MCP)."
  • MCPClient: The client used to connect to and invoke MCP tools. "…FastMCP’s MCPClient, which manages server communication and transport details."
  • MCPToolDefinition: A tool definition that adapts MCP tool schemas into the SDK’s tool model. "MCPToolDefinition extends the standard ToolDefinition interface…"
  • MCPToolExecutor: An executor that forwards tool calls to an MCP server. "…while MCPToolExecutor delegates execution to FastMCP’s MCPClient…"
  • Model-agnostic multi-LLM routing: The ability to route requests across many model providers without vendor lock-in. "…model-agnostic multi-LLM routing across 100+ providers."
  • Multi-LLM routing: Selecting different LLMs for different requests within the same agent session. "Multi-LLM Routing Support."
  • NonNativeToolCallingMixin: A mixin that emulates function-calling by parsing tool calls from text outputs. "…the SDK implements a NonNativeToolCallingMixin, which converts tool schemas to text-based prompt instructions and parses tool calls from model outputs…"
  • ObservationBaseEvent: The base class for events representing tool execution results. "The action-observation loop uses ActionEvent for tool calls and ObservationBaseEvent subclasses for results…"
  • OpenAI Responses API: A newer OpenAI API used for advanced reasoning models. "…and the newer OpenAI Responses API for latest reasoning models."
  • opt-in sandboxing: A security model where sandboxing is optional and applied only when needed. "V1 refactors this into a modular SDK with clear boundaries, opt-in sandboxing, and reusable agent, tool, and workspace packages."
  • Pydantic models: Data classes with validation and serialization used to define immutable components. "V1 treats all agents and their components—tools, LLMs, etc—as immutable and serializable Pydantic models validated at construction."
  • ReasoningItemModel: A schema capturing structured reasoning traces from OpenAI models. "…and ReasoningItemModel for OpenAI's reasoning."
  • RemoteConversation: A conversation executed via an API server, streaming events over WebSocket. "…constructs a RemoteConversation, which serializes the agent configuration and delegates execution to an agent server over HTTP and WebSocket."
  • RemoteWorkspace: A workspace implementation delegating operations over HTTP to an Agent Server. "When provided a RemoteWorkspace, the same call transparently constructs a RemoteConversation…"
  • REST/WebSocket server: A server exposing HTTP endpoints and WebSocket streams for remote agent execution. "…a built-in REST/WebSocket server for remote execution…"
  • RouterLLM: An LLM wrapper that dynamically routes requests to selected underlying models. "SDK features RouterLLM, a subclass of LLM that enables the agent to use different models for different LLM requests."
  • SaaS-style multi-tenancy: Serving multiple users with isolated containers within a hosted service. "This containerized design simplifies deployment and enables SaaS-style multi-tenancy while preserving workspace isolation."
  • SecretRegistry: A per-conversation secret manager with masking and secure retrieval. "SecretRegistry provides secure, late-bound, and remotely manageable credentials for tool execution."
  • SecurityAnalyzer: A component that rates the risk of tool actions (e.g., low/medium/high/unknown). "…the SecurityAnalyzer, which rates each tool call as low, medium, high, or unknown risk…"
  • Stateless by default: A design principle where components are immutable, with all mutable context in a single state object. "Stateless by Default, One Source of Truth for State."
  • Sub-Agent Delegation: A mechanism allowing an agent to spawn and coordinate sub-agents as tools. "Sub-Agent Delegation."
  • SWE-Bench Verified: A benchmark evaluating software engineering agent capabilities. "Across multiple LLM backends, our SDK achieves strong results on SWE-Bench Verified and GAIA benchmarks…"
  • ThinkingBlock: Anthropic-specific structure for extended thinking content. "…such as ThinkingBlock for Anthropic's extended thinking…"
  • ToolExecutor: The callable that performs a tool’s logic when given a validated action. "…the Tool’s actual logic via the ToolExecutor, which receives a validated Action and performs the underlying execution."
  • Tool Registry: A registry used to resolve tool specifications into runnable implementations at runtime. "Tool Registry and Distributed Execution."
  • typed component model: A design allowing safe extension by replacing strongly typed components. "…the SDK exposes a typed component model—tools, LLMs, contexts, etc—so developers can extend or reconfigure agents declaratively…"
  • VNC desktop: A browser-accessible remote desktop provided with the agent server. "…a browser-based VSCode IDE, VNC desktop, and persistent Chromium browser—for human inspection and control."
  • VSCode IDE: A browser-based IDE integrated for interactive inspection and control. "…a browser-based VSCode IDE, VNC desktop, and persistent Chromium browser—for human inspection and control."
  • WAITING_FOR_CONFIRMATION: An agent state indicating execution is paused pending user approval. "When approval is required, the agent pauses in a special WAITING_FOR_CONFIRMATION state…"
  • Workspace factory: A factory that chooses local or remote workspace implementations transparently. "The factory Workspace(...) resolves to local when only working_dir is provided and to remote when host/runtime parameters are present…"
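The "discriminated unions", "append-only EventLog", and "EventStore" entries above fit together, and their interplay can be sketched with stdlib JSON. This is an assumption-laden miniature: the SDK uses Pydantic discriminated unions rather than dataclasses, and the tag field name `kind` and class shapes here are illustrative.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ActionEvent:
    kind: str
    tool: str
    args: dict

@dataclass(frozen=True)
class ObservationEvent:
    kind: str
    output: str

# The "kind" field is the discriminator: deserialization dispatches on
# it to reconstruct the right event type. (The SDK does this with
# Pydantic discriminated unions; the tag name is an assumption.)
EVENT_TYPES = {"action": ActionEvent, "observation": ObservationEvent}

def dump(event) -> str:
    return json.dumps(asdict(event))

def load(raw: str):
    data = json.loads(raw)
    return EVENT_TYPES[data["kind"]](**data)

class EventLog:
    """Append-only, as in the SDK's EventLog: events are only ever added,
    never mutated, so state can be rebuilt by replaying them in order.
    (The SDK's EventStore persists each event as its own JSON file;
    an in-memory list of JSON strings stands in for that here.)"""
    def __init__(self):
        self._events: list[str] = []   # one JSON document per event
    def append(self, event) -> None:
        self._events.append(dump(event))
    def replay(self):
        return [load(raw) for raw in self._events]
```

Because every event round-trips through JSON with its type tag, a crashed or paused conversation can be reconstructed deterministically by replaying the log, which is what makes the rollback-via-replay workflows above possible.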
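The "RouterLLM" and "multi-LLM routing" entries describe per-request model selection; one plausible selection rule is cheapest-capable-within-budget. The sketch below is not the SDK's RouterLLM logic, and the model names, prices, and capability sets are placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str
    cost_per_mtok: float   # illustrative price, USD per million tokens
    good_at: frozenset     # task categories the model handles well

# Hypothetical model pool; names and prices are placeholders, not
# recommendations from the paper.
MODELS = [
    ModelSpec("big-reasoner", 15.0, frozenset({"code", "reasoning"})),
    ModelSpec("fast-chat", 0.5, frozenset({"chat", "summarize"})),
    ModelSpec("mid-coder", 3.0, frozenset({"code", "chat"})),
]

def route(task: str, budget_per_mtok: float) -> ModelSpec:
    """Pick the cheapest model that claims competence at the task and
    fits the budget; fall back to the cheapest capable model overall.
    (A stand-in for RouterLLM, whose per-request selection logic the
    SDK lets developers customize.)"""
    capable = sorted((m for m in MODELS if task in m.good_at),
                     key=lambda m: m.cost_per_mtok)
    if not capable:
        raise ValueError(f"no model handles task {task!r}")
    affordable = [m for m in capable if m.cost_per_mtok <= budget_per_mtok]
    return affordable[0] if affordable else capable[0]
```

Because RouterLLM subclasses LLM, a policy like this can slot in wherever a single model is expected, which is what makes routing transparent to the rest of the agent.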

Open Problems

We found no open problems mentioned in this paper.


