AgentBay: Secure Hybrid Human-AI Sandbox

Updated 7 December 2025

AgentBay is a multi-tenant hybrid-interaction sandbox service that enables secure and seamless human-AI collaboration in mission-critical settings.
It integrates high-fidelity virtualization, strict security isolation, and a unified control interface to ensure robust performance and rapid human intervention.
Its core Adaptive Streaming Protocol dynamically balances command and video streams, achieving low latency and resilient performance under varying network conditions.

AgentBay is a multi-tenant, hybrid-interaction sandbox service for facilitating seamless human-AI intervention within agentic systems, providing a secure, isolated, and responsive execution environment for both AI agents and human operators in mission-critical applications. It is designed to address the brittleness of autonomous AI agents—particularly those powered by LLMs—by enabling real-time Human-in-the-Loop (HITL) oversight and intervention without interruption or reprovisioning. AgentBay combines high-fidelity virtualized environments, sandboxing primitives, a unified hybrid control interface, and its core innovation, the Adaptive Streaming Protocol (ASP), to deliver robust and efficient human–AI collaboration (Piao et al., 4 Dec 2025).

1. System Architecture and Security Isolation

AgentBay’s architecture is a layered, multi-tenant service that can instantiate Windows, Linux, Android (with containerized emulators), full-featured web browsers, and a language-and-runtime “Code Space” interpreter as isolated execution environments. Isolation and containment are achieved via KVM-based virtual machines (VMs) or hardware-accelerated containers, each provisioned with:

Private virtual networks (VPC) using default-deny egress/ingress rules
Ephemeral private file systems, destroyed on teardown
Seccomp system-call filtering and cgroups quotas on CPU, memory, and process counts
TLS-tunneled access, enforced by a hardened gateway and Web Application Firewall (WAF)
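
The default-deny rule in the list above can be illustrated with a minimal sketch. The `EgressPolicy` class, its method names, and the example addresses are hypothetical and are not taken from AgentBay; the sketch only shows the general idea that a destination is reachable only when an explicit allow rule matches.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class EgressPolicy:
    """Hypothetical default-deny egress policy: nothing leaves unless allowed."""

    # Explicit (network, port) allow rules; an empty list denies everything.
    allowed: list = field(default_factory=list)

    def allow(self, cidr: str, port: int) -> None:
        """Add one allow rule for a destination network and port."""
        self.allowed.append((ipaddress.ip_network(cidr), port))

    def permits(self, dest_ip: str, port: int) -> bool:
        """Return True only if some allow rule matches the destination."""
        addr = ipaddress.ip_address(dest_ip)
        return any(addr in net and port == p for net, p in self.allowed)


policy = EgressPolicy()
policy.allow("10.0.0.0/24", 5432)  # assumed internal database subnet
```

With this policy, `policy.permits("10.0.0.7", 5432)` is true, while any other destination, such as an external host contacted by `curl`, is rejected because no rule matches it.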

The internal architecture comprises four layers:

Interface Layer: Exposes Model Context Protocol (MCP), open-source SDKs (TypeScript, Python, Go), and the ASP client library.
Service Layer: Manages sandbox lifecycle tasks, available tools (browser, emulator, code execution), and the streaming service for ASP.
Environment Layer: Hosts mirrored OS images and interpreters for high-fidelity reproducibility.
Feature Layer: Supports session management, context persistence, port mapping, file operations, command execution, and dynamic network configuration.

Each sandbox session is protected by a short-lived, scoped token. Both agent API calls (via MCP/SDK) and human control streams (ASP) are unified within one gateway, enabling a consistent, enforceable security policy (Piao et al., 4 Dec 2025).
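
As an illustration of what a short-lived, scoped session token involves, the following sketch signs a payload with an HMAC and checks session, scope, and expiry on every request. The token layout, the `SECRET` key, and the scope names used in the usage note are assumptions made for this example; the source does not describe AgentBay's actual token format.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # hypothetical key; not part of AgentBay


def issue_token(session_id, scopes, ttl_s=300, now=None):
    """Sign a payload that names one session, its scopes, and an expiry."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"sid": session_id, "scopes": sorted(scopes), "exp": now + ttl_s}
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig


def check_token(token, session_id, scope, now=None):
    """Accept only an untampered, unexpired token for this session and scope."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(body.encode())
    except ValueError:  # malformed token or invalid base64
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return False
    claims = json.loads(payload)
    return (
        claims["sid"] == session_id
        and scope in claims["scopes"]
        and now < claims["exp"]
    )
```

A token issued with `issue_token("s1", {"asp:stream"}, ttl_s=60)` would pass `check_token` for session `"s1"` and scope `"asp:stream"` during the next 60 seconds and fail for any other session, scope, or later time. Because agent and human traffic share one gateway, a single check of this kind could in principle cover both paths.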

Security isolation is empirically validated: in controlled experiments, native baseline environments were fully compromised by recursive deletion (“rm -fr /”) and outbound data exfiltration (“curl”) attacks, whereas AgentBay’s sandbox showed zero host impact and blocked 100% of exfiltration attempts via default-deny networking (Piao et al., 4 Dec 2025).

2. Hybrid Control Interface

AgentBay allows a single, persistent session to be jointly driven by an LLM-based agent and a human operator without requiring session restarts or reconfiguration. Programmatic control is provided via the MCP REST/gRPC API and open-source SDKs (“click button X,” “navigate to URL Y,” “run shell command”), while real-time human control is delivered through a desktop stream (ASP) with mouse/keyboard event injection.

Session-level input multiplexing is implemented by an arbiter: agent inputs have priority by default, but any human interaction (mouse/key event) instantly grabs exclusive control, which persists until a subsequent period of agent-only activity is detected. All state, logs, screenshots, and video fragments from both controllers are aggregated within a central Session Context, permitting agents to resume precisely from where a human left off without context loss.
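
The arbitration rule described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the class name, the `human_idle_release_s` threshold, and the reading of "a period of agent-only activity" as "the agent submits input after the human has been silent for that long" are all illustrative choices, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class InputArbiter:
    """Hypothetical session-level arbiter for agent and human input."""

    human_idle_release_s: float = 2.0   # assumed release threshold
    owner: str = "agent"                # the agent drives by default
    last_human_ts: float = float("-inf")

    def submit(self, source: str, ts: float) -> bool:
        """Return True if the event from `source` at time `ts` is injected."""
        if source == "human":
            # Any human event takes exclusive control immediately.
            self.owner = "human"
            self.last_human_ts = ts
            return True
        # Agent event: hand control back only after the human has been idle.
        if self.owner == "human" and ts - self.last_human_ts >= self.human_idle_release_s:
            self.owner = "agent"
        return self.owner == "agent"
```

For example, an agent event at `t = 1.5` is dropped if a human event arrived at `t = 1.0`, while an agent event at `t = 3.5` is accepted and returns control to the agent.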

Empirical results distinguish two timescales. Asserting manual control takes under 50 ms, which minimizes context-switch friction and removes the need for new provisioning. The full end-to-end human takeover averages 15–30 s, and once the agent resumes, recovery succeeds in more than 95% of cases (Piao et al., 4 Dec 2025).

3. Adaptive Streaming Protocol (ASP): Formal Model and Techniques

ASP is engineered to provide ultra-low-latency, resilient, and bandwidth-adaptive streaming suitable for hybrid (human–agent) control. It uniquely blends a low-bandwidth, event-driven “command stream” with a video-like “frame stream,” dynamically mixing these based on measured bandwidth, network conditions, and current controller.

3.1 Rate Allocation Model

Let:

$C(t)$ : command-stream data rate (key-events)
$V(t)$ : video-stream bitrate
$B(t)$ : end-to-end available bandwidth
$\alpha(t) \in [0, 1]$ : mixing factor (video fraction)

The protocol minimizes a combined distortion-latency objective, with $D$ the distortion term and $L$ the latency term:

$\min_{\alpha} \; D\bigl(\alpha;V,C\bigr) + \lambda\,L\bigl(\alpha;V,C\bigr) \quad \text{s.t.}\quad \alpha\,V + (1-\alpha)\,C \le B$

where $\lambda$ controls the fidelity–responsiveness tradeoff: a larger $\lambda$ weights latency more heavily (Piao et al., 4 Dec 2025).
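
To make the constrained objective concrete, the sketch below solves it by grid search over $\alpha$. The source does not give closed forms for $D$ and $L$, so the two functions here are placeholder assumptions: distortion falls as the video fraction rises, and latency is approximated by link utilization. The function names and the example rates are likewise hypothetical.

```python
def distortion(alpha):
    """Placeholder D: less video (smaller alpha) means more distortion."""
    return (1.0 - alpha) ** 2


def latency(rate, bandwidth):
    """Placeholder L: link utilization as a stand-in for delay."""
    return rate / bandwidth


def choose_alpha(V, C, B, lam, steps=100):
    """Grid-search alpha in [0, 1] subject to alpha*V + (1-alpha)*C <= B."""
    best_alpha, best_cost = None, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        rate = alpha * V + (1.0 - alpha) * C
        if rate > B:
            continue  # violates the bandwidth constraint
        cost = distortion(alpha) + lam * latency(rate, B)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha  # None if even alpha = 0 does not fit


# Example rates in Mbps (assumed): video 8, commands 0.5, link 4.
fidelity_first = choose_alpha(V=8.0, C=0.5, B=4.0, lam=0.0)
latency_first = choose_alpha(V=8.0, C=0.5, B=4.0, lam=10.0)
```

Under these placeholder functions, `fidelity_first` is the largest video fraction that still fits the link, and `latency_first` drops to zero because the latency penalty dominates. The qualitative behavior, not the specific numbers, is the point of the sketch.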

3.2 Region-of-Interest Encoding

The display frame is divided into $N$ regions of interest (ROIs). For each ROI $i$, an encoding mode $m_i \in \{\text{video}, \text{command}\}$ is selected. The aggregate bitrate is:

$R = \sum_{i=1}^N \bigl[\mathbb{I}(m_i=\text{video})\,r_i^V + \mathbb{I}(m_i=\text{command})\,r_i^C\bigr]$

where

$r_i^V = q_V \cdot \mathrm{size}(\Delta_i), \quad r_i^C = q_C \cdot \mathrm{size}(\Delta_i)$

Here $\Delta_i$ is the pixel delta of ROI $i$, and $q_V > q_C$ are the quantization factors applied to video-mode and command-mode ROIs.

The decoder reconstructs the UI by upsampling command strokes and overlaying video patches as needed.
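
The aggregate bitrate $R$ is a per-ROI sum in which each region contributes either its video rate or its command rate. A minimal sketch follows; the `ROI` type, the field names, and the numeric values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ROI:
    mode: str          # "video" or "command", i.e. m_i
    delta_size: float  # size(Delta_i), the size of the pixel delta


def aggregate_bitrate(rois, q_v, q_c):
    """Compute R = sum_i [I(m_i=video) q_V + I(m_i=command) q_C] * size(Delta_i)."""
    if q_v <= q_c:
        raise ValueError("the model assumes q_V > q_C")
    total = 0.0
    for roi in rois:
        if roi.mode == "video":
            total += q_v * roi.delta_size
        elif roi.mode == "command":
            total += q_c * roi.delta_size
        else:
            raise ValueError(f"unknown mode: {roi.mode!r}")
    return total


# One video-mode region and one command-mode region (assumed values).
example_rate = aggregate_bitrate(
    [ROI("video", 100.0), ROI("command", 40.0)], q_v=2.0, q_c=0.5
)
```

Here the video region contributes 200 and the command region 20, so `example_rate` is 220. Switching a region from video to command mode lowers its contribution by the factor $q_C / q_V$.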

3.3 Bandwidth Adaptation Strategies

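One-way delay ($\tau$) and packet loss ($p$) are measured via periodic ping. The mixing factor is updated as:

$\alpha(t+1) = \sigma\bigl[\kappa_1(B(t) - R(t)) - \kappa_2(\tau(t) - \tau_{\max}) - \kappa_3\,p(t)\bigr]$

where $\sigma(x)=(1+\exp(-x))^{-1}$ is the logistic function, the $\kappa_j$ are gain coefficients, and $\tau_{\max}$ is the maximum tolerated delay. When a human is in control, $\lambda$ is increased to prioritize latency, since a larger $\lambda$ weights the latency term of the Section 3.1 objective more heavily; under agent-only control, ASP increases visual fidelity (Piao et al., 4 Dec 2025).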
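
The update rule for the mixing factor can be written directly as code. In this sketch the gain values and the example measurements are assumptions chosen only to show the direction of each term; the source does not report the coefficients used in practice.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def update_alpha(B, R, tau, tau_max, p, k1, k2, k3):
    """Next video fraction: alpha(t+1) = sigma[k1(B-R) - k2(tau-tau_max) - k3 p]."""
    return sigmoid(k1 * (B - R) - k2 * (tau - tau_max) - k3 * p)


# Assumed gains; bandwidth in Mbps, delay in seconds, loss as a fraction.
gains = dict(k1=1.0, k2=20.0, k3=10.0)
neutral = update_alpha(B=5.0, R=5.0, tau=0.05, tau_max=0.05, p=0.0, **gains)
lossy = update_alpha(B=5.0, R=5.0, tau=0.05, tau_max=0.05, p=0.10, **gains)
headroom = update_alpha(B=6.0, R=5.0, tau=0.05, tau_max=0.05, p=0.0, **gains)
```

With no spare bandwidth, delay at its limit, and no loss, the argument of $\sigma$ is zero and `neutral` is 0.5. Packet loss or excess delay pushes the value below 0.5 (less video), while spare bandwidth pushes it above 0.5 (more video).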

4. Empirical Evaluation and Benchmarking

AgentBay’s system is evaluated across four dimensions: security isolation, HITL task completion, agent benchmarks, and ASP protocol performance.

4.1 Security

Vector A (recursive delete): native baseline—full compromise; AgentBay—no host compromise.
Vector B (data exfiltration): native baseline—full compromise; AgentBay—100% block (default-deny egress).

4.2 HITL Web Automation

Using Claude Sonnet 4.5 + ReAct, task success rates improved substantially in hybrid mode:

Failure Mode	Agent-Only	Hybrid	Improvement (relative)
Floating-ad reading	27%	97%	+259%
CAPTCHA handling	64%	95%	+48%
Password input	0%	100%	+100 pp (absolute)
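
The improvement column mixes relative and absolute changes, which the following arithmetic sketch makes explicit. The helper name is hypothetical; the inputs are the success rates from the table.

```python
def relative_improvement(before, after):
    """Relative change in percent, or None when the baseline is zero."""
    if before == 0:
        return None  # relative change is undefined from a 0% baseline
    return (after - before) / before * 100.0


rows = {
    "Floating-ad reading": (27, 97),
    "CAPTCHA handling": (64, 95),
    "Password input": (0, 100),
}
for name, (agent_only, hybrid) in rows.items():
    rel = relative_improvement(agent_only, hybrid)
    absolute_pp = hybrid - agent_only
    print(name, rel, absolute_pp)
```

The first two rows give about 259% and 48% relative improvement. For password input the baseline is 0%, so only the absolute gain of 100 percentage points is meaningful.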

4.3 Open-Source Agent Benchmarks

SeeAct on Online-Mind2Web (14 Easy tasks): Physical machine 35.71%; AgentBay 42.86% (Δ +7.15 pp)
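
With only 14 tasks, each task is worth about 7.14 percentage points, so the reported gap is small in absolute terms. The sketch below recovers the implied task counts, assuming each percentage is the fraction of the 14 tasks completed; the helper name is hypothetical.

```python
def implied_successes(rate_percent, n_tasks):
    """Whole number of successes whose rate matches rate_percent to 2 decimals."""
    k = round(rate_percent / 100.0 * n_tasks)
    if abs(k / n_tasks * 100.0 - rate_percent) > 0.005:
        raise ValueError("rate is not a whole number of tasks")
    return k


physical = implied_successes(35.71, 14)
agentbay = implied_successes(42.86, 14)
```

Under that assumption the two rates correspond to 5 and 6 successful tasks, so the +7.15 pp difference amounts to one additional task.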

4.4 ASP vs. RDP Performance

Metric	ASP	RDP	Relative Gain
Latency (click-to-photon, ms)	117	122	≈4% lower
Stutter Rate (10% packet loss, 20 fps)	16.45%	96.16%	—
Bandwidth (video playback, Mbps)	4.6	10.2	−55%
SSIM (0–1, browsing)	0.833	0.827	—

All improvements are statistically significant ( $p < 0.01$ , paired t-tests). This suggests that ASP enables both lower bandwidth and greater resilience under adverse network conditions compared to RDP (Piao et al., 4 Dec 2025).

5. Robustness, Design Insights, and Collaboration Efficiency

Isolation Guarantees: Multi-tenant VM isolation and default-deny networks prevent agent escape and host compromise.
Bandwidth Adaptation: Region-aware adaptive compression ensures interactivity when $B < 5$ Mbps; under pronounced packet loss, hybrid TCP+UDP transport in ASP outperforms legacy remoting.
Unified Session Model: Instant human intervention (<50 ms) and centralized session context minimize provisioning and synchronization costs. A plausible implication is increased operator efficiency and reduced task recovery time.
Takeover Dynamics: End-to-end takeover averages 15–30 s, with >95% session recovery success, indicating that seamless hybrid interaction is viable.

6. Applications, Limitations, and Future Directions

Use cases for AgentBay include:

Human-signoff in enterprise RPA (finance, healthcare)
Secure code-generation within CI/CD pipelines
LLM-driven UX testing across mobile and web interfaces
Reinforcement learning (RL) environment scaffolding for game AI

Identified limitations:

VM startup overhead of approximately 3 s; dynamic prefetching is under consideration.
ASP operates at 30 fps; ongoing work targets 60 fps and WebRTC hybridization.
ROI classification relies on rule-based methods; machine-learned ROI prediction is proposed for improved QoS.

Future directions comprise dynamic pre-warmed VM pools, ML-based ROI prediction for optimized bandwidth allocation, integration of real-time audio streaming for voice-driven HITL, and development of open standards for hybrid-interaction protocols to foster broader adoption.

In summary, AgentBay’s layered architecture, hybrid control interface, and formally grounded adaptive streaming protocol establish a resilient, low-latency, and secure environment for seamless human–AI interaction, providing robust primitives for next-generation, mission-critical autonomous agent systems (Piao et al., 4 Dec 2025).

References

Piao et al. (4 Dec 2025). AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems.
