
AgentBay: Secure Hybrid Human-AI Sandbox

Updated 7 December 2025
  • AgentBay is a multi-tenant hybrid-interaction sandbox service that enables secure and seamless human-AI collaboration in mission-critical settings.
  • It integrates high-fidelity virtualization, strict security isolation, and a unified control interface to ensure robust performance and rapid human intervention.
  • Its core Adaptive Streaming Protocol dynamically balances command and video streams, achieving low latency and resilient performance under varying network conditions.

AgentBay is a multi-tenant, hybrid-interaction sandbox service for facilitating seamless human-AI intervention within agentic systems, providing a secure, isolated, and responsive execution environment for both AI agents and human operators in mission-critical applications. It is designed to address the brittleness of autonomous AI agents—particularly those powered by LLMs—by enabling real-time Human-in-the-Loop (HITL) oversight and intervention without interruption or reprovisioning. AgentBay combines high-fidelity virtualized environments, sandboxing primitives, a unified hybrid control interface, and its core innovation, the Adaptive Streaming Protocol (ASP), to deliver robust and efficient human–AI collaboration (Piao et al., 4 Dec 2025).

1. System Architecture and Security Isolation

AgentBay’s architecture is a layered, multi-tenant service that can instantiate Windows, Linux, Android (with containerized emulators), full-featured web browsers, and a language-and-runtime “Code Space” interpreter as isolated execution environments. Isolation and containment are achieved via KVM-based virtual machines (VMs) or hardware-accelerated containers, each provisioned with:

  • Private virtual networks (VPC) using default-deny egress/ingress rules
  • Ephemeral private file systems, destroyed on teardown
  • Seccomp system-call filtering, plus cgroups for stringent CPU/memory/process quotas
  • TLS-tunneled access, enforced by a hardened gateway and Web Application Firewall (WAF)
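The default-deny networking rule in the list above can be illustrated with a small policy check. This is a minimal sketch and not AgentBay's implementation: the `AllowRule` shape, the `is_allowed` helper, and the example CIDR are all hypothetical. The point it shows is that traffic is rejected unless an explicit rule matches.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    """One explicit exception to the default-deny policy (hypothetical shape)."""
    direction: str  # "egress" or "ingress"
    cidr: str       # destination (egress) or source (ingress) network
    port: int


def is_allowed(rules, direction, address, port):
    """Return True only if some rule explicitly matches; otherwise deny."""
    addr = ip_address(address)
    for rule in rules:
        if (rule.direction == direction
                and rule.port == port
                and addr in ip_network(rule.cidr)):
            return True
    return False  # default deny: no matching rule means the packet is dropped


# Illustrative policy: the sandbox may reach one internal mirror over HTTPS only.
rules = [AllowRule("egress", "10.0.5.0/24", 443)]

print(is_allowed(rules, "egress", "10.0.5.17", 443))      # allowed by the rule
print(is_allowed(rules, "egress", "93.184.216.34", 443))  # no rule, so denied
```

Under such a policy, an outbound exfiltration attempt to an arbitrary host fails because no rule names that destination, not because the specific attack was anticipated.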

The internal architecture comprises four layers:

  • Interface Layer: Exposes Model Context Protocol (MCP), open-source SDKs (TypeScript, Python, Go), and the ASP client library.
  • Service Layer: Manages sandbox lifecycle tasks, available tools (browser, emulator, code execution), and the streaming service for ASP.
  • Environment Layer: Hosts mirrored OS images and interpreters for high-fidelity reproducibility.
  • Feature Layer: Supports session management, context persistence, port mapping, file operations, command execution, and dynamic network configuration.

Each sandbox session is protected by a short-lived, scoped token. Both agent API calls (via MCP/SDK) and human control streams (ASP) are unified within one gateway, enabling a consistent, enforceable security policy (Piao et al., 4 Dec 2025).
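One way to picture a short-lived, scoped session token is an HMAC-signed payload carrying a session ID, a scope list, and an expiry. The sketch below is an assumption for illustration only: the token format, the `SECRET` key, and the `issue_token` and `check_token` helpers are hypothetical, and the source does not describe how AgentBay actually encodes its tokens.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-not-for-production"  # hypothetical gateway signing key


def issue_token(session_id, scopes, ttl_s, now=None):
    """Create a signed token bound to one session, a scope list, and an expiry."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"sid": session_id, "scopes": sorted(scopes), "exp": now + ttl_s},
        sort_keys=True,
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def check_token(token, session_id, scope, now=None):
    """Accept only an untampered, unexpired token for this session and scope."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return (claims["sid"] == session_id
            and scope in claims["scopes"]
            and now < claims["exp"])
```

Because both the agent API and the human control stream pass through one gateway, a single check of this kind could in principle guard both paths; the scope names an operator would use (for example a hypothetical `"input"` scope) are not specified in the source.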

Security isolation is empirically validated: in controlled experiments, native baseline environments were fully compromised by recursive deletion (“rm -fr /”) and outbound data exfiltration (“curl”) attacks, whereas AgentBay’s sandbox showed zero host impact and blocked 100% of exfiltration attempts via default-deny networking (Piao et al., 4 Dec 2025).

2. Hybrid Control Interface

AgentBay allows a single, persistent session to be jointly driven by an LLM-based agent and a human operator without requiring session restarts or reconfiguration. Programmatic control is provided via the MCP REST/gRPC API and open-source SDKs (“click button X,” “navigate to URL Y,” “run shell command”), while real-time human control is delivered through a desktop stream (ASP) with mouse/keyboard event injection.

Session-level input multiplexing is implemented by an arbiter: agent inputs have priority by default, but any human interaction (mouse/key event) instantly grabs exclusive control, which persists until a subsequent period of agent-only activity is detected. All state, logs, screenshots, and video fragments from both controllers are aggregated within a central Session Context, permitting agents to resume precisely from where a human left off without context loss.
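The arbitration rule described above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the actual arbiter: the class name `InputArbiter` is hypothetical, and the release condition (an agent event arriving after `release_after_s` seconds with no human input) is one possible reading of "a period of agent-only activity".

```python
class InputArbiter:
    """Decides whether an input event from "agent" or "human" is applied."""

    def __init__(self, release_after_s=5.0):
        # release_after_s is an assumed tunable, not a documented AgentBay value.
        self.release_after_s = release_after_s
        self.owner = "agent"  # the agent drives the session by default
        self.last_human_event = None

    def submit(self, source, timestamp):
        """Return True if the event at `timestamp` (seconds) is accepted."""
        if source == "human":
            # Any human mouse/key event takes exclusive control immediately.
            self.owner = "human"
            self.last_human_event = timestamp
            return True
        # Agent event: rejected while the human holds control, unless the
        # human has been idle long enough for control to fall back.
        if self.owner == "human":
            if timestamp - self.last_human_event >= self.release_after_s:
                self.owner = "agent"
            else:
                return False
        return True


arbiter = InputArbiter(release_after_s=5.0)
print(arbiter.submit("agent", 0.0))  # True: agent has default priority
print(arbiter.submit("human", 1.0))  # True: human grabs control
print(arbiter.submit("agent", 2.0))  # False: human still owns the session
print(arbiter.submit("agent", 6.0))  # True: 5 s without human input
```

Keeping the decision in one place means the session itself never restarts when control changes hands; only the owner field flips.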

Empirical results indicate that a human takeover averages 15–30 s end to end, and that the agent recovers successfully in more than 95% of cases once it resumes. Asserting manual control is a separate and much faster step: human operators can take control with sub-50 ms latency, which minimizes context-switch friction and eliminates the need for new provisioning (Piao et al., 4 Dec 2025).

3. Adaptive Streaming Protocol (ASP): Formal Model and Techniques

ASP is engineered to provide ultra-low-latency, resilient, and bandwidth-adaptive streaming suitable for hybrid (human–agent) control. It uniquely blends a low-bandwidth, event-driven “command stream” with a video-like “frame stream,” dynamically mixing these based on measured bandwidth, network conditions, and current controller.

3.1 Rate Allocation Model

Let:

  • $C(t)$: command-stream data rate (key events)
  • $V(t)$: video-stream bitrate
  • $B(t)$: end-to-end available bandwidth
  • $\alpha(t) \in [0, 1]$: mixing factor (video fraction)

The protocol minimizes a combined distortion-latency objective:

$$\min_{\alpha} \; D(\alpha; V, C) + \lambda\, L(\alpha; V, C) \quad \text{s.t.} \quad \alpha V + (1 - \alpha)\, C \le B$$

where $\lambda$ controls the fidelity–responsiveness tradeoff (Piao et al., 4 Dec 2025).
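To make the constrained objective concrete, the sketch below solves it by grid search over the mixing factor. The source does not give closed forms for the distortion and latency terms, so the ones used here are placeholders chosen only for illustration: distortion is modeled as (1 − α)² (more video, less distortion) and latency as link utilization, rate / B. The function name `choose_alpha` is likewise hypothetical.

```python
def choose_alpha(V, C, B, lam, steps=100):
    """Grid-search the mixing factor alpha in [0, 1] under the rate constraint.

    V, C, B are the video bitrate, command rate, and available bandwidth
    (same units). Returns None if no alpha satisfies the constraint.
    """
    best_alpha, best_cost = None, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        rate = alpha * V + (1 - alpha) * C
        if rate > B:  # constraint: alpha*V + (1 - alpha)*C <= B
            continue
        distortion = (1 - alpha) ** 2  # placeholder D: falls as video share rises
        latency = rate / B             # placeholder L: link utilization
        cost = distortion + lam * latency
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha


# With lam = 0 only fidelity matters, so alpha rises until bandwidth binds.
print(choose_alpha(V=8.0, C=0.5, B=4.0, lam=0.0))   # 0.46
# A large lam penalizes latency heavily and pushes alpha to the command stream.
print(choose_alpha(V=8.0, C=0.5, B=4.0, lam=10.0))  # 0.0
```

In the first call the constraint caps α at (B − C) / (V − C) ≈ 0.467, which the 0.01 grid rounds down to 0.46.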

3.2 Region-of-Interest Encoding

The display frame is divided into $N$ ROIs. For each ROI $i$, an encoding mode $m_i \in \{\text{video}, \text{command}\}$ is selected. The aggregate bitrate is:

$$R = \sum_{i=1}^{N} \left[ \mathbb{I}(m_i = \text{video})\, r_i^V + \mathbb{I}(m_i = \text{command})\, r_i^C \right]$$

where

$$r_i^V = q_V \cdot \mathrm{size}(\Delta_i), \quad r_i^C = q_C \cdot \mathrm{size}(\Delta_i)$$

($\Delta_i$: pixel delta of ROI $i$; $q_V > q_C$: ROI-specific quantization factors).

The decoder reconstructs the UI by upsampling command strokes and overlaying video patches as needed.
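The aggregate bitrate for a frame follows directly from the per-ROI mode choice. The sketch below evaluates that sum for a hypothetical frame; the function name, the ROI sizes, and the quantization factors are made-up illustration values, not figures from the paper.

```python
def aggregate_bitrate(rois, q_v, q_c):
    """Sum per-ROI rates: q_v * size for video ROIs, q_c * size for command ROIs.

    rois is a list of (mode, delta_size) pairs, mode in {"video", "command"}.
    """
    total = 0.0
    for mode, delta_size in rois:
        if mode == "video":
            total += q_v * delta_size
        elif mode == "command":
            total += q_c * delta_size
        else:
            raise ValueError(f"unknown encoding mode: {mode!r}")
    return total


# Hypothetical frame: one video ROI and two command-encoded ROIs.
rois = [("video", 1200), ("command", 300), ("command", 50)]
print(aggregate_bitrate(rois, q_v=0.5, q_c=0.125))  # 643.75
```

Since q_V > q_C, switching an ROI from video to command mode always lowers its contribution for the same pixel delta, which is the lever the protocol uses to stay within bandwidth.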

3.3 Bandwidth Adaptation Strategies

One-way delay ($\tau$) and packet loss ($p$) are measured via periodic ping. The mixing factor is updated as:

$$\alpha(t+1) = \sigma\bigl[\kappa_1 (B(t) - R(t)) - \kappa_2 (\tau(t) - \tau_{\max}) - \kappa_3\, p(t)\bigr], \qquad \sigma(x) = (1 + \exp(-x))^{-1}$$

When a human is in control, $\lambda$ is adjusted to prioritize latency (since $\lambda$ weights the latency term $L$ in the objective of Section 3.1, this corresponds to a larger $\lambda$); under agent-only control, ASP increases visual fidelity (Piao et al., 4 Dec 2025).
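The update rule is a logistic squashing of three signals: bandwidth headroom raises the video fraction, while excess delay and packet loss lower it. A minimal sketch follows; the gain values used in the example are arbitrary assumptions, since the source does not report the κ coefficients.

```python
import math


def sigmoid(x):
    """Logistic function (1 + exp(-x))^-1."""
    return 1.0 / (1.0 + math.exp(-x))


def next_alpha(B, R, tau, p, tau_max, k1, k2, k3):
    """One step of the mixing-factor update.

    B: available bandwidth, R: current aggregate bitrate, tau: one-way delay,
    p: packet-loss fraction, tau_max: delay budget, k1..k3: tuning gains.
    """
    return sigmoid(k1 * (B - R) - k2 * (tau - tau_max) - k3 * p)


# Neutral point: no headroom, delay exactly at budget, no loss.
print(next_alpha(B=5.0, R=5.0, tau=0.05, p=0.0, tau_max=0.05,
                 k1=1.0, k2=10.0, k3=5.0))  # 0.5
```

At the neutral point the argument is zero, so the result is 0.5; adding headroom moves the output toward 1 (more video), and adding loss or delay moves it toward 0 (more command stream).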

4. Empirical Evaluation and Benchmarking

AgentBay’s system is evaluated across four dimensions: security isolation, HITL task completion, agent benchmarks, and ASP protocol performance.

4.1 Security

  • Vector A (recursive delete): native baseline—full compromise; AgentBay—no host compromise.
  • Vector B (data exfiltration): native baseline—full compromise; AgentBay—100% block (default-deny egress).

4.2 HITL Web Automation

Using Claude Sonnet 4.5 + ReAct, task success rates improved substantially in hybrid mode:

| Failure Mode | Agent-Only | Hybrid | Improvement |
|---|---|---|---|
| Floating-ad reading | 27% | 97% | +259% (relative) |
| CAPTCHA handling | 64% | 95% | +48% (relative) |
| Password input | 0% | 100% | +100 pp (absolute) |
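The improvement figures mix two measures: relative change for the first two rows and absolute change for the password row, where the agent-only baseline is 0% and a relative change is undefined. The short sketch below reproduces the arithmetic from the reported success rates; the helper name is hypothetical.

```python
def relative_improvement(before, after):
    """Relative change in percent, or None when the baseline is zero."""
    if before == 0:
        return None  # division by zero: fall back to absolute points
    return (after - before) / before * 100.0


print(round(relative_improvement(27, 97)))  # 259
print(round(relative_improvement(64, 95)))  # 48
print(relative_improvement(0, 100))         # None
print(100 - 0)                              # 100 percentage points (absolute)
```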

4.3 Open-Source Agent Benchmarks

  • SeeAct on Online-Mind2Web (14 Easy tasks): Physical machine 35.71%; AgentBay 42.86% (Δ +7.15 pp)

4.4 ASP vs. RDP Performance

| Metric | ASP | RDP | Relative Gain |
|---|---|---|---|
| Latency (click-to-photon, ms) | 117 | 122 | ≈4% lower |
| Stutter Rate (10% packet loss, 20 fps) | 16.45% | 96.16% | |
| Bandwidth (video playback, Mbps) | 4.6 | 10.2 | −55% |
| SSIM (0–1, browsing) | 0.833 | 0.827 | |

All improvements are statistically significant ($p < 0.01$, paired t-tests). This suggests that ASP enables both lower bandwidth and greater resilience under adverse network conditions compared to RDP (Piao et al., 4 Dec 2025).

5. Robustness, Design Insights, and Collaboration Efficiency

  • Isolation Guarantees: Multi-tenant VM isolation and default-deny networks prevent agent escape and host compromise.
  • Bandwidth Adaptation: Region-aware adaptive compression ensures interactivity when $B < 5$ Mbps; under pronounced packet loss, hybrid TCP+UDP transport in ASP outperforms legacy remoting.
  • Unified Session Model: Instant human intervention (<50 ms) and centralized session context minimize provisioning and synchronization costs. A plausible implication is increased operator efficiency and reduced task recovery time.
  • Takeover Dynamics: Takeover averages 15–30 s end to end, with >95% session recovery success, indicating the viability of seamless hybrid interaction.

6. Applications, Limitations, and Future Directions

Use cases for AgentBay include:

  • Human-signoff in enterprise RPA (finance, healthcare)
  • Secure code-generation within CI/CD pipelines
  • LLM-driven UX testing across mobile and web interfaces
  • Reinforcement learning (RL) environment scaffolding for game AI

Identified limitations:

  • VM startup overhead of approximately 3 s; dynamic prefetching is under consideration.
  • ASP operates at 30 fps; ongoing work targets 60 fps and WebRTC hybridization.
  • ROI classification relies on rule-based methods; machine-learned ROI prediction is proposed for improved QoS.

Future directions comprise dynamic pre-warmed VM pools, ML-based ROI prediction for optimized bandwidth allocation, integration of real-time audio streaming for voice-driven HITL, and development of open standards for hybrid-interaction protocols to foster broader adoption.

In summary, AgentBay’s layered architecture, hybrid control interface, and formally grounded adaptive streaming protocol establish a resilient, low-latency, and secure environment for seamless human–AI interaction, providing robust primitives for next-generation, mission-critical autonomous agent systems (Piao et al., 4 Dec 2025).
