AgentBay: Secure Hybrid Human-AI Sandbox
- AgentBay is a multi-tenant hybrid-interaction sandbox service that enables secure and seamless human-AI collaboration in mission-critical settings.
- It integrates high-fidelity virtualization, strict security isolation, and a unified control interface to ensure robust performance and rapid human intervention.
- Its core Adaptive Streaming Protocol dynamically balances command and video streams, achieving low latency and resilient performance under varying network conditions.
AgentBay is a multi-tenant, hybrid-interaction sandbox service that supports seamless human intervention in agentic systems. It provides a secure, isolated, and responsive execution environment for both AI agents and human operators in mission-critical applications. It is designed to address the brittleness of autonomous AI agents, particularly those powered by LLMs, by enabling real-time Human-in-the-Loop (HITL) oversight and intervention without interrupting the session or reprovisioning the environment. AgentBay combines high-fidelity virtualized environments, sandboxing primitives, a unified hybrid control interface, and its core innovation, the Adaptive Streaming Protocol (ASP), to deliver robust and efficient human–AI collaboration (Piao et al., 4 Dec 2025).
1. System Architecture and Security Isolation
AgentBay’s architecture is structured as a layered, multi-tenant service capable of instantiating Windows, Linux, Android (with containerized emulators), fully-featured Web Browsers, and a language-and-runtime “Code Space” interpreter as isolated execution environments. Isolation and containment are achieved via KVM-based virtual machines (VMs) or hardware-accelerated containers, each provisioned with:
- Private virtual networks (VPC) using default-deny egress/ingress rules
- Ephemeral private file systems, destroyed on teardown
- Seccomp system-call filtering and cgroups quotas on CPU, memory, and process counts
- TLS-tunneled access, enforced by a hardened gateway and Web Application Firewall (WAF)
The internal architecture comprises four layers:
- Interface Layer: Exposes Model Context Protocol (MCP), open-source SDKs (TypeScript, Python, Go), and the ASP client library.
- Service Layer: Manages sandbox lifecycle tasks, available tools (browser, emulator, code execution), and the streaming service for ASP.
- Environment Layer: Hosts mirrored OS images and interpreters for high-fidelity reproducibility.
- Feature Layer: Supports session management, context persistence, port mapping, file operations, command execution, and dynamic network configuration.
Each sandbox session is protected by a short-lived, scoped token. Both agent API calls (via MCP/SDK) and human control streams (ASP) are unified within one gateway, enabling a consistent, enforceable security policy (Piao et al., 4 Dec 2025).
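The idea of a short-lived, scoped session token checked at a single gateway can be sketched as follows. This is a minimal illustration that assumes an HMAC-signed payload; the function names (`issue_token`, `verify_token`), the payload fields, the scope strings, and the hard-coded secret are all hypothetical and do not describe AgentBay's actual token format.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # hypothetical key; a real gateway would use a managed secret


def issue_token(session_id, scopes, ttl_s, now=None):
    """Return a signed token binding a session to a set of scopes and an expiry."""
    now = time.time() if now is None else now
    payload = {"sid": session_id, "scopes": sorted(scopes), "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return body + b"." + sig


def verify_token(token, session_id, scope, now=None):
    """Accept only an untampered, unexpired token for this session and scope."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(b".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return False  # signature mismatch: reject before parsing the payload
    payload = json.loads(base64.urlsafe_b64decode(body))
    return (
        payload["sid"] == session_id
        and scope in payload["scopes"]
        and now < payload["exp"]
    )
```

Because both agent API calls and human control streams would present the same kind of token, a single check of this shape could enforce one policy for both paths.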
Security isolation is empirically validated: in controlled experiments, native baseline environments were fully compromised by recursive deletion (“rm -fr /”) and outbound data exfiltration (“curl”) attacks, whereas AgentBay’s sandbox showed zero host impact and blocked 100% of exfiltration attempts via default-deny networking (Piao et al., 4 Dec 2025).
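Default-deny networking means that traffic is dropped unless an explicit rule permits it. The sketch below illustrates that decision logic only; the function name `egress_allowed` and the rule format (CIDR, port) are assumptions for illustration, not AgentBay's policy engine.

```python
import ipaddress


def egress_allowed(dest_ip, dest_port, allow_rules):
    """Default-deny: permit traffic only if an explicit (CIDR, port) rule matches."""
    addr = ipaddress.ip_address(dest_ip)
    for cidr, port in allow_rules:
        if addr in ipaddress.ip_network(cidr) and dest_port == port:
            return True
    return False  # nothing matched, so the packet is dropped


# With an empty rule set, an exfiltration attempt to any host is refused.
print(egress_allowed("203.0.113.7", 443, allow_rules=[]))
```

Under this model, an exfiltration attempt such as the "curl" vector fails unless an operator has deliberately allowed the destination.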
2. Hybrid Control Interface
AgentBay allows a single, persistent session to be jointly driven by an LLM-based agent and a human operator without requiring session restarts or reconfiguration. Programmatic control is provided via the MCP REST/gRPC API and open-source SDKs (“click button X,” “navigate to URL Y,” “run shell command”), while real-time human control is delivered through a desktop stream (ASP) with mouse/keyboard event injection.
Session-level input multiplexing is implemented by an arbiter: agent inputs have priority by default, but any human interaction (mouse/key event) instantly grabs exclusive control, which persists until a subsequent period of agent-only activity is detected. All state, logs, screenshots, and video fragments from both controllers are aggregated within a central Session Context, permitting agents to resume precisely from where a human left off without context loss.
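The arbitration rule described above can be sketched as a small state machine. This is a simplified illustration: the class name `InputArbiter` and the `release_after_s` idle window (used here to model the point at which control returns to the agent) are assumptions, since the source does not specify how the release period is detected.

```python
class InputArbiter:
    """Routes input from two controllers; human input preempts the agent."""

    def __init__(self, release_after_s=2.0):
        self.release_after_s = release_after_s  # hypothetical idle window
        self.last_human_ts = None

    def controller(self, now):
        """Return which side currently holds control."""
        if (
            self.last_human_ts is not None
            and now - self.last_human_ts < self.release_after_s
        ):
            return "human"
        return "agent"

    def submit(self, source, event, now):
        """Return the event if it is accepted, or None if it is dropped."""
        if source == "human":
            self.last_human_ts = now  # any human event grabs control at once
            return event
        if self.controller(now) == "human":
            return None  # agent input is suppressed while the human is active
        return event


arb = InputArbiter(release_after_s=2.0)
print(arb.submit("agent", "click", now=0.0))  # accepted: agent has default priority
print(arb.submit("human", "key", now=1.0))    # accepted: human takes over
print(arb.submit("agent", "click", now=1.5))  # dropped while the human holds control
```

A production arbiter would also record each accepted event in the shared Session Context so that the agent can resume from the human's last state.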
Empirical results indicate that a complete human takeover, measured end to end, averages 15–30 s, and that the agent resumes successfully in more than 95% of cases. Asserting manual control itself takes under 50 ms, which minimizes context-switch friction and removes the need for new provisioning (Piao et al., 4 Dec 2025).
3. Adaptive Streaming Protocol (ASP): Formal Model and Techniques
ASP is engineered to provide ultra-low-latency, resilient, and bandwidth-adaptive streaming suitable for hybrid (human–agent) control. It blends a low-bandwidth, event-driven “command stream” with a video-like “frame stream,” mixing the two dynamically based on measured bandwidth, network conditions, and the current controller.
3.1 Rate Allocation Model
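Let:
- $R_c$: command-stream data rate (key events)
- $R_v$: video-stream bitrate
- $B$: end-to-end available bandwidth
- $\alpha \in [0, 1]$: mixing factor (video fraction)
The protocol minimizes a combined distortion-latency objective of the form $\min_{\alpha}\; D(\alpha) + \lambda\, L(\alpha)$, where $D$ denotes distortion, $L$ denotes latency, and $\lambda$ controls the fidelity–responsiveness tradeoff (Piao et al., 4 Dec 2025).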
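To make the tradeoff concrete, the sketch below searches for a video fraction that minimizes a weighted sum of distortion and latency under a bandwidth limit. The distortion model `(1 - alpha) ** 2`, the latency model (video rate divided by bandwidth), the `full_video_mbps` parameter, and the grid search itself are illustrative assumptions; the source does not give the functional forms ASP uses.

```python
def choose_mixing_factor(bandwidth_mbps, lam, full_video_mbps=10.0, steps=100):
    """Grid-search the video fraction alpha in [0, 1] minimizing D + lam * L."""
    best_alpha, best_cost = 0.0, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        video_rate = alpha * full_video_mbps
        if video_rate > bandwidth_mbps:
            break  # larger alpha values would exceed the available bandwidth
        distortion = (1.0 - alpha) ** 2          # assumed: more video, less distortion
        latency = video_rate / bandwidth_mbps    # assumed: link share used by video
        cost = distortion + lam * latency
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha


for lam in (0.5, 1.0, 2.0):
    print(lam, choose_mixing_factor(bandwidth_mbps=20.0, lam=lam))
```

In this toy model a larger weight on latency pushes the chosen video fraction down, which mirrors the qualitative behavior described for the fidelity–responsiveness tradeoff.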
3.2 Region-of-Interest Encoding
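The display frame is divided into ROIs. For each ROI $i$, an encoding mode $m_i$ is selected. The aggregate bitrate is $R = \sum_i r_i(m_i)$, where the per-ROI rate $r_i$ depends on $\Delta_i$ and $q_i$ ($\Delta_i$: pixel delta, $q_i$: ROI-specific quantization).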
The decoder reconstructs the UI by upsampling command strokes and overlaying video patches as needed.
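A per-ROI mode decision and the resulting aggregate bitrate might look like the sketch below. The threshold rule, the rate model (video rate proportional to pixel delta divided by quantization), and all constants are hypothetical stand-ins chosen for illustration; the source states only that each ROI gets an encoding mode and that its rate depends on pixel delta and quantization.

```python
from dataclasses import dataclass


@dataclass
class Roi:
    name: str
    pixel_delta: float  # fraction of pixels changed since the last frame, 0..1
    quant: float        # ROI-specific quantization step (larger means coarser)


def select_mode(roi, delta_threshold=0.2):
    """Rule-based choice: busy regions go to video, quiet ones to commands."""
    return "video" if roi.pixel_delta > delta_threshold else "command"


def roi_rate_kbps(roi, mode, command_kbps=5.0, video_scale_kbps=4000.0):
    """Assumed rate model: flat cost for commands, delta/quant scaling for video."""
    if mode == "command":
        return command_kbps
    return video_scale_kbps * roi.pixel_delta / roi.quant


def aggregate_rate_kbps(rois):
    """Sum the per-ROI rates under each ROI's selected mode."""
    return sum(roi_rate_kbps(r, select_mode(r)) for r in rois)


frame = [Roi("toolbar", 0.01, 1.0), Roi("video_player", 0.8, 2.0)]
print([select_mode(r) for r in frame], aggregate_rate_kbps(frame))
```

Raising the quantization step of a video ROI lowers its share of the total, which is the lever a bandwidth-adaptive encoder would pull first.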
3.3 Bandwidth Adaptation Strategies
One-way delay and packet loss are measured via periodic ping, and the mixing factor is updated from these measurements. When a human is in control, the mixing factor is reduced to prioritize latency; under agent-only control, ASP increases visual fidelity (Piao et al., 4 Dec 2025).
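One way such an adaptation loop could work is sketched below. The update rule, the congestion signal, and every parameter (`target_delay_ms`, `step`, `loss_weight`, `human_cap`) are assumptions made for illustration; the source does not specify the actual update used by ASP.

```python
def update_mixing_factor(alpha, delay_ms, loss, human_in_control,
                         target_delay_ms=100.0, step=0.1, loss_weight=1.0,
                         human_cap=0.5):
    """Lower the video fraction under congestion; cap it during human control."""
    # Positive when delay exceeds the target or packets are being lost.
    congestion = (delay_ms / target_delay_ms - 1.0) + loss_weight * loss
    alpha = alpha - step * congestion
    # Assumed policy: a tighter ceiling while a human drives, to favor latency.
    upper = human_cap if human_in_control else 1.0
    return min(max(alpha, 0.0), upper)


print(update_mixing_factor(0.5, delay_ms=200.0, loss=0.1, human_in_control=False))
print(update_mixing_factor(0.9, delay_ms=50.0, loss=0.0, human_in_control=True))
```

Calling this once per measurement interval yields a simple feedback loop: the video share shrinks as delay or loss rises and recovers when the network clears.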
4. Empirical Evaluation and Benchmarking
AgentBay’s system is evaluated across four dimensions: security isolation, HITL task completion, agent benchmarks, and ASP protocol performance.
4.1 Security
- Vector A (recursive delete): native baseline—full compromise; AgentBay—no host compromise.
- Vector B (data exfiltration): native baseline—full compromise; AgentBay—100% block (default-deny egress).
4.2 HITL Web Automation
Using Claude Sonnet 4.5 + ReAct, task success rates improved substantially in hybrid mode:
| Failure Mode | Agent-Only Success | Hybrid Success | Relative Improvement |
|---|---|---|---|
| Floating-ad reading | 27% | 97% | +259% |
| CAPTCHA handling | 64% | 95% | +48% |
| Password input | 0% | 100% | +100 pp (absolute; relative gain undefined from 0%) |
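The improvement column can be checked directly from the success rates. The short script below recomputes the relative gain for each row and the absolute gain in percentage points; the helper name `relative_improvement_pct` is illustrative. A relative gain cannot be computed from a 0% baseline, so only the absolute gain applies to the password-input row.

```python
def relative_improvement_pct(before, after):
    """Relative gain in percent; None when the baseline is zero."""
    if before == 0:
        return None
    return 100.0 * (after - before) / before


rows = {
    "Floating-ad reading": (27, 97),
    "CAPTCHA handling": (64, 95),
    "Password input": (0, 100),
}
for name, (agent_only, hybrid) in rows.items():
    rel = relative_improvement_pct(agent_only, hybrid)
    rel_text = "undefined" if rel is None else f"{rel:+.0f}%"
    print(f"{name}: {hybrid - agent_only:+d} pp absolute, {rel_text} relative")
```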
4.3 Open-Source Agent Benchmarks
- SeeAct on Online-Mind2Web (14 Easy tasks): Physical machine 35.71%; AgentBay 42.86% (Δ +7.15 pp)
4.4 ASP vs. RDP Performance
| Metric | ASP | RDP | Relative Gain |
|---|---|---|---|
| Latency (click-to-photon, ms) | 117 | 122 | ≈4% lower |
| Stutter Rate (10% packet loss, 20 fps) | 16.45% | 96.16% | — |
| Bandwidth (video playback, Mbps) | 4.6 | 10.2 | −55% |
| SSIM (0–1, browsing) | 0.833 | 0.827 | — |
All improvements are reported as statistically significant (paired t-tests). This suggests that ASP achieves both lower bandwidth use and greater resilience under adverse network conditions than RDP (Piao et al., 4 Dec 2025).
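For reference, the relative differences implied by the table values can be computed as below, taking RDP as the baseline. This is plain arithmetic on the reported numbers; the helper name `relative_change_pct` is illustrative.

```python
def relative_change_pct(asp, rdp):
    """Percent change of the ASP value relative to the RDP value."""
    return 100.0 * (asp - rdp) / rdp


metrics = {
    "latency_ms": (117.0, 122.0),
    "stutter_rate_pct": (16.45, 96.16),
    "bandwidth_mbps": (4.6, 10.2),
    "ssim": (0.833, 0.827),
}
for name, (asp, rdp) in metrics.items():
    print(f"{name}: {relative_change_pct(asp, rdp):+.1f}%")
```

On these figures the largest gaps are in stutter rate and bandwidth, while latency and SSIM differ by only a few percent or less.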
5. Robustness, Design Insights, and Collaboration Efficiency
- Isolation Guarantees: Multi-tenant VM isolation and default-deny networks prevent agent escape and host compromise.
- Bandwidth Adaptation: Region-aware adaptive compression preserves interactivity when available bandwidth is low; under pronounced packet loss, the hybrid TCP+UDP transport in ASP outperforms legacy remoting.
- Unified Session Model: Instant human intervention (<50 ms) and centralized session context minimize provisioning and synchronization costs. A plausible implication is increased operator efficiency and reduced task recovery time.
- Takeover Dynamics: Takeover averages 15–30 s end to end, with a session recovery success rate above 95%, indicating the viability of seamless hybrid interaction.
6. Applications, Limitations, and Future Directions
Use cases for AgentBay include:
- Human-signoff in enterprise RPA (finance, healthcare)
- Secure code-generation within CI/CD pipelines
- LLM-driven UX testing across mobile and web interfaces
- Reinforcement learning (RL) environment scaffolding for game AI
Identified limitations:
- VM startup overhead of approximately 3 s; dynamic prefetching is under consideration.
- ASP operates at 30 fps; ongoing work targets 60 fps and WebRTC hybridization.
- ROI classification relies on rule-based methods; machine-learned ROI prediction is proposed for improved QoS.
Future directions comprise dynamic pre-warmed VM pools, ML-based ROI prediction for optimized bandwidth allocation, integration of real-time audio streaming for voice-driven HITL, and development of open standards for hybrid-interaction protocols to foster broader adoption.
In summary, AgentBay’s layered architecture, hybrid control interface, and formally grounded adaptive streaming protocol establish a resilient, low-latency, and secure environment for seamless human–AI interaction, providing robust primitives for next-generation, mission-critical autonomous agent systems (Piao et al., 4 Dec 2025).