ceLLMate: BUA Sandbox for Browser Security

Updated 21 December 2025

ceLLMate is a browser-level sandboxing framework that implements least-privilege security at the HTTP layer to protect browser-using agents (BUAs) from prompt injection attacks.
It registers agent capabilities via sitemaps and uses LLM-driven policy selection to map low-level UI events to high-level semantic actions.
Evaluations report domain prediction accuracy of at least 93%, end-to-end overhead of roughly 7–15% for 100–300 lookup entries, and deterministic blocking of out-of-policy web actions.

ceLLMate is a browser-level sandboxing framework designed for Browser-Using Agents (BUAs)—autonomous agents capable of interacting with web browsers through human-like operations such as clicking, scrolling, filling forms, and navigating web pages. BUAs, including recent systems such as Gemini-CUA, OpenAI Atlas, and Perplexity Comet, automate repetitive and complex online tasks by leveraging perception of web content (e.g., screenshots, DOM elements) and issuing low-level user interface (UI) commands. However, this operational model exposes BUAs to severe prompt injection attacks, as well as vulnerabilities arising from excessive ambient privilege. ceLLMate enforces least-privilege security policies at the network (HTTP) layer, providing deterministic guardrails for agent actions independent of model-level defenses, and is implemented as an agent-agnostic Chrome browser extension (Meng et al., 14 Dec 2025).

1. Security Rationale and Threat Model

BUAs operate in the authenticated context of a user, thus inheriting full ambient privilege—any UI action (e.g., a click) triggered by the agent is executed with the same authority as the user. This creates a critical security problem: prompt injection attacks. Here, malicious input injected into web content (such as issue descriptions or user reviews) can manipulate the LLM underpinning the agent, causing it to leak sensitive data, execute unintended state-changing commands, or exfiltrate session tokens. Furthermore, direct policy enforcement over UI-level events is fundamentally brittle due to the semantic gap: the mapping between low-level UI interactions and meaningful application actions is neither stable nor robust across page layouts or navigation states.

ceLLMate's primary security goals are:

Restricting the set of web services and semantic actions BUAs can invoke.
Blocking or conditionally permitting out-of-scope HTTP requests.
Automating policy selection based on natural-language task descriptions, minimizing user configuration overhead.

By instrumenting the HTTP layer, ceLLMate shifts from model-centric mitigation (which often leads to arms races) toward system-level enforcement analogous to process sandboxing in traditional operating systems.

2. Architectural Overview

ceLLMate's architecture is organized into three primary phases: registration of capabilities, policy selection, and network-level enforcement.

1. Registration:

Web application developers publish, at a canonical URL on their domain, an "Agent Sitemap"—a JSON artifact mapping HTTP request patterns to high-level semantic actions—and a set of pre-defined policies. Each policy specifies an effect (allow, deny, condition) over particular semantic actions, with possible parameters (e.g., value bounds).
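
As a rough illustration of what such a registration artifact might contain, the sketch below writes a hypothetical agent sitemap and policy set as JavaScript object literals. The field names (`actions`, `method`, `urlPattern`, `effect`, `params`), the `shop.example` domain, the URL patterns, and every action or policy name other than AddToCart, PlaceOrder, and purchase_amount_leq are assumptions made for illustration; the actual schema is not specified in this summary.

```javascript
// Hypothetical agent sitemap: maps HTTP request patterns to semantic actions.
// Field names and URL patterns are illustrative assumptions, not the real schema.
const agentSitemap = {
  domain: "shop.example",
  actions: [
    { name: "ViewProduct", method: "GET", urlPattern: "^/products/[^/]+$" },
    { name: "AddToCart", method: "POST", urlPattern: "^/cart/items$" },
    { name: "PlaceOrder", method: "POST", urlPattern: "^/orders$" },
  ],
};

// Hypothetical pre-defined policies: an effect over a set of semantic actions,
// optionally with developer-supplied parameters such as a value bound.
const policies = [
  { name: "browse_catalog", effect: "allow", actions: ["ViewProduct"] },
  { name: "manage_cart", effect: "allow", actions: ["AddToCart"] },
  {
    name: "purchase_amount_leq",
    effect: "condition",
    actions: ["PlaceOrder"],
    params: { maxAmount: 50 },
  },
];

// Sanity check a publisher might run: every action named by a policy
// should be declared in the sitemap. Returns the names that are not.
function undeclaredActions(sitemap, policySet) {
  const declared = new Set(sitemap.actions.map((a) => a.name));
  return policySet.flatMap((p) => p.actions).filter((a) => !declared.has(a));
}
```

A consistency check of this kind is one way to catch policies that refer to actions the sitemap cannot recognize, since such policies could never match a request.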

2. Policy Selection:

Given a user's free-form task description $T$, ceLLMate proceeds as follows:

Domain Prediction: Identify the web domains the agent will operate on.
Sitemap and Policy Retrieval: For each domain, fetch the agent sitemap and policy set.
Policy Minimization: Use an LLM with chain-of-thought prompting to determine the minimal set of policies sufficient for $T$; extract parameters for conditional policies as needed.
User Confirmation: Aggregate and present the composite policy for user confirmation/adjustment.
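
The four steps above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the three callbacks (`predictDomains`, `fetchRegistration`, `minimizePolicies`) are hypothetical stand-ins for the LLM and network calls, which in a real system would be asynchronous, and the keyword-matching stubs only imitate what an LLM would decide.

```javascript
// Sketch of the selection pipeline. The callbacks stand in for LLM and
// network calls; their names and signatures are hypothetical.
function buildCompositePolicy(task, { predictDomains, fetchRegistration, minimizePolicies }) {
  const composite = [];
  for (const domain of predictDomains(task)) {                  // 1. domain prediction
    const { sitemap, policies } = fetchRegistration(domain);     // 2. sitemap and policy retrieval
    const chosen = minimizePolicies(task, sitemap, policies);    // 3. policy minimization
    for (const policy of chosen) composite.push({ domain, ...policy });
  }
  return composite; // 4. would be presented to the user for confirmation
}

// Toy stubs so the sketch runs without an LLM or network access.
const stubs = {
  predictDomains: (task) => (/gitlab/i.test(task) ? ["gitlab.example"] : []),
  fetchRegistration: () => ({
    sitemap: { actions: [] },
    policies: [
      { name: "comment_issue", effect: "allow", actions: ["CommentIssue"] },
      { name: "create_token", effect: "allow", actions: ["CreateToken"] },
    ],
  }),
  // Keyword match on the first word of the policy name (a crude LLM stand-in).
  minimizePolicies: (task, _sitemap, policies) =>
    policies.filter((p) => task.toLowerCase().includes(p.name.split("_")[0])),
};

const composite = buildCompositePolicy("Comment on the GitLab issue about login", stubs);
```

With these stubs the composite policy contains only `comment_issue` for `gitlab.example`, so a token-creation action would fall outside the selected policy set.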

3. Enforcement:

A Chrome extension intercepts outbound HTTP(S) requests from the agent's browser session. Each request is mapped—through the agent sitemap—to a semantic action, which is then evaluated against the composite policy. Requests are allowed, denied, or conditionally evaluated according to the policy table.

3. Semantic Gap, Policy Model, and Formalism

Semantic Gap

Let $E$ denote the set of low-level UI events (clicks, keystrokes, scrolls), $H$ the space of HTTP messages, and $A$ the set of semantic actions (e.g., AddToCart, PlaceOrder). The mapping resolves as follows:

$f: E \rightarrow H$, associating each event with its resultant HTTP request.
$\sigma: H \rightarrow A$, mapping an HTTP message to its semantic action per the agent sitemap.
The composite mapping $\sigma \circ f: E \rightarrow A$ provides a stable semantic reference for policy application.

Policies $p$ are defined as functions $p: A \rightarrow \{\text{allow}, \text{deny}, \text{condition}\}$. Enforcement for an event $e$ proceeds via $h = f(e)$, $a = \sigma(h)$, $\text{decision} = p(a)$.
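
The enforcement chain can be sketched in a few lines. Since the browser itself performs $f$ (a click produces a request), the sketch starts from the HTTP message $h$. The sitemap entries, the `shop.example` URLs, and the choice to deny any request that matches no sitemap entry are assumptions for illustration, not details confirmed by the source.

```javascript
// sigma: H -> A, as a lookup over hypothetical sitemap entries.
const sitemapEntries = [
  { action: "AddToCart", method: "POST", pattern: /^\/cart\/items$/ },
  { action: "PlaceOrder", method: "POST", pattern: /^\/orders$/ },
];

function sigma(request) {
  const path = new URL(request.url).pathname;
  const entry = sitemapEntries.find(
    (e) => e.method === request.method && e.pattern.test(path)
  );
  return entry ? entry.action : null; // null: request not covered by the sitemap
}

// p: A -> {allow, deny, condition}. Actions without an entry default to
// "deny" (an assumption made here in the spirit of least privilege).
const policyTable = { AddToCart: "allow", PlaceOrder: "condition" };

function p(action) {
  return Object.prototype.hasOwnProperty.call(policyTable, action)
    ? policyTable[action]
    : "deny";
}

// decision = p(sigma(h)) for an observed HTTP message h.
function decide(request) {
  return p(sigma(request));
}
```

Because the decision depends only on the request, two different buttons that issue the same request receive the same verdict, which is the stability property the composite mapping is meant to provide.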

Policy Abstraction

Policies are represented as tuples:

$P = (\text{Name}, \text{Effect}, \text{Actions}, \text{Cond?}, \text{Args?})$

$\text{Effect} \in \{\text{allow}, \text{deny}, \text{condition}\}$
$\text{Actions} \subseteq A$
$\text{Cond}$ is an optional predicate $c(\text{params}, \text{args}) \to \{\text{true}, \text{false}\}$
$\text{params}$ are developer-supplied parameters; $\text{args}$ are run-time arguments extracted from the DOM or HTTP request

Example: a policy restricting Amazon purchases to at most \$50 has Name: purchase_amount_leq, Effect: condition, Actions: {PlaceOrder}, params: {"maxAmount": 50}, args: {"totalAmount": $t$}, with $c(\text{params}, \text{args}) \equiv (\text{args.totalAmount} \leq \text{params.maxAmount})$.
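
The purchase-limit example can be written out as a small evaluator. This is a sketch only: the object layout and the `evaluate` helper are hypothetical, and treating a missing or non-numeric `totalAmount` as a denial is an assumption (fail closed), not behavior stated by the source.

```javascript
// The purchase_amount_leq policy from the example, as a plain object.
const purchaseAmountLeq = {
  name: "purchase_amount_leq",
  effect: "condition",
  actions: ["PlaceOrder"],
  params: { maxAmount: 50 },
  // c(params, args) = args.totalAmount <= params.maxAmount
  cond: (params, args) => args.totalAmount <= params.maxAmount,
};

// Evaluate one policy against a semantic action and its run-time arguments.
function evaluate(policy, action, args = {}) {
  if (!policy.actions.includes(action)) return "not_applicable";
  if (policy.effect !== "condition") return policy.effect;
  // Anything other than a strict `true` is a denial, so a missing
  // totalAmount (undefined <= 50 is false) fails closed.
  return policy.cond(policy.params, args) === true ? "allow" : "deny";
}
```

For instance, `evaluate(purchaseAmountLeq, "PlaceOrder", { totalAmount: 42.5 })` yields `"allow"`, while a total of 120 or an absent total yields `"deny"`.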

Automated Policy Selection

Policy selection is modeled as a minimal cover problem: for a task $T$ and domain $D$ with policy set $P_D$, select the smallest $S \subseteq P_D$ that enables $T$. This is performed via LLM prompting. The practical inference pipeline consists of policy retrieval, prompt formatting, LLM prediction of policies and parameters, and composition of the final policy. Full algorithmic details are given as pseudocode in the source paper (Meng et al., 14 Dec 2025).
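
To make the minimal cover objective concrete, the sketch below solves it by exhaustive search. This is not how ceLLMate selects policies (the source states that selection is done by LLM prompting); it assumes the set of semantic actions required by $T$ is already known, which is exactly the part the LLM has to infer. The policy and action names are hypothetical, and the search is exponential in $|P_D|$, so it only suits small policy sets.

```javascript
// Enumerate all subsets of `items` with exactly k elements.
function* subsetsOfSize(items, k, start = 0, prefix = []) {
  if (prefix.length === k) {
    yield prefix;
    return;
  }
  for (let i = start; i < items.length; i++) {
    yield* subsetsOfSize(items, k, i + 1, [...prefix, items[i]]);
  }
}

// Smallest S within policySet whose actions cover every required action.
// Tries subset sizes in increasing order, so the first hit is minimal.
function minimalCover(requiredActions, policySet) {
  for (let k = 0; k <= policySet.length; k++) {
    for (const subset of subsetsOfSize(policySet, k)) {
      const covered = new Set(subset.flatMap((p) => p.actions));
      if (requiredActions.every((a) => covered.has(a))) return subset;
    }
  }
  return null; // no subset of the policy set enables the task
}

// Hypothetical policy set P_D for a retail domain.
const retailPolicies = [
  { name: "browse_catalog", actions: ["ViewProduct", "SearchProducts"] },
  { name: "manage_cart", actions: ["AddToCart", "RemoveFromCart"] },
  { name: "checkout", actions: ["PlaceOrder"] },
];

const selected = minimalCover(["SearchProducts", "AddToCart"], retailPolicies);
```

Here `selected` contains `browse_catalog` and `manage_cart` but not `checkout`, so placing an order stays outside the granted privileges.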

4. Implementation as a Chrome Extension

The core enforcement mechanism is a Chrome browser extension comprising:

background.js: The central policy engine. Loads the composite policy as an in-memory lookup table indexed by URL, method, and request body (from the agent sitemap). It intercepts HTTP(S) requests using chrome.webRequest.onBeforeRequest, matches requests to semantic actions, and enforces policy decisions. For conditional policies, it synchronously executes the corresponding JavaScript predicate.
content_script.js: Injected into relevant pages to retrieve dynamic DOM values (such as cart totals) via specified CSS selectors, reporting them to background.js for context-aware policy enforcement.
Popup UI: Presents policy decisions for user review, displaying policy names, descriptions, and instantiated parameters.
State Management: Maintains a cache of dynamic arguments keyed by URL, with updates triggered by DOM mutations or navigation events. Session tokens bind agent tabs to the applicable policy set.
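
The interplay between the lookup table, the dynamic-argument cache, and the request handler might look roughly like the sketch below. It is written as plain functions so that it runs outside a browser; the table keys, the `handleRequest` and `onContentScriptReport` names, and the fail-closed handling of missing arguments are assumptions. In an extension, a function of this shape would be registered through `chrome.webRequest.onBeforeRequest.addListener` with the blocking option, whose availability depends on the manifest version.

```javascript
// In-memory lookup table keyed by "METHOD path" (hypothetical key format).
const lookup = new Map([
  ["POST /cart/items", { action: "AddToCart", effect: "allow" }],
  ["POST /orders", {
    action: "PlaceOrder",
    effect: "condition",
    params: { maxAmount: 50 },
    cond: (params, args) => args.totalAmount <= params.maxAmount,
  }],
]);

// Cache of dynamic arguments keyed by page URL, filled from values the
// content script reads out of the DOM (for example a cart total).
const argsCache = new Map();

function onContentScriptReport(pageUrl, args) {
  argsCache.set(pageUrl, args);
}

// Returns an object shaped like a webRequest blocking response:
// { cancel: true } drops the request, { cancel: false } lets it through.
function handleRequest(details, pageUrl) {
  const key = `${details.method} ${new URL(details.url).pathname}`;
  const entry = lookup.get(key);
  if (!entry || entry.effect === "deny") return { cancel: true };
  if (entry.effect === "allow") return { cancel: false };
  // Conditional policy: run the predicate on cached arguments and
  // fail closed when no arguments have been reported for this page.
  const args = argsCache.get(pageUrl);
  const ok = args !== undefined && entry.cond(entry.params, args) === true;
  return { cancel: !ok };
}
```

Keeping the predicate evaluation synchronous, as the description above notes, means the verdict is available before the request leaves the browser.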

5. Evaluation: Policy Selection, Performance, Security

Policy Selection and Instantiation

A curated benchmark based on WebBench tasks across three domains (retail—Amazon/eBay, travel—Airbnb/Expedia, version control—GitHub/GitLab) was constructed with minimal policy labelings. Three LLMs (GPT-5.1, Gemini-2.5 Pro, Claude-Opus-4-5) were evaluated:

Domain prediction accuracy: $\geq 93\%$ without explicit reference.
Policy selection accuracy (no domain hints): 94–99%.
Policy selection accuracy (with brief guidance): 97–100%.
Argument extraction accuracy: 80–96% raw; 100% with explicit cues.

Performance Overhead

End-to-end timing of a standard 11-step GitLab navigation script, driven by Playwright, shows overhead growing roughly linearly with the number of lookup entries:

Lookup entries ($n$)	Baseline	With ceLLMate	Overhead
0	13.93 s	13.93 s	0%
100	—	14.94 s	+7.2%
200	—	15.35 s	+10.1%
300	—	16.02 s	+15.0%
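
The overhead column is the relative slowdown against the 13.93 s baseline, $(t_n - t_0)/t_0$. The short check below recomputes it from the timings in the table; the variable and function names are illustrative. The recomputed values agree with the reported column to within about 0.1 percentage point, the small differences being attributable to rounding.

```javascript
// Timings copied from the table above (seconds) with the reported overhead (%).
const baselineSeconds = 13.93;
const measurements = [
  { entries: 100, seconds: 14.94, reported: 7.2 },
  { entries: 200, seconds: 15.35, reported: 10.1 },
  { entries: 300, seconds: 16.02, reported: 15.0 },
];

// Relative overhead in percent: (t_n - t_0) / t_0 * 100.
function overheadPercent(base, withSandbox) {
  return ((withSandbox - base) / base) * 100;
}

const recomputed = measurements.map((m) => ({
  ...m,
  overhead: overheadPercent(baselineSeconds, m.seconds),
}));

for (const row of recomputed) {
  console.log(`${row.entries} entries: ${row.overhead.toFixed(2)}% (reported ${row.reported}%)`);
}
```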

Memory overhead was approximately 25 MB—negligible relative to standard browser workloads.

Security Effectiveness

HTTP-layer enforcement ensures that attacks attempting disallowed actions are deterministically blocked. For instance, when a malicious GitLab issue instructs the agent to create a personal access token (PAT), the resulting request is dropped under a "comment_issue"-only policy, reducing the attack success rate to 0%. Likewise, high-value purchases triggered by prompt-injected reviews are blocked under a conditional purchase policy.

6. Design Insights, Limitations, and Prospects

Key insights include:

HTTP-layer mediation bridges the semantic gap in UI-driven automation, guaranteeing policy enforcement despite page or UI variation.
Agent sitemaps act as durable "API documentation for agents," enabling stable semantic abstraction for policy authors.
LLM-based policy selection enables adaptive, least-privilege default policies with minimal end-user intervention.

Identified limitations and planned future work involve:

Interception of non-HTTP channels (notably WebSockets) to extend applicability to real-time apps.
Introduction of stateful policies (e.g., enforcing limits over multiple actions) via local state.
Fresh argument enforcement via browser lockout middleware, avoiding race conditions in dynamic UIs.
Generalization to multi-turn or concurrent agent workflows, beyond the current single-threaded, single-turn paradigm.
Automated sitemap generation leveraging framework-specific tools for extracting REST endpoints and selectors.

In summary, ceLLMate is presented as the first system-level sandbox for BUAs, integrating agent sitemaps, pre-defined semantic policies, LLM-driven policy minimization, and browser extension-based HTTP enforcement. It provides deterministic, least-privilege guardrails that block out-of-policy actions induced by prompt injection, at a measured end-to-end overhead of roughly 7–15% for 100–300 lookup entries (Meng et al., 14 Dec 2025).

References

Meng et al. (14 Dec 2025). ceLLMate: Sandboxing Browser AI Agents.
