Browser-Using Agents (BUAs)

Updated 17 September 2025

Browser-Using Agents are autonomous or semi-autonomous entities that interact with browsers to automate navigation, data manipulation, and interactive workflows.
They integrate LLM-based reasoning with hybrid API and service-oriented architectures to simulate human browsing and enhance task execution.
BUAs support applications in authentication, UX testing, privacy mediation, and security while addressing challenges like prompt injection and identity ambiguity.

A Browser-Using Agent (BUA) is an autonomous or semi-autonomous entity that interacts with web browsers or browser-like interfaces to automate, simulate, or mediate tasks involving web content, navigation, data manipulation, and interactive workflows. BUAs cover a broad spectrum: from agents invoked for personalized local services in browser environments, to LLM-driven tools capable of multi-step, humanlike browsing and task execution that integrates modern interface processing, API interaction, real-time data access, and even human-in-the-loop cooperation. This category encompasses architectures and methods that treat the browser and web interface as programmatically accessible substrates for agentic reasoning, control, and delegation, supporting use cases ranging from secure authentication and usability testing to privacy mediation and adversarial robustness.

1. Architectural Principles and Core Components

BUAs implement diverse architectural paradigms but routinely share abstractions that accommodate the complexity and dynamics of web environments:

Service-Oriented Invocation via Brokers: Some BUA architectures enable dynamic discovery and invocation of "personal services"—local HTTP services providing functionality such as authentication or content handling—through a mediator called a Broker agent. The Broker maintains a registry, performs attribute-based resolution akin to "Yellow Pages" or "White Pages," and manages opaque handles for endpoint resolution. This enables web service providers (SPs) to request services from the user's environment using newly defined HTTP status codes (e.g., 310, 311, 312, 313), without revealing network details or requiring persistent plugins (Zúquete et al., 2019).
Client–Server Instrumentation: Platforms for behavioral data collection deploy browser extensions (e.g., Chrome extensions) that inject code to listen for user events (keystrokes, mouse interactions) and send structured logs to back-end servers. Data is structured as time-stamped events and transferred through browser-restricted messaging interfaces, with server-side relational or NoSQL storage for experiment-driven replay and analysis (Fan, 2019).
LLM-Based Reasoning and Interface Parsing: New BUA implementations incorporate LLMs as decision engines, interfacing with the browser through modular connectors that translate raw HTML or accessibility trees into compressed, semantically meaningful representations. This facilitates rapid, modular interpretation of the environment and output of action commands (click, fill, submit, etc.), blurring the boundary between AI planning and traditional automation (Lu et al., 18 Feb 2025, Wang et al., 13 Apr 2025).
Action Spaces, API Hybridization, and Abstraction: Some recent frameworks expand the BUA model to support API-based and hybrid interaction paradigms, dynamically switching between simulated human-like browsing and direct API invocation. Hybrid agents weigh high-level browsing actions against explicit API calls, choosing between them through context-sensitive policies and performance thresholds (Song et al., 2024, Lù et al., 12 Jun 2025).
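
The switching behavior described for hybrid agents can be illustrated with a minimal sketch. The cited frameworks do not specify their policies here, so the `ApiEndpoint` type, the `success_rate` estimate, and the `min_success` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiEndpoint:
    """Hypothetical record of an API the agent could call instead of browsing."""
    name: str
    success_rate: float  # assumed running estimate in [0, 1]


def choose_mode(endpoint: Optional[ApiEndpoint], min_success: float = 0.8) -> str:
    """Return "api" when a sufficiently reliable endpoint exists, else "browse".

    The threshold is an illustrative assumption, not a value from the literature.
    """
    if endpoint is not None and endpoint.success_rate >= min_success:
        return "api"
    return "browse"
```

Under these assumptions, a task with no known endpoint, or with an endpoint that has been failing, falls back to simulated browsing, while a reliable endpoint is called directly.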

2. Discovery, Invocation, and Mediation of Local Services

A canonical early architecture for BUA capability involved a two-tiered system:

Brokers as Gatekeepers and Service Name Resolvers: Brokers maintain JSON-based registries of available personal services, supporting attribute-based discovery (yellow pages) and precise handle resolution (white pages). Service handles are opaque tokens, generated and encrypted by the Broker, that abstract the actual network endpoint. The Broker launches services as needed when they are invoked and enforces access controls, gatekeeping which SPs can invoke which services.
Communication Flow and Protocol Extensions: The protocol relies on redirection via new HTTP status codes. An SP initiates a request with an attribute query; the browser consults the Broker, which resolves the query, possibly spawns a local service, and returns an invocation handle. The browser then redirects to the service endpoint, maintaining security boundaries and endpoint secrecy. The architecture, demonstrated for eID-based authentication, obviates the need for persistent plugins and allows safe, dynamic invocation of local logic (Zúquete et al., 2019).
Advantages and Tradeoffs:
- Decouples browser-internal logic from functional extensions.
- Supports dynamic, attribute-based service lookup and flexible instantiation.
- Imposes protocol and deployment complexity; requires reliable Broker/proxy presence per user.
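
The discovery and resolution roles of the Broker can be sketched as follows. This is a simplified illustration, not the protocol of Zúquete et al.: the class and method names are hypothetical, random tokens from `secrets` stand in for the encrypted handles, and the HTTP redirection step and service launching are omitted.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Service:
    endpoint: str                # local network endpoint, never shown to the SP
    attributes: Dict[str, str]   # used for attribute-based discovery
    allowed_sps: Set[str] = field(default_factory=set)  # access control list


class Broker:
    """Hypothetical in-memory registry with opaque handles."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}
        self._handles: Dict[str, str] = {}  # opaque handle -> service name

    def register(self, name: str, service: Service) -> None:
        self._services[name] = service

    def discover(self, sp: str, **query: str) -> List[str]:
        """Yellow pages: match attributes and return opaque handles only."""
        handles = []
        for name, svc in self._services.items():
            if sp not in svc.allowed_sps:
                continue  # this SP may not invoke the service
            if all(svc.attributes.get(k) == v for k, v in query.items()):
                handle = secrets.token_urlsafe(16)  # stand-in for an encrypted token
                self._handles[handle] = name
                handles.append(handle)
        return handles

    def resolve(self, handle: str) -> Optional[str]:
        """White pages: map a handle back to the concrete endpoint."""
        name = self._handles.get(handle)
        return self._services[name].endpoint if name else None
```

In this sketch an SP only ever sees the handle returned by `discover`; the endpoint is revealed by `resolve`, which in the described architecture would happen on the user's side of the redirection.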

3. Integration with AI and LLM-Driven Agents

Simulation of User Behavior: LLM-powered BUAs generate actions in sequential loops based on structured interface observations, simulating human browsing for tasks such as usability testing and A/B experiments. Key modules convert the DOM into simplified trees, route observations through agentic memory streams, and interleave rapid response ("Fast Loop") with deeper, reflective reasoning ("Slow Loop"), a structure informed by cognitive architectures (Lu et al., 18 Feb 2025, Wang et al., 13 Apr 2025).
Persona Modeling and Scalable Experimentation: Persona generators synthesize agent diversity, enabling controlled studies of design variants, user flows, and behavioral patterns at scale, with outputs ranging from quantitative logs to video replays and natural-language rationales.
Human-in-the-Loop Interaction: Conceptual frameworks now emphasize continual, cooperative loops between user, LLM reasoning module, and agentic execution—translating ambiguous goals into decomposed, iteratively refined tasks. Such frameworks distinguish exploration (information gathering) from exploitation (analysis/synthesis), with the agent proposing proactive action modules that are steered via user feedback, aligning with real-world browsing mental models and minimizing cognitive load (Yun et al., 15 Sep 2025).
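
The step of converting a page into a compact observation for the LLM can be sketched with the standard-library HTML parser. The output format (one indexed line per interactive element) and the choice of tags and label attributes are assumptions for illustration; the cited systems use their own representations.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Assumed set of elements an agent can act on.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}


class Simplifier(HTMLParser):
    """Collects one indexed line per interactive element, dropping other markup."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []
        self._open: Optional[int] = None  # line currently collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            a = dict(attrs)
            label = a.get("aria-label") or a.get("placeholder") or a.get("name") or ""
            self.lines.append(f"[{len(self.lines)}] <{tag}> {label}".rstrip())
            # <input> is a void element, so it has no inner text to collect
            self._open = len(self.lines) - 1 if tag != "input" else None

    def handle_data(self, data):
        text = data.strip()
        if text and self._open is not None:
            self.lines[self._open] += f" {text}"

    def handle_endtag(self, tag):
        if tag in INTERACTIVE:
            self._open = None


def simplify(html: str) -> str:
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

The indices give the model a stable way to refer to elements in its action commands (for example, "click 1"), which is one common design choice rather than the only one.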

4. Security, Adversarial Robustness, and Privacy Concerns

Systemic Vulnerabilities: BUAs dramatically expand the attack surface compared to traditional browsers. Threats include prompt injection (standard and visual/ambient), task-aligned injection, domain validation bypass, credential exfiltration, UI-level deception (clickjacking), and remote code execution via chained command sequences. Many vulnerabilities traverse architectural layers, propagating from UI or DOM manipulation up through agent planning and execution (Mudryi et al., 19 May 2025, Shapira et al., 8 Jun 2025, Jones et al., 7 Jul 2025, Cao et al., 3 Jun 2025).
Taxonomy of Risk Classes:
- UI deception and perceptual mismatch: Misleading overlays, TOCTOU vulnerabilities.
- Prompt injection and indirect code execution: Malicious instructions in user-visible content or API payloads.
- Over-delegation and identity ambiguity: Weak separation of user and agent actions leads to "confused deputy" scenarios.
- Chain-of-Thought exposure: Leaked planning traces in memory or logs can be exploited for adversarial influence (Jones et al., 7 Jul 2025).
Evaluation Frameworks and Benchmarks: Tools such as VPI-Bench and BrowserART systematically quantify BUA vulnerability rates, with observed attack success or attempt rates reaching 80%–100% in controlled studies (Cao et al., 3 Jun 2025, Kumar et al., 2024).
Defense Strategies:
- Defense-in-Depth: Architectures combine input sanitization, planner-executor isolation, formal security analyzers (pre-execution blocklists), and session reset/throttling.
- Oversight Mechanisms: Human-in-the-loop approval for privileged actions, transparent logging with audit trails.
- Task-aware Reasoning: LLM-as-judge, action consistency checks, and ensemble verification for deviation detection.
- Least-privilege enforcement, provenance tagging, and ephemeral tokenization for session and delegation control (Jones et al., 7 Jul 2025, Mudryi et al., 19 May 2025).
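
Two of the defenses listed above, least-privilege scoping and human approval for privileged actions, can be combined in a pre-execution check. The following is a minimal sketch under stated assumptions: the `Action` type, the `PRIVILEGED` set, and the host allowlist are hypothetical, and a real deployment would layer this with the other defenses rather than rely on it alone.

```python
from dataclasses import dataclass
from typing import Callable, Set
from urllib.parse import urlparse


@dataclass(frozen=True)
class Action:
    kind: str  # e.g. "click", "fill", "submit", "navigate"
    url: str   # page the action targets


# Assumed set of action kinds that need explicit user approval.
PRIVILEGED = {"submit", "download"}


def authorize(action: Action, allowed_hosts: Set[str],
              approve: Callable[[Action], bool]) -> bool:
    """Decide whether the executor may run an action proposed by the planner."""
    host = urlparse(action.url).hostname or ""
    if host not in allowed_hosts:
        return False  # least privilege: host is outside the task's scope
    if action.kind in PRIVILEGED:
        return approve(action)  # human-in-the-loop approval
    return True
```

Keeping this check outside the LLM planner means an injected instruction can change what the agent proposes but not what is allowed to execute.
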
Privacy Mediation: Dedicated add-ons such as PrivWeb approach privacy by extracting interface data, classifying PII with local LLMs, selectively anonymizing high-sensitivity items, and providing tiered (allow/deny) user control mechanisms. This maintains usability while reducing cognitive overhead and minimizing exposure of private data to agents (Zhang et al., 15 Sep 2025).
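
The tiered mediation flow can be sketched as below. This is not PrivWeb's implementation: regular expressions stand in for the local LLM classifier, and the two patterns, the `HIGH` tier, and the `allow` callback are assumptions made so the example runs on its own.

```python
import re
from typing import Callable

# Stand-in classifier: simple patterns instead of a local LLM (assumption).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "card": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
}
HIGH = {"card"}  # assumed high-sensitivity tier, always anonymized


def mediate(text: str, allow: Callable[[str, str], bool]) -> str:
    """Mask high-sensitivity items; ask the user (allow/deny) about the rest."""
    for kind, pattern in PATTERNS.items():
        def repl(match, kind=kind):
            if kind in HIGH or not allow(kind, match.group(0)):
                return f"[{kind.upper()}]"
            return match.group(0)
        text = pattern.sub(repl, text)
    return text
```

For example, with a user who allows everything they are asked about, `mediate("Mail ana@example.org, card 4111 1111 1111 1111.", lambda kind, value: True)` keeps the email address but still replaces the card number with `[CARD]` before the text reaches the agent.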

5. Benchmarks, Evaluation Criteria, and Limitations

Live-Content and Multimodal Evaluation: Benchmarks such as BEARCUBS stress live web interaction, multi-step task resolution, and broad coverage of modalities (text, video, 3D, games), demanding agents that act as human users would rather than exploiting text-only shortcuts (Song et al., 10 Mar 2025).
Performance Gaps: Humans substantially outperform current BUAs, which demonstrate pronounced deficiencies in source selection, planning, and multimodal engagement. For example, human accuracy on BEARCUBS is 84.7%, while the best-performing BUA achieves only 24.3%, a gap of 60.4 percentage points. On practical tasks, agents are vulnerable both to task planning failures and to adversarial manipulations (e.g., visual prompt injection, concealed command chaining).
Evaluation Criteria: Metrics include task and subtask success rate, adherence to validated browsing trajectories, real-time action sequencing, and robustness against adversarial input. Transparent action logging and correct grounding in intended sources are increasingly mandated (Song et al., 10 Mar 2025).
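
Two of these metrics can be made concrete with a short sketch. The benchmarks' exact scoring rules are not given here, so the adherence measure below (longest common subsequence with a reference trajectory, normalized by the reference length) is an assumed definition chosen for illustration.

```python
from typing import List, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks completed successfully (0.0 for an empty run)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def trajectory_adherence(actual: List[str], reference: List[str]) -> float:
    """Share of reference steps that appear, in order, in the actual trajectory.

    Computed as LCS(actual, reference) / len(reference); an assumed metric.
    """
    if not reference:
        return 1.0
    prev = [0] * (len(reference) + 1)
    for step in actual:
        cur = [0]
        for j, ref_step in enumerate(reference, 1):
            if step == ref_step:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(reference)
```

Under this definition an agent that takes a detour but still performs every validated step in order scores 1.0, while one that skips half of the reference steps scores 0.5.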

6. Future Directions and Paradigm Shifts

API Integration and Hybridization: Hybrid agents combining both browser-based control and direct API invocation demonstrate substantially improved performance for online tasks, reducing action counts and error probability, and allowing seamless fallback between modalities as dictated by content and context (Song et al., 2024).
Agentic Web Interface (AWI): A proposed shift advocates designing web interfaces natively optimized for agentic access, featuring standardized, abstracted state representations, unified high-level actions, controlled access models, and efficiency enablers. Six guiding principles have been outlined: standardization, human centricity, built-in safety, optimal representation, host efficiency, and developer friendliness. This strategic refactoring aims to transcend the inefficiencies and vulnerabilities associated with forcing agent architectures to operate atop human-centric UIs or developer APIs (Lù et al., 12 Jun 2025).
Security Evaluation and Community Collaboration: The systematization of risk necessitates multilayered, community-driven evaluation frameworks incorporating input provenance, cryptographic auditing, runtime planning audits, and task-aware confirmation protocols. Cross-disciplinary collaboration is emphasized to advance agent–web interoperability, safety, and privacy (Jones et al., 7 Jul 2025, Lù et al., 12 Jun 2025).
Ongoing Research Gaps: Key open challenges include effective resistance to prompt injection (especially visual or indirect forms), robust handling of ambiguous or evolving human intentions, scalable and explainable action transparency, and continual adaptation to shifting web content and interface architectures.

7. Applications and Outlook

The BUA paradigm underpins a spectrum of emerging applications:

Authentication and Identity Management: Invoking local services for eID verification and digital signing through browser-mediated redirections.
Large-Scale Automated UX Testing: Thousands of LLM-driven simulated users executing complex, multi-step tasks for experimental and usability research.
Privacy Mediation: Proactive, user-driven anonymization and control of sensitive data exposure in agent-mediated browsing.
Security Policy Enforcement: Automated assessment, monitoring, and mitigation of vulnerabilities in enterprise contexts through real-time parsing and behavioral monitoring.
Human–Agent Collaboration: Interaction frameworks shifting from single-instruction automation toward iterative, goal-aligned, exploration/exploitation-driven browsing support.

Research continues to refine BUA architectures, with emphasis on secure, efficient, and contextually aware integration of browser automation, agentic reasoning, and user interaction for real-world, adversarially robust deployments.