BrowserAgent: Autonomous Web Agent
- BrowserAgent is a framework that autonomously interacts with live web pages using direct browser actions, mirroring human browsing behavior.
- It employs a two-phase training regime—supervised and rejection fine-tuning—to achieve robust multi-hop question answering without extensive datasets.
- Its explicit memory mechanism supports long-horizon reasoning and scalability, yielding approximately 20% higher accuracy on complex tasks compared to prior systems.
BrowserAgent is a web agent framework that autonomously interacts with live, dynamic web pages using a repertoire of human-inspired browser actions. In contrast to approaches that convert the interactive web environment into static textual representations, BrowserAgent performs direct, fine-grained manipulations in the browser, mirroring human scroll, click, type, and tab-management behaviors. The system is built on Playwright and is trained with a two-stage protocol of supervised fine-tuning followed by rejection fine-tuning, achieving state-of-the-art results on multi-hop question answering tasks without requiring the large-scale datasets or reinforcement learning commonly employed in prior systems. Further, BrowserAgent incorporates an explicit structured memory to support long-horizon reasoning. The following sections detail the architecture, operational principles, training methodology, empirical performance, memory system, and scalability of BrowserAgent (Zhang et al., 12 Oct 2025).
1. System Architecture and Human-Inspired Action Set
BrowserAgent is structured as a fully interactive browser agent that executes and reasons over a sequence of low-level browser actions exposed through Playwright:
- Page Operation Actions: click, hover, press, scroll, type. These directly manipulate page elements and interact with input fields.
- Tab Management Actions: new_tab, tab_focus, close_tab. These manage multi-tab exploration and asynchronous workflows.
- URL Navigation Actions: goto, go_back, go_forward. These traverse, revisit, or progress through navigation histories.
- Completion Action: stop. This terminates the task upon logical completion.
This action vocabulary was deliberately chosen to balance expressiveness and minimalism, ensuring the agent can approximate naturalistic web browsing without reliance on auxiliary tools that convert the DOM or rendered output to static text or partial abstractions.
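As a minimal sketch, the action vocabulary described in this section can be written down as a lookup table with a small validator. The action names come from the text; the `ActionGroup` enum, the `ACTIONS` table, the `parse_action` helper, and the whitespace-separated "name arg1 arg2" string format are all hypothetical choices made for illustration, not the format BrowserAgent itself uses.

```python
from enum import Enum


class ActionGroup(Enum):
    PAGE = "page_operation"
    TAB = "tab_management"
    URL = "url_navigation"
    DONE = "completion"


# Action names are taken from the article; the grouping table is illustrative.
ACTIONS = {
    "click": ActionGroup.PAGE,
    "hover": ActionGroup.PAGE,
    "press": ActionGroup.PAGE,
    "scroll": ActionGroup.PAGE,
    "type": ActionGroup.PAGE,
    "new_tab": ActionGroup.TAB,
    "tab_focus": ActionGroup.TAB,
    "close_tab": ActionGroup.TAB,
    "goto": ActionGroup.URL,
    "go_back": ActionGroup.URL,
    "go_forward": ActionGroup.URL,
    "stop": ActionGroup.DONE,
}


def parse_action(text: str) -> tuple[str, ActionGroup, list[str]]:
    """Split an assumed 'name arg1 arg2' string and validate the action name."""
    parts = text.strip().split()
    if not parts:
        raise ValueError("empty action")
    name, args = parts[0], parts[1:]
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    return name, ACTIONS[name], args
```

Rejecting unknown names at parse time keeps the agent confined to the twelve actions listed, which is one simple way to enforce a deliberately minimal vocabulary.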
The agent's interaction policy—explicitly modeled as a sequence of (observation, action) pairs—enables it to incrementally process and adapt to web page dynamics, capturing fine-grained changes missed by static approaches.
2. Training Procedure: Supervised and Rejection Fine-Tuning
BrowserAgent employs a two-phase training setup—Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT):
- SFT Phase:
- The initial model (Qwen2.5-7B-Instruct) is fine-tuned on 5.3K multi-turn, question-answer browser action trajectories synthesized from interactive browsing sessions.
- During SFT, the model is presented with gold-standard sequences, learning to emit explicit reasoning steps and their associated browser actions, as well as properly formatted outputs.
- RFT Phase:
- For each training input, multiple completions are generated by the SFT model.
- An evaluation metric (Exact Match, EM) is used to select among candidate completions—crucially, only training instances that elicit both correct and incorrect answers are retained.
- The trajectory with the greatest number of reasoning steps is preferentially selected; this encourages depth and compositionality in reasoning.
- To avoid catastrophic forgetting, 80% of the original SFT data is interleaved in RFT batches.
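The RFT selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `Candidate` record and both function names are hypothetical, the EM outcome is assumed to be precomputed per completion, the longest trajectory is assumed to be chosen among the correct completions only, and the 80% figure is read as "a random 80% subset of the SFT examples is mixed into the RFT set", which is only one possible interpretation of the interleaving described.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    question_id: str
    num_reasoning_steps: int
    exact_match: bool  # EM of the final answer against the gold answer


def select_rft_trajectories(candidates):
    """Keep one correct trajectory per question that has mixed outcomes."""
    by_question = {}
    for c in candidates:
        by_question.setdefault(c.question_id, []).append(c)

    selected = []
    for group in by_question.values():
        correct = [c for c in group if c.exact_match]
        # Skip questions where every completion is right or every one is wrong.
        if not correct or len(correct) == len(group):
            continue
        # Prefer the correct trajectory with the most reasoning steps.
        selected.append(max(correct, key=lambda c: c.num_reasoning_steps))
    return selected


def mix_with_sft(rft_examples, sft_examples, sft_fraction=0.8, seed=0):
    """Add a random fraction of the SFT examples to the RFT set and shuffle."""
    rng = random.Random(seed)
    k = round(len(sft_examples) * sft_fraction)
    mixed = list(rft_examples) + rng.sample(list(sft_examples), k)
    rng.shuffle(mixed)
    return mixed
```

Filtering out questions with uniform outcomes discards inputs that are either already solved or currently out of reach, so the retained examples are the ones where the model's behavior is still unstable.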
This training regime is effective in instilling both structural answer fidelity and robust multi-hop reasoning, without resorting to reinforcement learning or large-scale behavior cloning.
3. Explicit Memory Mechanism for Long-Horizon Reasoning
A critical differentiator in BrowserAgent is its explicit structured memory mechanism. At each interaction turn, the agent:
- Observes the current web state.
- Executes a browser action.
- Stores key intermediate conclusions (partial answers) in memory.
For instance, in multi-hop QA ("Who is the father of the greatest NBA player?"), the agent first determines and memorizes "Michael Jordan is the greatest NBA player," then continues to "Jordan's father is James Jordan," explicitly passing this information across steps.
This architectural pattern enables:
- Context persistence over many interaction steps.
- Avoidance of information loss in long multi-turn tasks.
- Integration of intermediate computation to reduce redundancy and support compositional problem-solving.
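One way to picture the memory mechanism is as a small store of conclusions that is rendered back into the model's input at every turn. The sketch below is hypothetical throughout: the `Memory` class, the `build_prompt` helper, and the prompt layout are illustrative assumptions, and the source does not specify how BrowserAgent serializes its memory.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    conclusions: list[str] = field(default_factory=list)

    def add(self, conclusion: str) -> None:
        """Store a new intermediate conclusion, skipping empties and duplicates."""
        if conclusion and conclusion not in self.conclusions:
            self.conclusions.append(conclusion)

    def render(self) -> str:
        """Format stored conclusions as a numbered list."""
        return "\n".join(f"{i}. {c}" for i, c in enumerate(self.conclusions, 1))


def build_prompt(question: str, observation: str, memory: Memory) -> str:
    """Combine the question, carried-over memory, and current page observation."""
    return (
        f"Question: {question}\n"
        f"Memory:\n{memory.render() or '(empty)'}\n"
        f"Observation:\n{observation}"
    )
```

With the example from the text, the first hop would call `memory.add("Michael Jordan is the greatest NBA player")`, and the prompt built for the next turn would carry that conclusion forward even after the page that produced it is no longer in view.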
4. Comparative Performance Evaluation
Empirical evaluation demonstrates that BrowserAgent-7B yields approximately 20% higher accuracy than Search-R1 on complex multi-hop reading comprehension datasets (HotpotQA, 2Wiki, Bamboogle). For HotpotQA, the model achieves 0.458 EM, outperforming both Search-R1 and WebDancer variants at equivalent or lower data scale.
These gains are attributed to enhanced reasoning chain length, better context carry-over via explicit memory, and the agent's ability to operate directly within a live dynamic web context, unlike systems bounded by static content extraction or indirect interaction.
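For readers unfamiliar with the EM metric reported in this section, a common way to compute it is sketched below. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the convention widely used in QA evaluation and are an assumption here; the source does not state which normalization BrowserAgent's evaluation applies, and the function names are illustrative.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)
```

Under this definition, an EM of 0.458 means that 45.8% of the predicted answers matched a gold answer after normalization.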
5. Scalability and Practical Deployment
BrowserAgent’s execution stack is designed for high-throughput, real-world usage:
- Utilizes a Ray-parallelized controller to orchestrate dozens of Playwright browser instances in parallel.
- Achieves throughput exceeding 50 episodes/minute on a 32-core server.
- Reduces data collection and inference costs by over an order of magnitude compared to legacy static-content agents.
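The fan-out pattern behind this kind of controller can be illustrated with the standard library. Note the substitution: the sketch uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for Ray, and `run_episode` is a hypothetical stub that sleeps instead of driving a Playwright browser, so the measured throughput says nothing about BrowserAgent's real numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_episode(task_id: int) -> dict:
    """Hypothetical stand-in for one browsing episode."""
    time.sleep(0.01)  # placeholder for browser interaction time
    return {"task_id": task_id, "status": "done"}


def run_parallel(task_ids, max_workers: int = 8):
    """Run episodes concurrently and report throughput in episodes per minute."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        results = list(pool.map(run_episode, task_ids))
    elapsed = time.perf_counter() - start
    episodes_per_minute = len(results) / elapsed * 60
    return results, episodes_per_minute
```

Because browsing episodes spend most of their time waiting on page loads, running many of them side by side is what makes throughput figures like the one quoted above attainable; a Ray-based controller applies the same idea across processes or machines.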
The agent’s modularity and infrastructure enable robust application to tasks such as automated research, live web data extraction, QA, and interactive assistant deployments.
6. Impact, Applications, and Future Directions
BrowserAgent introduces a paradigm shift in web agent design—eschewing proxy or conversion layers in favor of direct, fine-grained manipulation of web environments via browser-native actions. The model’s explicit memory and human-inspired operations significantly enhance performance on long-horizon and compositional tasks. Its scalability and modularity make it suitable for a broad spectrum of real-world information-seeking, QA, and navigation applications, providing a foundation for future advances in interactive, scalable autonomous web agents.
The observed performance gains over prior art (e.g., Search-R1, WebDancer) suggest that BrowserAgent’s approach—direct browser action space, combined with explicit memory and careful SFT/RFT protocol—is highly effective for open-domain, web-based reasoning (Zhang et al., 12 Oct 2025).