Go-Browse: Graph-Based Web Exploration
- Go-Browse is a method for training web agents using explicit graph search to navigate complex, deeply nested web environments.
- It systematically expands web graphs by managing exploration frontiers and confirming task feasibility with modules like NavExplorer and FeasibilityChecker.
- On the WebArena benchmark, a 7B model fine-tuned on Go-Browse data achieves state-of-the-art results among open-weights models under 10B parameters, with improved performance on deep navigation tasks.
Go-Browse is a method for training web agents capable of structured exploration across web environments, with an emphasis on scalable, diverse data collection. By reframing agent exploration as an explicit graph search, Go-Browse enables efficient and comprehensive coverage of previously unseen websites, supporting the development and fine-tuning of web agents that perform better on complex, deeply nested navigation tasks. The methodology is instantiated on the WebArena benchmark, yielding state-of-the-art results for open-weights models in the sub-10B parameter regime (Gandhi et al., 4 Jun 2025).
1. Formal Environment and Problem Setting
The underlying environment is modeled as a deterministic or stochastic transition system. Each state $s \in \mathcal{S}$ is composed of the current goal or task description $g$, a flattened accessibility tree (DOM snapshot) of the web page, the history of past actions and execution errors, and a list of available browser primitives (e.g., click, fill, scroll, goto). Actions $a \in \mathcal{A}$ are atomic browser calls such as click(elemID), scroll(x,y), fill(elemID,text), goto(url), send_msg_to_user(msg), and report_infeasible(reason), executed via a browser environment simulator. The transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ maps the current state and the chosen action to the next state.
Each trajectory $\tau$ is evaluated against the goal $g$ by a binary reward model $R(g, \tau) \in \{0, 1\}$, marking success (1) or failure (0). A central research challenge addressed by Go-Browse is efficient exploration: agents often fail to discover semantically significant or deeply nested web pages, instead repeating unproductive action sequences when unfamiliar with the environment.
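The state, action, and reward definitions above can be made concrete with a small sketch. The class and field names (`State`, `Action`, `axtree`, `last_error`) and the `toy_judge` function are illustrative assumptions, not part of Go-Browse or BrowserGym; the judge here is a trivial stand-in for a real reward model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    """One atomic browser call, e.g. click(elemID) or goto(url)."""
    name: str
    args: Tuple[str, ...] = ()


@dataclass
class State:
    """Observation at one step; field names are illustrative assumptions."""
    goal: str                                   # task description g
    axtree: str                                 # flattened accessibility tree
    history: List[Action] = field(default_factory=list)
    last_error: str = ""                        # most recent execution error, if any


# A trajectory is the ordered list of (state, action) steps.
Trajectory = List[Tuple[State, Action]]


def binary_reward(goal: str, trajectory: Trajectory,
                  judge: Callable[[str, Trajectory], bool]) -> int:
    """Turn a judge's verdict into a 0/1 reward."""
    return 1 if judge(goal, trajectory) else 0


def toy_judge(goal: str, trajectory: Trajectory) -> bool:
    """Toy stand-in: succeed if the agent ended by messaging the user."""
    return bool(trajectory) and trajectory[-1][1].name == "send_msg_to_user"


state = State(goal="Find the order total", axtree="[1] button 'Orders'")
steps: Trajectory = [
    (state, Action("click", ("1",))),
    (state, Action("send_msg_to_user", ("The total is shown on the page",))),
]
print(binary_reward(state.goal, steps, toy_judge))  # prints 1
```

Keeping the judge as a plain callable means a learned or model-based reward could be swapped in without touching the trajectory representation.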
2. Structured Exploration via Graph Construction
Go-Browse operationalizes structured exploration by representing the web environment as a graph $G = (V, E)$, where nodes $v \in V$ correspond to unique URLs (or page states) and edges in $E$ are navigation transitions between them. Data collection proceeds by iteratively expanding this graph:
- Graph expansion: At each discovered node $v$, the NavExplorer and PageExplorer modules propose candidate navigation and page-local tasks $g$. A task that the FeasibilityChecker confirms as feasible and that leads from $v$ to a previously unknown URL $v_{\text{new}}$ yields a new edge $(v \to v_{\text{new}})$.
- Frontier management: A frontier $F$ contains nodes that have been discovered but not thoroughly explored. Exploration iteratively selects nodes from $F$ by cost heuristics such as minimum depth (breadth-first), i.e., $v = \arg\min_{u \in F} \mathrm{depth}(u)$, or more generally by $v = \arg\min_{u \in F} h(u)$ for a learned heuristic $h$.
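Frontier selection by a cost heuristic can be sketched with a priority queue. This is a minimal illustration, assuming nodes are identified by URL strings and that the caller supplies each node's depth; the `Frontier` class and its method names are hypothetical and not taken from the Go-Browse code.

```python
import heapq
from typing import Callable, Optional


class Frontier:
    """Priority queue of discovered-but-unexplored nodes."""

    def __init__(self, heuristic: Optional[Callable[[str, int], float]] = None):
        # Default cost is the node's depth, which yields breadth-first order.
        self._cost = heuristic or (lambda node, depth: depth)
        self._heap = []
        self._seen = set()
        self._counter = 0   # tie-breaker: equal-cost nodes pop in insertion order

    def push(self, node: str, depth: int) -> bool:
        """Add a node unless it was already discovered; return True if added."""
        if node in self._seen:
            return False
        self._seen.add(node)
        heapq.heappush(self._heap, (self._cost(node, depth), self._counter, node, depth))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the (node, depth) pair with the lowest cost."""
        _, _, node, depth = heapq.heappop(self._heap)
        return node, depth

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push("/", 0)
frontier.push("/catalog/item-1", 2)
frontier.push("/catalog", 1)
order = [frontier.pop()[0] for _ in range(len(frontier))]
print(order)  # ['/', '/catalog', '/catalog/item-1']
```

Passing a different `heuristic` callable changes the exploration order without altering the rest of the loop, which is where a learned cost function would plug in.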
The comprehensive Go-Browse algorithm is outlined by the following pseudocode:

    procedure Go-Browse(Websites W)
        Initialize dataset D ← ∅; Graph G=(V,E) ← (∅,∅); Frontier F ← ∅
        for each site W_i in W do
            v_root ← root URL of W_i
            V ← V ∪ {v_root}; F ← F ∪ {v_root}
            while F ≠ ∅ do
                v ← SelectAndRemoveFromFrontier(F)
                s_v ← GetCurrentState(v)
                G_nav ← NavExplorer.propose_tasks(s_v)
                G_local ← PageExplorer.propose_tasks(s_v)
                G_prop ← G_nav ∪ G_local
                G_feas ← ∅
                for each task g in G_prop do
                    (is_feas, τ_fc, v_new) ← FeasibilityChecker.check_and_collect(g, s_v, R, N_max)
                    if is_feas then
                        D ← D ∪ {(g, τ_fc)}; G_feas ← G_feas ∪ {g}
                        if v_new ∉ V then
                            V ← V ∪ {v_new}; E ← E ∪ {(v → v_new)}; F ← F ∪ {v_new}
                        end if
                    end if
                end for
                for each g in G_feas do
                    T_pref ← Solvers.sample(g, s_v, R, prefixed=True)
                    T_unpref ← Solvers.sample(g, v_root, R, prefixed=False)
                    D ← D ∪ { (g, τ) | τ ∈ T_pref ∪ T_unpref }
                end for
            end while
        end for
        return D
    end procedure

This approach enables systematic graph expansion, facilitating the reuse of exploration knowledge and reducing redundant sampling.
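The exploration loop can also be sketched in Python for a single site. This is a simplified illustration rather than the authors' implementation: the explorer, feasibility-checker, and solver modules are replaced by plain callables, trajectories are lists of strings, the frontier is breadth-first, and `max_nodes` is a hypothetical cap on how many nodes get expanded.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple

# (is_feasible, checker trajectory, URL reached or None)
CheckResult = Tuple[bool, List[str], Optional[str]]


def go_browse(
    root: str,
    propose_tasks: Callable[[str], List[str]],
    check_feasible: Callable[[str, str], CheckResult],
    sample_solvers: Callable[[str, str], List[List[str]]],
    max_nodes: int = 20,
):
    """Expand a site graph breadth-first and collect (task, trajectory) pairs."""
    dataset, nodes, edges = [], {root}, set()
    frontier = deque([root])
    expanded = 0
    while frontier and expanded < max_nodes:
        v = frontier.popleft()
        expanded += 1
        feasible = []
        for task in propose_tasks(v):
            ok, traj, v_new = check_feasible(task, v)
            if not ok:
                continue
            dataset.append((task, traj))
            feasible.append(task)
            if v_new is not None and v_new not in nodes:
                nodes.add(v_new)
                edges.add((v, v_new))
                frontier.append(v_new)
        for task in feasible:
            # Prefixed sampling starts at v, unprefixed sampling at the root.
            for start in (v, root):
                for traj in sample_solvers(task, start):
                    dataset.append((task, traj))
    return dataset, nodes, edges


# Toy site: each page lists the URLs it links to.
SITE = {
    "/": ["/catalog", "/about"],
    "/catalog": ["/catalog/item-1"],
    "/about": [],
    "/catalog/item-1": [],
}


def propose(v: str) -> List[str]:
    return [f"go to {u}" for u in SITE[v]]


def check(task: str, v: str) -> CheckResult:
    target = task[len("go to "):]
    return True, [f"goto({target})"], target


def solve(task: str, start: str) -> List[List[str]]:
    return [[f"start at {start}", task]]


dataset, nodes, edges = go_browse("/", propose, check, solve)
print(len(dataset), len(nodes), len(edges))  # 9 4 3
```

In the toy run, each of the three feasible tasks contributes one checker trajectory plus one prefixed and one unprefixed solver trajectory, which gives the nine dataset entries.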
3. Dataset Construction and Composition
Go-Browse was applied to the five self-hosted WebArena domains (Shopping Admin, Shopping, Reddit, Gitlab, Map), targeting 20 distinct URLs per domain (100 in total). The resulting Go-Browse-WA dataset comprises:
| Statistic | Value |
|---|---|
| Successful trajectories | 9,504 |
| Unsuccessful trajectories | 17,245 |
| Total trajectories | 26,749 |
| Successful steps | 39,339 |
| Failed steps | 157,123 |
| Total steps | 196,462 |
| Unique task descriptions | 3,422 |
Of the successful trajectories, 36.6% were sampled by GPT-4o-mini, 33.9% by Claude-3.7-Sonnet, and 29.5% by Qwen-2.5-7B-Instruct.
Task collection included both "prefixed" (starting from a discovered node $v$) and "unprefixed" (starting from the root) sampling strategies, supporting downstream model robustness and the ability to bootstrap weaker models, particularly for deeper graph nodes.
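A few derived quantities follow directly from the counts in this section's statistics table. The short script below only re-derives totals and averages from those reported numbers; the variable names are illustrative.

```python
# Counts copied from the Go-Browse-WA statistics table.
successful = {"trajectories": 9_504, "steps": 39_339}
unsuccessful = {"trajectories": 17_245, "steps": 157_123}

total_trajectories = successful["trajectories"] + unsuccessful["trajectories"]
total_steps = successful["steps"] + unsuccessful["steps"]

# Fraction of sampled trajectories that ended in success.
success_fraction = successful["trajectories"] / total_trajectories

# Average trajectory length, split by outcome.
avg_steps_success = successful["steps"] / successful["trajectories"]
avg_steps_failure = unsuccessful["steps"] / unsuccessful["trajectories"]

print(total_trajectories, total_steps)
print(f"{success_fraction:.1%} successful")
print(f"{avg_steps_success:.1f} steps per successful trajectory")
print(f"{avg_steps_failure:.1f} steps per unsuccessful trajectory")
```

The totals match the table, roughly 36% of trajectories are successful, and successful trajectories are much shorter on average (about 4 steps) than unsuccessful ones (about 9 steps).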
4. Model Architecture and Training Regimen
Supervised fine-tuning was conducted on Qwen-2.5-7B-Instruct, a 7B-parameter LLM, using only successful (goal, trajectory) pairs. Each training instance begins with the user goal, followed by a sequence of (observation, action) pairs, and the model is trained to generate the actions autoregressively by minimizing cross-entropy loss. No auxiliary losses were included.
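One way to write the training objective is given below. The notation is an assumption made for illustration: $D^{+}$ denotes the set of successful (goal, trajectory) pairs, $o_t$ and $a_t$ the observation and action at step $t$, and $\pi_\theta$ the fine-tuned model. In practice each action spans several tokens, so the action-level log-probability decomposes further into a sum over its tokens.

```latex
\mathcal{L}(\theta)
  = - \sum_{(g,\tau) \in D^{+}} \; \sum_{t=1}^{|\tau|}
    \log \pi_\theta\!\left(a_t \,\middle|\, g,\; o_{1:t},\; a_{1:t-1}\right)
```

Under this reading, only action tokens contribute to the loss, while the goal and observations serve as conditioning context.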
Training configuration:
- Maximum sequence length: 24K tokens
- Batch size: 8 (1 per GPU), gradient accumulation: 4 (effective batch: 32)
- Learning rate: , Adam optimizer
- 2 epochs over the 9,504 successful trajectories
- Hardware: 8 × H100 GPUs; total ≈ 40 hours
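The batch-size figures in the configuration above imply a rough count of optimizer steps. The calculation below is a back-of-the-envelope sketch that assumes each successful trajectory forms exactly one training instance; if instances are built per step or trajectories are split, the step counts would differ.

```python
import math

per_gpu_batch = 1      # sequences per GPU per forward pass
num_gpus = 8
grad_accum = 4         # gradient accumulation steps
epochs = 2
num_instances = 9_504  # assumption: one instance per successful trajectory

# Sequences contributing to each optimizer update.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_instances / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 297 594
```

Under that assumption the run would take about 594 optimizer updates.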
Additionally, a comparable model was trained on the NNetNav-WA dataset (45K steps), establishing a baseline for evaluation.
5. Empirical Evaluation and Comparative Analysis
Performance was assessed on 812 WebArena test tasks within BrowserGym, using binary task completion rates as the metric. Key results are as follows:
| Model | Overall | Admin | Shopping | Reddit | Gitlab | Map |
|---|---|---|---|---|---|---|
| GPT-4o-mini | 19.3% | 19.2% | 19.3% | 21.1% | 20.9% | 15.6% |
| GPT-4o | 37.6% | 35.7% | 32.3% | 50.9% | 36.7% | 37.5% |
| Claude-3.7-Sonnet | 45.4% | 37.4% | 37.0% | 58.8% | 52.0% | 47.7% |
| Qwen-2.5-7B-Instruct | 8.3% | 7.1% | 9.4% | 7.9% | 8.7% | 7.8% |
| NNetNav-7B | 18.8% | 14.3% | 20.3% | 23.7% | 19.9% | 17.2% |
| Go-Browse-7B | 21.7% | 25.3% | 22.4% | 30.7% | 15.3% | 17.9% |
Go-Browse-7B achieves a 21.7% overall success rate, surpassing GPT-4o-mini by 2.4 percentage points and NNetNav-7B by 2.9 percentage points. Notable observations include:
- Tasks with deeper URL-paths (depth ) were solved more effectively by Go-Browse-7B.
- Prefixed trajectory sampling provided a 20–30% higher success rate on deep-page navigation for Qwen-2.5.
- Task diversity is enhanced, reducing the frequency of redundant, shallow navigation episodes relative to NNetNav.
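The evaluation metric and the margins between models can be reproduced with a few lines. The overall rates are copied from the results table; the helper names `completion_rate` and `margin_pp` are illustrative.

```python
# Overall success rates (%) copied from the results table.
overall = {
    "GPT-4o-mini": 19.3,
    "GPT-4o": 37.6,
    "Claude-3.7-Sonnet": 45.4,
    "Qwen-2.5-7B-Instruct": 8.3,
    "NNetNav-7B": 18.8,
    "Go-Browse-7B": 21.7,
}


def completion_rate(outcomes):
    """Binary task completion rate in percent for a list of 0/1 outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def margin_pp(model, baseline):
    """Difference between two overall rates, in percentage points."""
    return round(overall[model] - overall[baseline], 1)


print(completion_rate([1, 0, 0, 1]))               # 50.0
print(margin_pp("Go-Browse-7B", "GPT-4o-mini"))    # 2.4
print(margin_pp("Go-Browse-7B", "NNetNav-7B"))     # 2.9
```

Reporting the gaps in percentage points avoids confusion with relative improvement, which would be a different number for the same pair of models.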
6. Limitations and Prospective Extensions
The current Go-Browse experiments are constrained to the five WebArena domains. Broader generalization would require further data from diverse, real-world sites (e.g., e-commerce, banking, news). Fine-tuning presently uses only successful traces; incorporating the roughly 17K unsuccessful trajectories (about 157K steps) through auxiliary failure signals or RL-style objectives is suggested as a way to improve robustness. Scaling beyond 7B parameters and investigating in-context or retrieval-augmented architectures are logical extensions for narrowing the performance gap to closed-weight frontier models.
A plausible implication is that the explicit graph-based exploration methodology is effective not only for data efficiency but also for supporting modular task generation and enabling sophisticated downstream agent behavior in complex, multi-hop web environments.