Go-Browse: Graph-Based Web Exploration

Updated 22 December 2025
  • Go-Browse is a method for training web agents using explicit graph search to navigate complex, deeply nested web environments.
  • It systematically expands web graphs by managing exploration frontiers and confirming task feasibility with modules like NavExplorer and FeasibilityChecker.
  • On the WebArena benchmark it achieves state-of-the-art results among open-weights models under 10B parameters, reflecting effective data collection and improved performance on deep navigation tasks.

Go-Browse is a method for training web agents capable of structured exploration across web environments, with an emphasis on scalable, diverse data collection. By reframing agent exploration as an explicit graph search, Go-Browse enables efficient and comprehensive coverage of previously unseen web sites, supporting the development and fine-tuning of web agents that demonstrate improved performance on complex, deeply nested navigation tasks. The methodology is instantiated on the WebArena benchmark, yielding state-of-the-art results for open-weights models in the sub-10B parameter regime (Gandhi et al., 4 Jun 2025).

1. Formal Environment and Problem Setting

The underlying environment is modeled as a deterministic or stochastic transition system. The state space $S$ is composed of the current goal or task description $g$, a flattened accessibility tree (DOM snapshot) of the web page, the history of past actions and execution errors, and a list of available browser primitives (e.g., click, fill, scroll, goto). Actions $a_t$ are atomic browser calls such as click(elemID), scroll(x,y), fill(elemID,text), goto(url), send_msg_to_user(msg), and report_infeasible(reason), executed via a browser environment simulator. The transition function $T$ is defined by $s_{t+1} = T(s_t, a_t)$.

Each trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ is evaluated against the goal $g$ by a binary reward model $R(g, \tau) \in \{0,1\}$, marking success (1) or failure (0). A central research challenge addressed by Go-Browse is efficient exploration: agents often fail to discover semantically significant or deeply nested web pages, instead repeating unproductive action sequences when unfamiliar with the environment.
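The setting above can be illustrated with a minimal Python sketch. The class names, the toy transition and policy, and the choice to treat send_msg_to_user and report_infeasible as episode-ending actions are assumptions made for illustration; they are not taken from the Go-Browse implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    name: str                      # e.g. "click", "fill", "goto"
    args: Tuple[str, ...] = ()


@dataclass(frozen=True)
class State:
    goal: str                      # task description g
    axtree: str                    # stand-in for the flattened accessibility tree
    history: Tuple[Action, ...] = ()
    errors: Tuple[str, ...] = ()


Trajectory = List[Tuple[State, Action]]

# Assumption: these two actions end an episode.
TERMINAL = {"send_msg_to_user", "report_infeasible"}


def rollout(s0: State,
            policy: Callable[[State], Action],
            transition: Callable[[State, Action], State],
            max_steps: int = 10) -> Trajectory:
    """Collect tau = (s_1, a_1, ..., s_T, a_T) by repeatedly applying T."""
    traj: Trajectory = []
    s = s0
    for _ in range(max_steps):
        a = policy(s)
        traj.append((s, a))
        if a.name in TERMINAL:
            break
        s = transition(s, a)
    return traj


# Toy environment: the "page" is just a URL string.
def toy_transition(s: State, a: Action) -> State:
    page = a.args[0] if a.name == "goto" else s.axtree
    return State(s.goal, page, s.history + (a,), s.errors)


def toy_policy(s: State) -> Action:
    if s.axtree != "/orders":
        return Action("goto", ("/orders",))
    return Action("send_msg_to_user", ("done",))


def toy_reward(goal: str, traj: Trajectory) -> int:
    """Binary reward R(g, tau): 1 if the episode ends on the target page."""
    last_state, last_action = traj[-1]
    reached = last_state.axtree == "/orders"
    return int(reached and last_action.name == "send_msg_to_user")


s0 = State(goal="Open the orders page", axtree="/home")
tau = rollout(s0, toy_policy, toy_transition)
reward = toy_reward(s0.goal, tau)
```

Here the reward is computed by a hand-written rule; in the described setting it would come from a reward model judging the trajectory against the goal.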

2. Structured Exploration via Graph Construction

Go-Browse operationalizes structured exploration by representing the web environment as a graph $G = (V, E)$, where nodes $V$ correspond to unique URLs (or page states) and edges $E$ are navigation transitions between them. Data collection proceeds by iteratively expanding this graph:

  • Graph expansion: At each discovered node $v$, NavExplorer and PageExplorer modules propose candidate navigation tasks $g_i$. Tasks that lead from $v$ to a previously unknown URL $v'$ (confirming feasibility via the FeasibilityChecker) yield new edges $e = (v \rightarrow v')$.
  • Frontier management: A frontier $F \subseteq V$ contains nodes that have been discovered but not thoroughly explored. Exploration iteratively selects nodes from $F$ by cost heuristics such as minimum depth (breadth-first), i.e., $v \leftarrow \arg\min_{u \in F} \mathrm{depth}(u)$, or more generally by $v \leftarrow \arg\min_{u \in F} [\mathrm{depth}(u) + h(u)]$ for a learned heuristic $h$.
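The frontier selection rule can be sketched with a priority queue. The `Frontier` class, its method names, and the default zero heuristic are hypothetical; with a zero heuristic the ordering reduces to breadth-first selection by depth, with ties broken by insertion order.

```python
import heapq
import itertools
from typing import Callable, List, Set, Tuple


class Frontier:
    """Pops the node minimizing depth(u) + h(u); h defaults to 0 (breadth-first)."""

    def __init__(self, heuristic: Callable[[str], float] = lambda url: 0.0):
        self._heap: List[Tuple[float, int, str, int]] = []
        self._counter = itertools.count()   # tie-breaker: insertion order
        self._seen: Set[str] = set()        # every URL ever pushed
        self._h = heuristic

    def push(self, url: str, depth: int) -> bool:
        """Add a newly discovered URL; returns False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        priority = depth + self._h(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url, depth))
        return True

    def pop(self) -> Tuple[str, int]:
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __bool__(self) -> bool:
        return bool(self._heap)


frontier = Frontier()
frontier.push("/", 0)
frontier.push("/a", 1)
frontier.push("/a/b", 2)
frontier.push("/c", 1)
duplicate_added = frontier.push("/a", 1)   # already known, ignored

order = []
while frontier:
    url, _ = frontier.pop()
    order.append(url)
```

Passing a non-trivial `heuristic` callable changes only the priority computation, so a learned $h$ could be swapped in without touching the rest of the exploration loop.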

The complete Go-Browse algorithm is outlined in the following pseudocode:

procedure Go-Browse(Websites W)
  Initialize dataset D ← ∅; Graph G=(V,E) ← (∅,∅); Frontier F ← ∅
  for each site W_i in W do
    v_root ← root URL of W_i
    V ← V ∪ {v_root};  F ← F ∪ {v_root}
    while F ≠ ∅ do
      v ← SelectAndRemoveFromFrontier(F)
      s_v ← GetCurrentState(v)
      G_nav ← NavExplorer.propose_tasks(s_v)
      G_local ← PageExplorer.propose_tasks(s_v)
      G_prop ← G_nav ∪ G_local
      G_feas ← ∅
      for each task g in G_prop do
        (is_feas, τ_fc, v_new) ← FeasibilityChecker.check_and_collect(g,s_v,R,N_max)
        if is_feas then
          D ← D ∪ {(g, τ_fc)}; G_feas ← G_feas ∪ {g}
          if v_new ∉ V then
            V ← V ∪ {v_new};  E ← E ∪ {(v → v_new)}; F ← F ∪ {v_new}
          end if
        end if
      end for
      for each g in G_feas do
        T_pref  ← Solvers.sample(g, s_v, R, prefixed=True)
        T_unpref ← Solvers.sample(g, v_root, R, prefixed=False)
        D ← D ∪ { (g, τ) | τ ∈ T_pref ∪ T_unpref }
      end for
    end while
  end for
  return D
end procedure

This approach enables systematic graph expansion, facilitating the reuse of exploration knowledge and reducing redundant sampling.
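To make the control flow concrete, the loop can be run on a toy site. Everything below is a simplified stand-in: `SITE`, `propose_tasks`, `check_feasible`, and the string-based trajectories are hypothetical placeholders for the LLM-driven NavExplorer, PageExplorer, FeasibilityChecker, and Solvers modules, and the frontier is a plain breadth-first queue.

```python
from collections import deque

# Hypothetical toy site: URL -> outgoing links. "/missing" is a dead link.
SITE = {
    "/": ["/catalog", "/account"],
    "/catalog": ["/catalog/shoes", "/missing"],
    "/account": [],
    "/catalog/shoes": [],
}


def propose_tasks(url):
    """Stand-in for NavExplorer/PageExplorer: one navigation task per link."""
    return [(f"Go from {url} to {target}", target) for target in SITE[url]]


def check_feasible(task, url):
    """Stand-in for FeasibilityChecker: feasible iff the target page exists."""
    _, target = task
    if target in SITE:
        return True, [f"goto({url})", f"click(link to {target})"], target
    return False, [], None


def path_from_root(v, parent):
    """Clicks needed to reach v from the root, following discovery edges."""
    steps = []
    while parent[v] is not None:
        steps.append(f"click(link to {v})")
        v = parent[v]
    return steps[::-1]


def go_browse(root):
    dataset = []                      # D: (goal, trajectory) pairs
    nodes, edges = {root}, set()      # G = (V, E)
    parent = {root: None}
    frontier = deque([root])          # F, explored breadth-first
    while frontier:
        v = frontier.popleft()
        feasible = []
        for task in propose_tasks(v):
            ok, traj, v_new = check_feasible(task, v)
            if not ok:
                continue
            dataset.append((task[0], traj))
            feasible.append(task)
            if v_new not in nodes:
                nodes.add(v_new)
                edges.add((v, v_new))
                parent[v_new] = v
                frontier.append(v_new)
        for goal, target in feasible:
            final_click = f"click(link to {target})"
            # Prefixed: the solver starts at v.
            dataset.append((goal, [final_click]))
            # Unprefixed: the solver starts at the root and must reach v first.
            dataset.append((goal, path_from_root(v, parent) + [final_click]))
    return dataset, nodes, edges


dataset, nodes, edges = go_browse("/")
```

On this toy site three tasks are feasible, so the dataset holds nine entries: one feasibility-check trajectory plus one prefixed and one unprefixed sample per task. The dead link produces no node and no data.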

3. Dataset Construction and Composition

Go-Browse was applied to the five self-hosted WebArena domains (Shopping Admin, Shopping, Reddit, Gitlab, Map), targeting 20 distinct URLs per domain (100 in total). The resulting Go-Browse-WA dataset comprises:

| Statistic | Value |
| --- | --- |
| Successful trajectories | 9,504 |
| Unsuccessful trajectories | 17,245 |
| Total trajectories | 26,749 |
| Successful steps | 39,339 |
| Failed steps | 157,123 |
| Total steps | 196,462 |
| Unique task descriptions | 3,422 |

Of the successful trajectories, 36.6% were sampled by GPT-4o-mini, 33.9% by Claude-3.7-Sonnet, and 29.5% by Qwen-2.5-7B-Instruct.
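A few derived quantities follow directly from the reported counts. The short sketch below recomputes the totals and derives the trajectory-level success rate and average trajectory lengths; the variable names are illustrative.

```python
# (trajectories, steps) as reported for the Go-Browse-WA dataset
counts = {
    "successful": (9_504, 39_339),
    "unsuccessful": (17_245, 157_123),
}

total_trajectories = sum(t for t, _ in counts.values())
total_steps = sum(s for _, s in counts.values())

# Fraction of sampled trajectories judged successful
success_rate = counts["successful"][0] / total_trajectories

# Average number of steps per trajectory in each group
avg_steps = {name: steps / trajs for name, (trajs, steps) in counts.items()}
```

Roughly 35.5% of sampled trajectories succeed, and unsuccessful trajectories are more than twice as long on average (about 9.1 steps versus 4.1).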

Task collection included both "prefixed" (starting from a discovered node $v$) and "unprefixed" (starting from the root) sampling strategies, supporting downstream model robustness and the ability to bootstrap weaker models, particularly for deeper graph nodes.

4. Model Architecture and Training Regimen

Supervised fine-tuning was conducted on the 7B-parameter Qwen-2.5-7B-Instruct LLM, using only successful (goal, trajectory) pairs. Each instance begins with the user goal, followed by a sequence of $(\text{state} \rightarrow \text{action})$ pairs, with the model trained to autoregressively generate action sequences by minimizing cross-entropy loss. No auxiliary losses were included.
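One way to write this objective, as a sketch rather than the authors' exact formulation: let $D^{+}$ denote the set of successful (goal, trajectory) pairs and $\pi_\theta$ the fine-tuned model (both symbols are introduced here for illustration). Conditioning each action on the goal and on all preceding states and actions is an assumption about how the context is assembled.

```latex
\mathcal{L}(\theta) = - \sum_{(g, \tau) \in D^{+}} \sum_{t=1}^{T}
  \log \pi_\theta\left(a_t \mid g, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t\right)
```

Under this form, the log-probability of each action decomposes into a sum over its tokens, so the loss is a token-level cross-entropy computed on action tokens only.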

Training configuration:

  • Maximum sequence length: 24K tokens
  • Batch size: 8 (1 per GPU), gradient accumulation: 4 (effective batch: 32)
  • Learning rate: $2 \times 10^{-5}$, Adam optimizer
  • 2 epochs over the 9,504 successful trajectories
  • Hardware: 8 × H100 GPUs; total ≈ 40 hours
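The reported settings can be collected in a small configuration object that also checks the effective batch size. The `TrainConfig` class and its field names are hypothetical and not tied to any training framework; the value 24_000 is an assumed reading of "24K tokens".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    max_seq_len: int = 24_000       # assumption: "24K" read as 24,000 tokens
    num_gpus: int = 8               # 8 x H100
    per_device_batch: int = 1
    grad_accum_steps: int = 4
    learning_rate: float = 2e-5
    epochs: int = 2

    @property
    def global_batch(self) -> int:
        """Examples processed per forward/backward pass across all GPUs."""
        return self.num_gpus * self.per_device_batch

    @property
    def effective_batch(self) -> int:
        """Examples contributing to each optimizer update."""
        return self.global_batch * self.grad_accum_steps


cfg = TrainConfig()
```

With one example per GPU on eight GPUs and four accumulation steps, each optimizer update sees 32 examples, matching the reported effective batch.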

Additionally, a comparable model was trained on the NNetNav-WA dataset (45K steps), establishing a baseline for evaluation.

5. Empirical Evaluation and Comparative Analysis

Performance was assessed on 812 WebArena test tasks within BrowserGym, using binary task completion rates as the metric. Key results are as follows:

| Model | Overall | Admin | Shopping | Reddit | Gitlab | Map |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | 19.3% | 19.2% | 19.3% | 21.1% | 20.9% | 15.6% |
| GPT-4o | 37.6% | 35.7% | 32.3% | 50.9% | 36.7% | 37.5% |
| Claude-3.7-Sonnet | 45.4% | 37.4% | 37.0% | 58.8% | 52.0% | 47.7% |
| Qwen-2.5-7B-Instruct | 8.3% | 7.1% | 9.4% | 7.9% | 8.7% | 7.8% |
| NNetNav-7B | 18.8% | 14.3% | 20.3% | 23.7% | 19.9% | 17.2% |
| Go-Browse-7B | 21.7% | 25.3% | 22.4% | 30.7% | 15.3% | 17.9% |

Go-Browse-7B achieves a 21.7% overall success rate, surpassing GPT-4o-mini by 2.4 percentage points and NNetNav-7B by 2.9 percentage points. Notable observations include:

  • Tasks with deeper URL paths (depth $\geq 5$) were solved more effectively by Go-Browse-7B.
  • Prefixed trajectory sampling provided a 20–30% higher success rate on deep-page navigation for Qwen-2.5.
  • Task diversity is enhanced, reducing the frequency of redundant, shallow navigation episodes relative to NNetNav.
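The margins quoted above, and the per-domain picture behind them, can be recomputed from the reported success rates. The helper names below are illustrative.

```python
COLUMNS = ["Overall", "Admin", "Shopping", "Reddit", "Gitlab", "Map"]

# Success rates (%) as reported for the 7B-scale comparison
RESULTS = {
    "GPT-4o-mini":  [19.3, 19.2, 19.3, 21.1, 20.9, 15.6],
    "NNetNav-7B":   [18.8, 14.3, 20.3, 23.7, 19.9, 17.2],
    "Go-Browse-7B": [21.7, 25.3, 22.4, 30.7, 15.3, 17.9],
}


def margin(model: str, baseline: str) -> dict:
    """Difference in percentage points, per column, rounded to one decimal."""
    pairs = zip(COLUMNS, RESULTS[model], RESULTS[baseline])
    return {col: round(a - b, 1) for col, a, b in pairs}


vs_nnetnav = margin("Go-Browse-7B", "NNetNav-7B")
vs_mini = margin("Go-Browse-7B", "GPT-4o-mini")

# Domains where Go-Browse-7B trails the baseline
trailing = [c for c in COLUMNS[1:] if vs_nnetnav[c] < 0]
```

Against NNetNav-7B the largest gain is on Shopping Admin (11.0 points), while Gitlab is the one domain where Go-Browse-7B trails (by 4.6 points).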

6. Limitations and Prospective Extensions

The current Go-Browse experiments are constrained to the five WebArena domains. Broader generalization would require further data from diverse, real-world sites (e.g., e-commerce, banking, news). Fine-tuning presently leverages only successful traces; incorporating the approximately 17K unsuccessful trajectories (about 157K steps) through auxiliary failure signals or RL-style objectives is suggested as a mechanism for improved robustness. Scaling beyond the 7B parameter threshold and investigating in-context or retrieval-augmented architectures represent logical extensions for bridging the performance gap relative to closed-weight frontier models.

A plausible implication is that the explicit graph-based exploration methodology is effective not only for data efficiency but also for supporting modular task generation and enabling sophisticated downstream agent behavior in complex, multi-hop web environments.
