
NNetNav: Unsupervised Hierarchical Web Navigation

Updated 22 December 2025
  • NNetNav is an unsupervised framework that retroactively labels browser interactions to generate synthetic demonstrations for effective web navigation.
  • It employs a hierarchical exploration and pruning strategy that efficiently searches and refines agent behaviors for realistic task execution.
  • Empirical evaluations on WebArena and MiniWoB++ benchmarks show significant performance gains over traditional instruction-first methods.

NNetNav is an unsupervised framework for training browser-based agents via interaction-driven synthetic demonstration generation and retroactive labeling. Unlike prior methods for web automation that rely on manually curated expert demonstrations or instruction-first synthetic data, NNetNav employs an exploration-centric, hierarchical approach to search and prune the space of possible agent behaviors, retroactively assigning meaningful natural-language goals to interaction sequences. The resulting pipeline enables efficient, scalable self-supervision for browser-control agents, achieving state-of-the-art results among unsupervised methods on web navigation benchmarks (Murty et al., 2024).

1. Motivation and Problem Definition

LLMs can interface with web browsers, mapping natural-language tasks to sequences of browser actions (click, type, navigate, etc.), but their in-the-wild performance is hindered by website layout variability, dynamic content, and hidden affordances. Reliance on expensive human data collection precludes broad scalability, while prior synthetic approaches that first sample instructions and then attempt to execute them yield demonstrations that are often infeasible or trivial and offer little control over task complexity.

NNetNav addresses these issues by adopting an "interaction-first" strategy: the agent first explores the website to generate varied, feasible interaction rollouts, and only then retroactively labels prefixes of these rollouts with human-interpretable instructions when those prefixes correspond to coherent, achievable goals. This guarantees demonstration feasibility and enables precise control over the diversity and complexity of synthetic tasks (Murty et al., 2024).

2. Core Algorithmic Framework

The NNetNav procedure alternates between several LLM-based components, forming an exploration–labeling–pruning–generation loop:

  1. Exploration Policy ($\pi_{\mathrm{explore}}$): An LLM (with chain-of-thought prompting) generates a browser action $a_t$ given the observation $o_t$.
  2. State-Change Summarizer ($\Delta$): Converts triples $(o_t, a_t, o_{t+1})$ into concise natural-language state transitions $\delta_t$.
  3. Retroactive Labeler ($L$): Maps a sequence $\delta_{1:t}$ to a plausible instruction $\hat{g}_t$ describing the achieved subtask.
  4. Reward Model ($s(\hat{g}_t, \delta_{1:t})$): Verifies that the interaction prefix genuinely satisfies the instruction $\hat{g}_t$, outputting a binary decision.
  5. Hierarchical Pruning: If $s(\hat{g}_t, \delta_{1:t}) = 0$, the episode is immediately pruned; otherwise, the tuple $(\hat{g}_t, \tau_{1:t})$ is stored as a demonstration and exploration proceeds.

This loop results in a pool of verified, retroactively annotated demonstration trajectories, each tied to a meaningful, feasible web goal instruction (Murty et al., 2024).
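
The loop can be rendered schematically as follows. This is a minimal sketch: the `env` interface, the callable signatures, and all default values are illustrative assumptions rather than details from the paper.

```python
# Sketch of the NNetNav exploration-labeling-pruning loop. The four
# LLM-based components are passed in as callables (hypothetical stand-ins
# for the paper's prompted LLM modules, not the authors' actual API).

def nnetnav_episode(env, explore, summarize, label, verify,
                    max_steps=20, checkpoint=4):
    """Run one exploration episode, returning verified (goal, prefix) demos.

    explore(obs) -> action                   # exploration policy pi_explore
    summarize(obs, action, next_obs) -> str  # state-change summarizer Delta
    label(deltas) -> instruction or None     # retroactive labeler L
    verify(goal, deltas) -> 0 or 1           # reward model s(g, delta_{1:t})
    """
    demos, trajectory, deltas = [], [], []
    obs = env.reset()
    for t in range(1, max_steps + 1):
        action = explore(obs)
        next_obs = env.step(action)
        deltas.append(summarize(obs, action, next_obs))
        trajectory.append((obs, action))
        if t % checkpoint == 0:  # pruning checkpoint
            goal = label(deltas)
            if goal is None or verify(goal, deltas) == 0:
                break            # prune: prefix achieves no coherent subtask
            demos.append((goal, list(trajectory)))  # keep verified prefix
        obs = next_obs
    return demos
```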

3. Hierarchical Language Decomposition and Pruning

A key feature of the method is its use of hierarchical language decomposition to manage the combinatorially large space of browser trajectories. At a fixed interval ("pruning checkpoint"), NNetNav attempts to label the interaction prefix as a recognizable subtask. If no such label is plausible or the reward model does not confirm its validity, the episode is immediately terminated. This hierarchical pruning is guided by:

$$\mathrm{score}(\tau_{1:t}) = \max_{u \in \mathcal{U}} P(u \mid \delta_{1:t}),$$

where $\mathcal{U}$ is the space of candidate natural-language instructions and exploration proceeds only if $\mathrm{score}(\tau_{1:t}) \geq \epsilon$; in practice this gate is implemented via LLM scoring and reward-model verification.
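
A schematic rendering of this gate, assuming a hypothetical `lm_prob` helper that returns an LLM-derived estimate of $P(u \mid \delta_{1:t})$ and an arbitrary threshold:

```python
def should_continue(candidate_goals, deltas, lm_prob, epsilon=0.5):
    """Pruning gate: keep exploring only if some candidate instruction u
    plausibly explains the summarized prefix delta_{1:t}.

    `lm_prob(u, deltas)` is an assumed stand-in for an LLM-derived estimate
    of P(u | delta_{1:t}); epsilon=0.5 is an arbitrary illustrative choice.
    """
    score = max(lm_prob(u, deltas) for u in candidate_goals)
    return score >= epsilon
```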

As complex instructions typically decompose naturally into subtasks, this approach prunes exploration trees to focus on semantically rich, compositional episodes that are directly useful for agent training (Murty et al., 2024).

4. Demonstration Generation and Policy Training

All valid prefixes $(\hat{g}_t, \tau_{1:t})$ are accumulated into a demonstration set $\mathcal{D}$. Each trajectory is augmented with step-level reasoning traces $r_i$ via post-hoc LLM annotation to explain the rationale of each action in the context of the goal. The resulting demonstration format is:

$$(\hat{g}, o_1, r_1, a_1, o_2, r_2, a_2, \ldots, o_T, r_T, a_T)$$

A compact model (Llama-3.1-8b) is then fine-tuned on these demonstrations. For each time index $t$, the input to the policy is $(\hat{g}, o_t, a_{<t})$ and the output is $(r_t, a_t)$. Training uses an effective batch size of $128 \times 4096$ tokens, 5 epochs, a maximum sequence length of 4096, a learning rate of $2 \times 10^{-5}$ with 3% linear warmup, and the DeepSpeed ZeRO-3 optimizer on four A100 GPUs (Murty et al., 2024).
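
A sketch of how a verified demonstration might be flattened into per-step supervised examples. The prompt/target serialization and helper names are assumptions; only the $(\hat{g}, o_t, a_{<t}) \rightarrow (r_t, a_t)$ factorization and the listed hyperparameters come from the paper.

```python
# Illustrative conversion of one demonstration into per-step SFT examples.

def to_sft_examples(goal, steps):
    """steps: list of (observation, reasoning, action) tuples."""
    examples = []
    for t, (obs, reasoning, action) in enumerate(steps):
        history = [a for (_, _, a) in steps[:t]]  # action history a_{<t}
        prompt = f"Goal: {goal}\nHistory: {history}\nObservation: {obs}"
        target = f"Reasoning: {reasoning}\nAction: {action}"  # (r_t, a_t)
        examples.append({"prompt": prompt, "target": target})
    return examples

# Fine-tuning configuration reported in the paper (Llama-3.1-8b policy):
TRAIN_CONFIG = {
    "epochs": 5,
    "max_seq_len": 4096,
    "batch_size_tokens": 128 * 4096,          # effective batch, in tokens
    "learning_rate": 2e-5,
    "warmup_ratio": 0.03,                     # 3% linear warmup
    "optimizer_backend": "DeepSpeed ZeRO-3",  # on four A100 GPUs
}
```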

5. Empirical Performance and Analysis

NNetNav was evaluated on two major benchmarks:

  • WebArena (812 tasks): The zero-shot Llama-3.1-8b baseline achieved a 1% success rate; NNetNav fine-tuning raised this to 7.2%. By contrast, instruction-first SFT reached only 4.2%.
  • MiniWoB++ (8 tasks): Zero-shot mean reward of 28% rose to 48% after NNetNav fine-tuning, with no improvement from instruction-first training.

Further evaluations used a GPT-4-based fine-grained grader and confirmed consistent improvements across a range of subtasks. Ablations with "distilled" (instruction-only) variants underperformed, demonstrating the importance of the retroactive pipeline. A small-scale "self-training" experiment (NNetNav with Llama-3.1-8b as both explorer and policy) showed a 1%→5.3% WebArena gain, indicating robustness even with less powerful models (Murty et al., 2024).

| Benchmark | Zero-Shot Llama-3.1-8b | NNetNav-SFT | Instruction-First SFT |
|---|---|---|---|
| WebArena | 1% success | 7.2% | 4.2% |
| MiniWoB++ | 28% mean reward | 48% | 28% |

6. Limitations and Future Directions

NNetNav's demonstrated advantages include the hierarchical pruning heuristic, which curtails wasteful exploration, and a fully self-sufficient pipeline that leverages LLMs at all stages (exploration, labeling, verification, reasoning). The principal limitations are dependence on powerful LLMs for synthetic data generation and the requirement for domain-specific prompt engineering. Potential directions for extension include:

  • Integration with online reinforcement learning to incorporate reward signals directly from environment feedback.
  • Expansion to multimodal settings combining pixel and DOM-level signals.
  • Transfer of the hierarchical pruning and retro-labeling scheme to other sequential domains, including mobile app automation and API-driven workflows (Murty et al., 2024).