
WebExplorer-QA: Multimodal Web Navigation

Updated 10 September 2025
  • The paper presents WebExplorer-QA as a large-scale repository of multimodal web trajectories and QA pairs designed to benchmark long-horizon web agents.
  • It employs a two-stage automated pipeline that integrates model-based exploration with long-to-short query evolution to enhance query complexity and reasoning depth.
  • Empirical results demonstrate the dataset's cost efficiency and its effectiveness in training agents for realistic web navigation and multi-step decision making.

The WebExplorer-QA Dataset is a systematically synthesized, large-scale repository of multimodal web trajectories and challenging query–answer pairs intended for training and evaluating long-horizon web agents. Built through model-based exploration and iterative query evolution, it aims to provide a rigorous benchmark and pretraining ground for agents capable of complex, multi-step reasoning during authentic web navigation. The dataset unifies structured action sequences, screenshots, and associated metadata at scale, supporting both practical autonomous-agent development and empirical research into agentic LLM capabilities.

1. Systematic Data Synthesis Methodology

The construction of the WebExplorer-QA Dataset leverages a two-stage automated pipeline for high-quality data generation. In the first stage, LLMs initiate “exploration” from a seed entity by employing search and browse tool calls, accumulating a rich “information space” over many turns. This exploration is formalized as a trajectory:

$H_T = (\tau_0, \alpha_0, o_0, \tau_1, \alpha_1, o_1, \dots, \tau_T)$

where each $\tau_t$ is a thought, $\alpha_t$ an action, and $o_t$ an observation. Throughout the traversal, agents autonomously connect related facts from diverse websites without manual graph-design constraints.
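As a rough illustration, a trajectory $H_T$ can be represented as a simple data structure of (thought, action, observation) turns closed by a final thought. This is a minimal sketch; the class and field names are illustrative assumptions, not the dataset's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One (tau_t, alpha_t, o_t) turn in an exploration trajectory."""
    thought: str       # tau_t: the model's reasoning before acting
    action: str        # alpha_t: a search or browse tool call
    observation: str   # o_t: the content returned by the tool

@dataclass
class Trajectory:
    """H_T = (tau_0, alpha_0, o_0, ..., tau_T): T turns plus a final thought."""
    steps: list[Step] = field(default_factory=list)
    final_thought: str = ""

    def horizon(self) -> int:
        """Number of action turns T taken during exploration."""
        return len(self.steps)

# Example: a two-turn exploration starting from a seed entity.
traj = Trajectory(
    steps=[
        Step("Search for the seed entity.", "search('seed entity')", "result page"),
        Step("Open the top result.", "browse('https://example.org')", "page text"),
    ],
    final_thought="Enough context gathered to synthesize a QA pair.",
)
```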

The second stage employs a controlled “long-to-short” query evolution procedure. Starting from an initial QA pair synthesized from the information space, salient clues are obfuscated by abstraction: explicit terms (e.g., unique identifiers, direct quotes) are replaced with broader descriptions, and superficial details are systematically removed. The evolution iteration for queries is defined as:

$H^{k+1} = (H^k, \tau_1^{(k)}, \alpha_1^{(k)}, o_1^{(k)}, \dots, \tau_m^{(k)})$

for $k = 0, \ldots, K-1$. The answer remains invariant, but query difficulty increases as explicit entry points are eliminated, forcing models into deeper multi-step reasoning.
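The evolution loop can be sketched as follows, with a toy obfuscator standing in for the LLM rewrite step; all function and variable names here are illustrative assumptions, not the paper's implementation.

```python
def evolve_query(query: str, answer: str, rounds: int, obfuscate) -> list[str]:
    """Return the sequence [Q^0, Q^1, ..., Q^K] of progressively harder queries.

    `obfuscate(query, answer)` stands in for an LLM call that replaces
    explicit clues (identifiers, direct quotes) with broader descriptions;
    the answer is passed along only so the rewrite can preserve it.
    """
    history = [query]
    for _ in range(rounds):          # k = 0, ..., K-1
        query = obfuscate(query, answer)
        history.append(query)        # the answer stays invariant across rounds
    return history

# Toy obfuscator: remove the explicit entity name to force multi-step lookup.
toy = lambda q, a: q.replace("WebExplorer-8B", "an 8B-parameter web agent")
evolved = evolve_query("What benchmark does WebExplorer-8B lead?",
                       "BrowseComp-en", 1, toy)
```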

2. Dataset Composition and Characteristics

The dataset consists of over 94,000 successful multimodal web trajectories spanning 49,000 unique URLs, 720,000 screenshots, and 33 million distinct web elements (Pahuja et al., 17 Feb 2025; Liu et al., 8 Sep 2025). Each trajectory comprises, on average, 7.7 steps and approximately 46.3 web elements per screenshot, accumulating roughly 830 million tokens. The core data objects include:

  • Action-Observation History: Structured event logs capturing search, browse, and decision steps.
  • Screenshots: Pixel-level visual context for each step of navigation.
  • HTML/Accessibility Trees: Semantically rich representations complementing raw image data.
  • QA Pairs: Queries and answers synthesized from the explored information space, iteratively evolved to maximize reasoning complexity.

This architecture facilitates modeling of realistic agent interaction patterns and supports rigorous evaluation of sequential decision-making.
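A minimal sketch of how one dataset entry might bundle these four core objects; the field names and layout are assumptions for illustration, not the released schema.

```python
# Hypothetical record layout for one dataset entry; every key name below
# is illustrative, not the actual on-disk format.
record = {
    "trajectory_id": "traj-000001",
    "actions": [                      # action-observation history
        {"step": 0, "action": "search('seed entity')", "observation": "results"},
        {"step": 1, "action": "browse('https://example.org')", "observation": "page"},
    ],
    "screenshots": ["step_00.png", "step_01.png"],    # pixel-level context per step
    "axtree": ["root", "  link 'Example Domain'"],    # accessibility-tree lines
    "qa": {"query": "evolved, obfuscated query", "answer": "invariant answer"},
}

# One screenshot per action step keeps the visual and structured views aligned.
aligned = len(record["actions"]) == len(record["screenshots"])
```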

3. Training and Evaluation Paradigms

The dataset enables supervised fine-tuning followed by reinforcement learning for web agents such as WebExplorer-8B (Liu et al., 8 Sep 2025). Initial supervised fine-tuning employs approximately 13,000 samples to teach the model correct invocation of search/browse tools and multi-step decomposition in the ReAct style, using explicit `<tool_call>` and `<tool_response>` markers interleaved with reasoning. Subsequent reinforcement learning (using GRPO) on 12,000 samples incrementally increases context length (up to 128K tokens) and the tool-invocation budget (up to 100 actions).
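The interleaved turn format used as the SFT target can be sketched as below. The `<tool_call>`/`<tool_response>` tags follow the text; the helper function, its signature, and the surrounding serialization are illustrative assumptions.

```python
def format_turn(thought: str, tool: str, args: str, response: str) -> str:
    """Serialize one ReAct-style turn: free-form reasoning, then an explicit
    tool invocation and its result, each wrapped in the markers the text
    describes. (Helper name and layout are assumptions, not the paper's code.)
    """
    return (
        f"{thought}\n"
        f"<tool_call>{tool}({args})</tool_call>\n"
        f"<tool_response>{response}</tool_response>"
    )

turn = format_turn(
    "I should search for the entity first.",
    "search", "'seed entity'",
    "Top results: example.org, example.com",
)
```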

Empirical benchmarking demonstrates state-of-the-art performance at the 8B-parameter scale across BrowseComp-en/zh, WebWalkerQA, and FRAMES. Notably, WebExplorer-8B achieves 15.7% accuracy on BrowseComp-en after RL, 62.7% on WebWalkerQA, and 75.7% on FRAMES, substantially exceeding prior large-scale agentic models in decision efficiency and trajectory generalization.

4. Cost Efficiency and Accessibility

A distinguishing property of the WebExplorer-QA synthesis pipeline is its cost efficiency. The average cost per raw trajectory is $0.15; with a 53.1% success rate, this yields a per-successful-trajectory cost of approximately $0.28. The per-trajectory cost decomposes as:

$\text{Total Cost} = \$0.0128 \times 7.7 + \$0.02581 + \$0.02381 = \$0.148$

This pricing is favorable compared to previous efforts such as AgentTrek ($0.55/trajectory), significantly expanding data accessibility for the broader research community (Pahuja et al., 17 Feb 2025).
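The cost arithmetic can be reproduced directly; the variable names below are illustrative, while the figures are the reported ones.

```python
# Per-trajectory cost decomposition, using the reported figures.
cost_per_step = 0.0128   # LLM cost per exploration step (reported)
avg_steps = 7.7          # mean steps per trajectory (reported)
overhead_a = 0.02581     # fixed per-trajectory overheads (reported;
overhead_b = 0.02381     #  exact breakdown not specified in this section)

raw_cost = cost_per_step * avg_steps + overhead_a + overhead_b  # ~$0.148

# Only 53.1% of raw trajectories succeed, so amortize over successes.
success_rate = 0.531
cost_per_success = raw_cost / success_rate                      # ~$0.28
```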

5. Complexity, Diversity, and Task Design

Task complexity in the dataset arises from both the diversity of seed entities and the multi-turn reasoning requirements. Initial QA pairs are formulated to demand aggregation of facts across several pages and tool calls; after query evolution, the mean number of required actions grows from 7.9 to 9.9. Evolved queries exhibit greater obfuscation, reduced explicit clues, and increased necessity for chain-of-thought inference (Liu et al., 8 Sep 2025).

The dataset draws its seed entities from Wikipedia topics, ensuring wide domain coverage and reducing overfitting to narrow task distributions. The diversity of web environments, spanning high-traffic sites across many domains (sampled from the Tranco list and similarweb.com), ensures agents confront realistic, heterogeneous task types rather than synthetic or narrowly curated workflows.

6. Applications, Generalization, and Broader Impact

WebExplorer-QA forms the backbone for the supervised and RL training of agents capable of long-horizon web interaction, supporting up to 100 tool-calling turns and context lengths of up to 128K tokens. These agents outperform larger models such as WebSailor-72B on information-seeking benchmarks and demonstrate proficiency in knowledge-intensive domains, including STEM QA (17.3% accuracy on HLE), despite the domain mismatch with the training data.

The systematic, automated synthesis approach reduces the annotation bottleneck inherent in manual or rule-based data construction. This enables scalable creation of training corpora for agentic LLMs, supporting broader empirical research into long-horizon planning, interpretability, and tool-augmented decision-making (Liu et al., 8 Sep 2025).

A plausible implication is that parameter efficiency (demonstrated by the 8B model outperforming much larger agents) shifts research focus toward sophisticated training strategies and data design rather than brute-force scaling.

7. Future Research Directions

WebExplorer-QA’s synthesis pipeline suggests generative approaches to QA pair/trajectory construction can catalyze advances in long-horizon reasoning and planning. RL techniques that balance structural formatting (interleaved tool calls, thoughts, and responses) against answer accuracy may enable more robust, interpretable, and autonomous web agents. The foundation provided by multimodal trajectories, extensive obfuscation, and large context support paves the way for further work in scalable agent training, cross-domain generalization, and dynamic query complexity adjustment.

This dataset and methodology may serve as reference for future studies examining how iterative evolution, search/browse simulation, and challenging, obfuscated QA pairs promote agent advances in information retrieval, automation, and complex problem solving in real-world digital environments.