DeepWeb-Bench: Deep Research Benchmark

Updated 4 July 2026

DeepWeb-Bench is a research benchmark designed to evaluate agents' combined capabilities in massive evidence collection, source reconciliation, and long-horizon multi-step derivation.
It employs structured task matrices with 6,400 cells, strict provenance records, and explicit scoring rubrics to ensure auditable and defensible analytical conclusions.
The benchmark challenges models to move beyond simple retrieval, focusing on derivation discipline and accurate conflict handling in multi-source, quantitative research tasks.

DeepWeb-Bench is a deep research benchmark for evaluating agents that search the open web, collect evidence from multiple sources, reconcile conflicting information, and derive quantitative answers through extended multi-step reasoning. It was introduced to increase difficulty beyond prior deep research evaluations by making each task depend on three properties of the data itself: massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. The benchmark is organized as task matrices whose cells are scored independently, and every reference answer is paired with a source-provenance record and cross-source checks, which makes the evaluation comparatively auditable (Xie et al., 20 May 2026).

1. Benchmark objective and conceptual scope

DeepWeb-Bench is designed around a specific claim about the state of frontier deep research systems: prior benchmarks no longer provide enough headroom to distinguish current models, because frontier products already score strongly on earlier evaluations. Its response is not merely to enlarge the search space, but to require all tasks to combine retrieval, reconciliation, and derivation. In this formulation, the benchmark treats deep research as a compound activity rather than as a retrieval-only problem (Xie et al., 20 May 2026).

The benchmark explicitly decomposes this difficulty structure into four capability families: Retrieval, Derivation, Reasoning, and Calibration. Retrieval captures the evidence-collection baseline, including whether an agent can find the authoritative source and locate the relevant disclosed value. Derivation captures multi-step composition, such as chain derivation, cross-column comparison, and sum-of-the-parts decomposition. Reasoning captures quantitative modeling under assumptions, including scenario reasoning, forward extrapolation, and counterfactual or modeled outcomes. Calibration captures source reconciliation and abstention, including cross-source conflict resolution, hallucination resistance, and returning not available when evidence is missing.

The mapping between the benchmark’s three difficulty properties and its four capability families is explicit. Massive evidence collection corresponds to Retrieval. Cross-source reconciliation corresponds to Calibration. Long-horizon multi-step derivation is represented by Derivation and Reasoning together. A central design consequence is that the benchmark intentionally includes only a small share of Retrieval cells; most cells are non-retrieval. This suggests that the benchmark is structured to probe post-retrieval competence more aggressively than source discovery alone.

A common misunderstanding is to treat DeepWeb-Bench as a generalized browsing or navigation benchmark. Its emphasis is narrower and more analytical. It does not primarily evaluate UI navigation, browser automation, or single-fact lookup. Instead, it tests whether agents can transform web evidence into derived, numerically defensible conclusions under explicit provenance constraints.

2. Task matrix design and dataset structure

The released benchmark contains 100 tasks, each with 8 entities and 8 dimensions, yielding 64 cells per task and 6,400 total cells (Xie et al., 20 May 2026). Each task targets a single industry segment or analytical universe, and the domain distribution is as follows: Technology: 25, Energy/Materials: 20, Industrials/Transport: 18, Consumer: 16, Finance: 12, and Healthcare/Pharma: 9.

Every task uses the same fixed capability-family split across its eight dimensions: 1 Retrieval, 4 Derivation, 1 Calibration, and 2 Reasoning. This produces a benchmark-level allocation of 12.5% Retrieval, 50% Derivation, 12.5% Calibration, and 25% Reasoning. Because this split is invariant across tasks, score slices by family are directly comparable across domains without requiring post hoc normalization.

Structural element	Value
Tasks	100
Entities per task	8
Dimensions per task	8
Cells per task	64
Total cells	6,400
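
The matrix arithmetic and the fixed family split described above can be checked with a short sketch. The constants come from the text; the function names and the ordering of families within a task are illustrative assumptions, not part of the released code.

```python
from collections import Counter

# Fixed per-task family split across the eight dimensions, as stated in the text.
FAMILY_SPLIT = {"Retrieval": 1, "Derivation": 4, "Calibration": 1, "Reasoning": 2}

N_TASKS = 100
N_ENTITIES = 8
N_DIMENSIONS = sum(FAMILY_SPLIT.values())  # 8 dimensions per task


def family_of_each_dimension():
    """Expand the split into one family label per dimension (ordering is hypothetical)."""
    labels = []
    for family, count in FAMILY_SPLIT.items():
        labels.extend([family] * count)
    return labels


def cell_counts_by_family():
    """Benchmark-wide cells per family: tasks x entities x dimensions in that family."""
    per_dimension = Counter(family_of_each_dimension())
    return {fam: N_TASKS * N_ENTITIES * n for fam, n in per_dimension.items()}


counts = cell_counts_by_family()
total_cells = sum(counts.values())
shares = {fam: 100.0 * n / total_cells for fam, n in counts.items()}

print(total_cells)  # 6400
print(shares)       # {'Retrieval': 12.5, 'Derivation': 50.0, 'Calibration': 12.5, 'Reasoning': 25.0}
```

Because every task shares the same split, the per-family shares are identical at the task level and at the benchmark level.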

Entities are selected to be comparable within a domain while differing in their disclosure practices. This matters because some cells are meant to be answered as not available rather than forced into a guessed scalar. Appendix examples include AI accelerators, represented by NVIDIA, AMD, Intel, Google, Amazon, Qualcomm, Cerebras, and Groq, and the China NEV market, represented by BYD, Li Auto, XPeng, NIO, Leapmotor, Seres, GAC Aion, and Zeekr (Xie et al., 20 May 2026).

Each dimension contains a natural-language question and a metric specification fixing period, unit, mapping, and answer type. The benchmark permits three answer types per cell: precise value, range estimate, and not available. This design serves two purposes. First, it reduces ambiguity such as fiscal-year versus calendar-year interpretation. Second, it turns abstention into a first-class scoring target rather than an afterthought.
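
One way to picture a dimension and its metric specification is as a small typed record. The sketch below is an assumption about shape only: the class names, field names, and example values are hypothetical and are not taken from the released schema.

```python
from dataclasses import dataclass
from enum import Enum


class AnswerType(Enum):
    """The three answer types the benchmark permits per cell."""
    PRECISE = "precise value"
    RANGE = "range estimate"
    NOT_AVAILABLE = "not available"


@dataclass(frozen=True)
class MetricSpec:
    """Fixes how a question must be interpreted (field names are hypothetical)."""
    period: str        # e.g. fiscal year versus calendar year
    unit: str          # e.g. "%", "USD bn"
    mapping: str       # how the metric maps onto disclosed line items
    answer_type: AnswerType


@dataclass(frozen=True)
class Dimension:
    """A natural-language question paired with its metric specification."""
    question: str
    spec: MetricSpec


# Illustrative values only; they do not come from the benchmark data.
example = Dimension(
    question="What was the company's reported gross margin?",
    spec=MetricSpec(
        period="FY2024 (fiscal year)",
        unit="%",
        mapping="company-reported, consolidated",
        answer_type=AnswerType.PRECISE,
    ),
)
print(example.spec.answer_type.value)  # precise value
```

Making the answer type an explicit field is what lets abstention be scored directly: a grader can branch on it before comparing any numbers.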

3. Provenance records, disclosure levels, and auditability

A defining feature of DeepWeb-Bench is the source-provenance record attached to every reference answer (Xie et al., 20 May 2026). For each cell, the record contains the reference answer type, central value or range, unit, short derivation note, ordered source list, source-provenance level for each source, support verdict for each source, one-sentence justification for the level assignment, cross-source agreement label, and scoring rule or per-cell override.

Each source entry includes four fields: URL, provenance level, retrieved support verdict, and justification for the provenance level. The retrieved support verdict is one of yes, partial, no, or unverifiable. The record is described as a disclosure-provenance scheme: its levels reflect disclosure practice and provenance, not a learned or model-based trust score.

The four disclosure levels are hierarchical:

Level	Description
T1	Primary regulatory filings and official disclosures
T2	Methodology-published research and formal datasets
T3	Reputable media and sell-side research
T4	Informal or weakly verifiable sources

The examples attached to these levels are specific. T1 includes SEC 10-K / 20-F, Hong Kong Stock Exchange announcements, prospectuses, official regulator publications, final regulatory rules, and some press releases that disclose numbers subject to securities law. T2 includes IDC, Gartner, TrendForce, Canalys, and IEA. T3 includes Reuters, Bloomberg, Caixin, WSJ, and named analyst reports. T4 includes blogs, forums, unsigned commentary, and unverified social posts.
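
The record and level scheme described above can be sketched as plain data classes. The field list follows the text, but the class names, attribute names, and example URLs are hypothetical; the released record format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LEVELS = ("T1", "T2", "T3", "T4")  # ordered from primary filings to informal sources
VERDICTS = ("yes", "partial", "no", "unverifiable")


@dataclass
class SourceEntry:
    """One source in the ordered source list of a cell."""
    url: str
    level: str
    support_verdict: str
    level_justification: str

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown provenance level: {self.level}")
        if self.support_verdict not in VERDICTS:
            raise ValueError(f"unknown support verdict: {self.support_verdict}")


@dataclass
class ProvenanceRecord:
    """Per-cell record attached to a reference answer (attribute names are assumed)."""
    answer_type: str
    unit: str
    derivation_note: str
    agreement: str  # "consistent", "divergent", or "single"
    sources: List[SourceEntry] = field(default_factory=list)
    central_value: Optional[float] = None
    value_range: Optional[Tuple[float, float]] = None
    scoring_override: Optional[str] = None

    def strongest_level(self) -> Optional[str]:
        """Return the highest-ranked disclosure level among the listed sources."""
        if not self.sources:
            return None
        return min((s.level for s in self.sources), key=LEVELS.index)


# Placeholder URLs and values, for illustration only.
record = ProvenanceRecord(
    answer_type="precise value",
    unit="USD bn",
    derivation_note="Segment revenue read directly from the annual filing.",
    agreement="consistent",
    sources=[
        SourceEntry("https://example.com/annual-report", "T1", "yes",
                    "Primary regulatory filing."),
        SourceEntry("https://example.com/news-summary", "T3", "partial",
                    "Reputable media summary of the filing."),
    ],
    central_value=12.3,
)
print(record.strongest_level())  # T1
```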

Cross-source agreement is also codified. Each cell receives one of three agreement labels: consistent, divergent, or single. Consistent indicates that independent public sources agree within tolerance. Divergent indicates that public sources disagree, and that divergence is itself part of the task. Single indicates that only one independent public source exists. For divergent cells, the record includes a divergence note explaining why sources differ. This makes disagreement an evaluated phenomenon rather than a data-cleaning artifact.
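
As a rough illustration of how the three agreement labels partition cells, the sketch below assigns a label from the values reported by independent sources. The tolerance rule used here (relative spread around the midpoint, 10% by default) is an assumption; the text does not specify how "within tolerance" is computed for agreement labels.

```python
def agreement_label(values, rel_tol=0.10):
    """Label a cell from one numeric value per independent public source.

    rel_tol is a hypothetical threshold on (max - min) / |midpoint|.
    """
    if not values:
        raise ValueError("at least one source value is required")
    if len(values) == 1:
        return "single"
    lo, hi = min(values), max(values)
    midpoint = (lo + hi) / 2
    if midpoint == 0:
        # Avoid dividing by zero: only identical values count as agreement here.
        return "consistent" if lo == hi else "divergent"
    spread = (hi - lo) / abs(midpoint)
    return "consistent" if spread <= rel_tol else "divergent"


print(agreement_label([100.0]))          # single
print(agreement_label([100.0, 104.0]))   # consistent
print(agreement_label([100.0, 130.0]))   # divergent
```

A divergent label would additionally carry a divergence note in the record, which a numeric rule like this cannot produce on its own.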

This provenance apparatus underwrites the benchmark’s auditability claim. Scores are intended to be checked against explicit evidence trails, and some cells are constructed so that a correct response must either land within the reference range or explicitly flag divergence with a cited cause. A plausible implication is that the benchmark is as much about disciplined analytical reporting as about raw answer accuracy.

4. Scoring rules and evaluation protocol

DeepWeb-Bench scores each cell with one of four values:

$\{0,\;0.25,\;0.5,\;1\}$

The paper defines the task score as:

$S_{\mathrm{task}}(A)=\frac{1}{64}\sum_{i,j}\operatorname{score}(A_{ij},A^\star_{ij})$

where $A_{ij}$ is the candidate answer for cell $(i,j)$ and $A^\star_{ij}$ is the reference answer (Xie et al., 20 May 2026). The benchmark score is the mean of task scores over all tasks.
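
The two aggregation steps can be written out directly. This is a minimal sketch of the formula above, assuming per-cell scores have already been assigned; the function names are illustrative.

```python
ALLOWED_SCORES = {0.0, 0.25, 0.5, 1.0}


def task_score(cell_scores):
    """Mean of the 64 per-cell scores of one task (8 entities x 8 dimensions)."""
    flat = [s for row in cell_scores for s in row]
    if len(flat) != 64:
        raise ValueError("a task must have exactly 64 cells")
    if any(s not in ALLOWED_SCORES for s in flat):
        raise ValueError("cell scores must be one of 0, 0.25, 0.5, 1")
    return sum(flat) / 64


def benchmark_score(task_scores):
    """Mean of task scores over all tasks."""
    return sum(task_scores) / len(task_scores)


# Toy grid: one row fully correct, one row at half credit, six rows at zero.
grid = [[1.0] * 8, [0.5] * 8] + [[0.0] * 8 for _ in range(6)]
print(task_score(grid))                  # 0.1875
print(benchmark_score([0.1875, 0.3125])) # 0.25
```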

Default scoring depends on answer type. For precise value cells, full credit requires a scalar within tolerance with sign or direction agreement. The default tolerance is $\pm 10\%$ relative for ratios and monetary values and $\pm 2$ percentage points absolute for percentages. A score of 0.5 is assigned if the sign or direction is correct but the value is outside tolerance, or if the answer is within a relaxed tolerance of $2\times$ without a derivation. Otherwise the score is 0.
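
A minimal sketch of the default precise-value rule follows. It is one reading of the rubric, not the released grader: the function name and parameters are hypothetical, the relaxed $2\times$ tolerance is treated as earning 0.5 on its own, and the derivation condition attached to that clause is not modelled.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def score_precise(candidate, reference, is_percentage=False):
    """Score a precise-value cell under the default tolerances.

    Percentages use +/- 2 percentage points absolute; ratios and monetary
    values use +/- 10% relative to the reference.
    """
    tol = 2.0 if is_percentage else 0.10 * abs(reference)
    error = abs(candidate - reference)
    same_sign = sign(candidate) == sign(reference)

    if same_sign and error <= tol:
        return 1.0
    # Assumed reading: correct sign, or a value inside twice the tolerance,
    # earns partial credit.
    if same_sign or error <= 2 * tol:
        return 0.5
    return 0.0


print(score_precise(108.0, 100.0))                     # 1.0
print(score_precise(66.0, 100.0))                      # 0.5
print(score_precise(-50.0, 100.0))                     # 0.0
print(score_precise(13.5, 12.0, is_percentage=True))   # 1.0
```

The second call shows why a value roughly a third below the reference can still earn half credit: the direction is right even though the magnitude misses the tolerance band.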

For range cells, full credit requires that the candidate range overlap at least 80% of the reference range and that the derivation method be specified. A score of 0.5 is assigned if overlap is between 30% and 80%, or if the derivation method is missing. Otherwise the score is 0. For not available cells, full credit requires explicit not available with a one-sentence justification; 0.5 is assigned for a wide range with estimation method; any precise numeric value receives 0.
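
The range and not-available defaults can be sketched in the same style. Several details are assumptions: overlap is measured as the intersection length divided by the reference range length, a missing derivation method caps an otherwise full-credit range at 0.5, the width of a "wide" range is not checked, and an unjustified not available is scored 0 because the text does not say otherwise.

```python
def overlap_fraction(candidate, reference):
    """Share of the reference range covered by the candidate range."""
    c_lo, c_hi = candidate
    r_lo, r_hi = reference
    if r_hi <= r_lo:
        raise ValueError("reference range must have positive width")
    intersection = max(0.0, min(c_hi, r_hi) - max(c_lo, r_lo))
    return intersection / (r_hi - r_lo)


def score_range(candidate, reference, method_given):
    """Default rule for cells whose reference answer is a range."""
    frac = overlap_fraction(candidate, reference)
    if frac >= 0.8 and method_given:
        return 1.0
    if frac >= 0.3:
        return 0.5
    return 0.0


def score_not_available(kind, justified=False, method_given=False):
    """Default rule for cells whose reference answer is not available.

    kind is a hypothetical tag for the candidate: "not_available", "range",
    or "precise".
    """
    if kind == "not_available" and justified:
        return 1.0
    if kind == "range" and method_given:
        return 0.5
    return 0.0


print(score_range((10.0, 20.0), (10.0, 20.0), method_given=True))  # 1.0
print(score_range((15.0, 30.0), (10.0, 20.0), method_given=True))  # 0.5
print(score_range((19.0, 30.0), (10.0, 20.0), method_given=True))  # 0.0
print(score_not_available("not_available", justified=True))        # 1.0
print(score_not_available("range", method_given=True))             # 0.5
print(score_not_available("precise"))                              # 0.0
```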

Some cells override the defaults. Sum-of-the-parts cells require component splits to sum to $100\% \pm 2\%$. Cross-source conflict cells require explicit divergence handling. These overrides are significant because they turn certain analytical habits, such as balance checks and conflict disclosure, into formally scored behaviors rather than informal reviewer preferences.
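
The sum-of-the-parts override amounts to a balance check. The sketch below assumes the tolerance means the component shares must total between 98% and 102%; the function name is hypothetical.

```python
def components_balance(shares_pct, tol_pct=2.0):
    """Return True if component shares (in percent) sum to 100 within tol_pct."""
    return abs(sum(shares_pct) - 100.0) <= tol_pct


print(components_balance([45.0, 30.0, 24.0]))  # True  (sums to 99)
print(components_balance([45.0, 30.0, 20.0]))  # False (sums to 95)
```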

The model evaluation protocol uses nine frontier model configurations, all restricted to the benchmark’s own tools: web_search, page_visit, and pdf_fetch. Native browser and search tools in host systems were disabled. Each run was limited to 200 tool calls per task and 30 minutes wall-clock per task. A GPT-5.5 grader applied the per-cell rubric automatically, and validation on 200 stratified cells yielded Cohen’s $\kappa = 0.82$ with human annotation (Xie et al., 20 May 2026).
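
For readers unfamiliar with the agreement statistic, the sketch below computes unweighted Cohen's kappa between two label sequences. It uses toy labels, not the validation data, and it assumes the unweighted form; the text does not say whether a weighted variant was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy per-cell scores from an automatic grader and a human annotator.
grader = [1, 1, 0.5, 0, 0, 0.5, 1, 0]
human = [1, 1, 0.5, 0, 0.5, 0.5, 1, 0]
print(round(cohens_kappa(grader, human), 3))  # 0.814
```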

The public release includes the data, rubrics, evaluation code, source-provenance labels, and source records. It also includes full model outputs and source logs.

5. Empirical results and error patterns

The evaluation covered Codex CLI + GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, DeepSeek V4 Flash, GLM 5.1, Qwen 3.6 Plus, MiniMax M2.7, and Kimi K2.6 (Xie et al., 20 May 2026). The top score was 33.37% from Codex CLI + GPT-5.5. The strongest Claude Code-hosted model, Claude Opus 4.7, scored 31.84%. The weakest score was 16.79% from Kimi K2.6. The mean across models was 27.17%.

Rank	Model	Score (%)
1	Codex CLI + GPT-5.5	33.37
2	Claude Opus 4.7	31.84
3	DeepSeek V4 Pro	28.68
4	GLM 5.1	28.18
5	Claude Sonnet 4.6	27.97
6	DeepSeek V4 Flash	27.73
7	Qwen 3.6 Plus	26.54
8	MiniMax M2.7	24.06
9	Kimi K2.6	16.79

One of the paper’s central findings is that retrieval is not the bottleneck. Human-labeled failure analysis over 500 failing cells found that retrieval gaps accounted for only 12–14% of failures, while Derivation and Calibration failures together exceeded 70% (Xie et al., 20 May 2026). Aggregate family scores reinforce this interpretation: Retrieval: 32.83%, Derivation: 26.10%, Calibration: 27.73%, and Reasoning: 26.19%.

Failure mode	Top four models	Other five models
Hallucinated precision	22%	38%
Silent source choice	18%	14%
Incomplete derivation	31%	24%
Scope drift	15%	12%
Retrieval gap	14%	12%

The benchmark also reports a qualitative separation between strong and weak models. For the top four models, the dominant failure mode is incomplete derivation at 31%. These systems often retrieve the right intermediate values but combine them incorrectly. For the bottom five models, the dominant failure mode is hallucinated precision at 38%. These systems are more likely to produce a precise number when the correct response should be not available. The paper characterizes this as a qualitative phase transition, not merely a smooth degradation.

Model specialization is another reported finding. The paper quantifies it with the mean pairwise Spearman rank correlation across models and the maximum correlation reached by any single pair, and reports that per-case cross-model standard deviation reaches 18.8 percentage points on the most disagreed cases (Xie et al., 20 May 2026). The top two models together achieve the highest score on 79 of 100 cases, but hard cases still produce substantial spreads. The paper associates the hardest cases with domains that have non-standardized or conflicting disclosures, whereas easier cases tend to have abundant, uniform primary filings.
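
A pairwise Spearman analysis of this kind can be sketched in plain Python. The per-task scores below are invented toy values for three hypothetical models, and the assumption that correlations are taken over per-task scores is ours.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(scores_by_model):
    """Average Spearman correlation over all unordered model pairs."""
    pairs = list(combinations(scores_by_model.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Toy per-task scores (hypothetical models and values).
toy_scores = {
    "model_a": [0.40, 0.30, 0.20, 0.10],
    "model_b": [0.35, 0.25, 0.30, 0.05],
    "model_c": [0.10, 0.20, 0.30, 0.40],
}
print(round(spearman(toy_scores["model_a"], toy_scores["model_b"]), 3))  # 0.8
print(round(mean_pairwise_spearman(toy_scores), 3))                      # -0.333
```

A low mean correlation means models rank the tasks differently, which is the sense in which the text speaks of specialization.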

Three case studies illustrate the benchmark’s diagnostic intent. In a BYD per-vehicle gross profit cell, GPT-5.5 retrieves correct intermediate numbers but applies gross margin to total company revenue instead of segment revenue, producing a result about 34% below reference and scoring 0.5. In a Qualcomm Cloud AI 100 gross-margin cell, the correct answer is not available; Qwen 3.6 Plus outputs a precise number from a blog and scores 0, Claude Opus 4.7 answers explicit not available with justification and scores 1, and DeepSeek V4 Flash gives a range with method and scores 0.5. In an XPeng ASP change cell, MiniMax M2.7 retrieves two divergent authoritative sources but chooses one silently and commits to a single value, scoring 0. These examples show that derivation, calibration, and source-conflict handling are scored separately from source discovery.

6. Position within the benchmark landscape

DeepWeb-Bench places itself at the far end of the spectrum of web-based information tasks. In the paper’s positioning, benchmarks such as SimpleQA, GAIA, WebWalker, and Mind2Web emphasize retrieval, browsing, or navigation, whereas BrowseComp and BrowseComp-ZH require assembling answers from many pages but typically do not require deep derivation. It further distinguishes itself from recent deep research benchmarks by insisting that every task require large-scale evidence collection, resolving disagreement across sources, and compositional derivation (Xie et al., 20 May 2026).

This orientation differs sharply from several adjacent benchmarks. Deep Research Bench evaluates multi-step open-web research under a frozen-web environment called RetroSearch, with 89 task instances across 8 task types, and was built to control for web drift over time (FutureSearch et al., 6 May 2025). HERB, presented as “Benchmarking Deep Search over Heterogeneous Enterprise Data”, measures source-aware, multi-hop retrieval and reasoning over heterogeneous enterprise artifacts such as documents, meeting transcripts, Slack, GitHub pull requests, URLs, and metadata; its central finding is that retrieval is the main bottleneck in enterprise deep search (Choubey et al., 29 Jun 2025). DRBench evaluates complex, open-ended enterprise deep research tasks that combine public web sources with private organizational data across productivity software, cloud file systems, emails, and chats, and focuses on report production rather than cell-level quantitative derivation (Abaskohi et al., 30 Sep 2025).

The benchmark is also distinct from web-interaction and automation testbeds. MacroBench is a code-first benchmark in which models synthesize reusable Python + Selenium macros from natural-language goals and HTML/DOM context on seven self-hosted sites; it evaluates executable browser automation, workflow completion, and safe macro synthesis rather than open-web analytical derivation (Kim et al., 5 Oct 2025). WebDS measures end-to-end web-based data science across 870 tasks and 29 websites, combining web navigation, data acquisition, tool-assisted analysis, and downstream reporting or action; its focus is broader workflow execution rather than provenance-backed cell scoring (Hsu et al., 2 Aug 2025).

A common misconception is therefore to treat DeepWeb-Bench as a generic “web benchmark.” It is more accurately a benchmark for auditable deep research under quantitative and provenance constraints. Its distinctive contribution is not merely difficulty, but the coupling of difficulty with explicit disclosure levels, cross-source agreement labels, per-cell answer typing, and rubrics that reward abstention, derivation discipline, and conflict handling. This suggests a shift in benchmark design from measuring whether an agent can find evidence to measuring whether it can responsibly convert evidence into defensible analytical conclusions.