Wide and Deep Research Agent

Updated 4 July 2026

A Wide and Deep Research Agent is a system that integrates broad tool diversity (width) with rigorous multi-hop, iterative reasoning (depth) to tackle complex, long-horizon tasks.
Architectural patterns include hierarchical orchestration, explicit research-plan structuring, and domain-specialized decomposition to enhance both parallel tool use and sequential verification.
Empirical evaluations reveal that balancing extensive information gathering with deep reasoning remains challenging, driving innovations in scheduler design and training optimization.

A wide and deep research agent is an agentic system that couples broad information acquisition with rigorous, iterative reasoning over long-horizon tasks. In the recent literature, width is used in several closely related senses: broad support for external tools, modalities, and heterogeneous tasks; parallel exploration across subproblems or tool calls; and large-volume information collection over many entities and attributes. Depth denotes multi-step reasoning, multi-hop retrieval, verification, self-critique, and synthesis. One benchmark formalizes width as the total number of atomic information units, $W=M\times N$, and depth as the average retrieval trajectory length, $D=\frac{1}{M}\sum_{j=1}^{M}\ell_j$; another line of work defines depth as sequential tool-call depth and width as the number of parallel tool calls within a reasoning turn (Lan et al., 23 Oct 2025, Lin et al., 7 Feb 2026, Su et al., 26 Feb 2026). The paradigm has been instantiated for systems-code crash resolution, enterprise analytics, multilingual asset scouting, frontier scientific reasoning, and citation-grounded long-form reporting (Singh et al., 27 May 2025, Prabhakar et al., 20 Oct 2025, Vinogradova et al., 16 Feb 2026, Zheng et al., 2 May 2026).

1. Conceptual definition and formalization

The most explicit general definition appears in MiroFlow, which characterizes a wide and deep research agent as a system combining width and depth. Width is the ability to interface with a broad spectrum of external tools, modalities, and benchmarks while supporting heterogeneous tasks such as information retrieval, data analysis, planning, and future-event prediction. Depth is the ability to perform rigorous, multi-step reasoning, verification, and self-critique, including optional heavy-reasoning procedures that allocate extra compute or ensemble multiple LLM calls (Su et al., 26 Feb 2026). This definition treats width and depth as orthogonal but jointly necessary properties.

DeepWideSearch supplies a task-theoretic formalization. It models information seeking as filling an $M\times N$ table of entities and attributes, with width $W=M\times N$ and depth $D=\frac{1}{M}\sum_{j=1}^{M}\ell_j$, where $\ell_j$ is the retrieval trajectory length for entity $j$. On that benchmark, the average table volume is $\mathbb{E}[W]=414.10$ information units and the average reasoning depth is $\mathbb{E}[D]=4.21$ hops per entity, making the benchmark simultaneously broader than prior wide-only settings and substantially deeper than many large-scale collection tasks (Lan et al., 23 Oct 2025).
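As a small worked illustration of the two definitions above, the following sketch computes $W$ and $D$ for a toy table. The function name, the argument names, and the example numbers are illustrative assumptions and do not come from the benchmark's code.

```python
def width_and_depth(num_attributes: int, trajectory_lengths: list[int]) -> tuple[int, float]:
    """Return (W, D) for a table with one row per entity.

    num_attributes is N; trajectory_lengths[j] is l_j, the number of
    retrieval hops spent on entity j, so M = len(trajectory_lengths).
    """
    m = len(trajectory_lengths)
    if m == 0:
        raise ValueError("at least one entity is required")
    width = m * num_attributes               # W = M x N
    depth = sum(trajectory_lengths) / m      # D = (1/M) * sum_j l_j
    return width, depth


# Hypothetical toy task: 4 entities, 5 attributes each.
W, D = width_and_depth(num_attributes=5, trajectory_lengths=[3, 4, 5, 4])
print(W, D)
```

Here the toy table has 20 cells and an average of 4 hops per entity, which shows that the two quantities vary independently: adding attributes raises $W$ without changing $D$.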

A second formalization, introduced by W&D, treats width as intrinsic parallelism in tool use. In a standard sequential trace, an agent alternates reasoning and a single tool call across $T$ turns. In the parallel variant, each turn emits one reasoning step and a set of simultaneous tool calls. Under this view, depth is the sequential turn count and width is the per-turn tool-call multiplicity, directly tying the width–depth trade-off to latency, token cost, and search coverage (Lin et al., 7 Feb 2026).
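The parallel-trace view can be sketched with `asyncio`: turns run one after another (depth), while the tool calls inside a turn run concurrently (width). This is a minimal sketch under stated assumptions; `fake_tool`, `run_parallel_trace`, and the query strings are hypothetical stand-ins, not part of W&D.

```python
import asyncio


async def fake_tool(query: str) -> str:
    """Hypothetical stand-in for a real tool call such as a web search."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return f"result:{query}"


async def run_parallel_trace(turn_queries: list[list[str]]) -> dict:
    """Run turns sequentially; within a turn, run all tool calls concurrently."""
    observations = []
    for queries in turn_queries:  # one loop iteration per turn (depth)
        results = await asyncio.gather(*(fake_tool(q) for q in queries))  # width
        observations.append(list(results))
    return {
        "depth": len(turn_queries),
        "width_per_turn": [len(q) for q in turn_queries],
        "total_calls": sum(len(q) for q in turn_queries),
        "observations": observations,
    }


trace = asyncio.run(run_parallel_trace([["a", "b", "c"], ["d", "e"], ["f"]]))
print(trace["depth"], trace["width_per_turn"], trace["total_calls"])
```

A purely sequential agent would need six turns for the same six calls, so the example makes the latency side of the trade-off visible: the same coverage is reached in three turns.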

Taken together, these formalizations are significant because they separate three frequently conflated notions: large external tool coverage, large information volume, and large parallel search branching. A plausible implication is that “wide and deep” is not a single architecture class but a family of agent designs that realize breadth and depth through different computational mechanisms.

2. Architectural patterns

The dominant architectural motif is hierarchical orchestration. MiroFlow models execution as a directed graph in which each node encapsulates an LLM backbone, prompt schema, toolset, sub-agents, and I/O processors, and in which orchestration is performed by a control tier (Su et al., 26 Feb 2026). WideSeek likewise uses a two-tier hierarchy with a planner that maintains global state and forks parallel sub-agents through create_sub_agent(...), while DuMate-DeepResearch separates an outer Research Agent from inner Search Agents that each run their own bounded planning–execution loop (Huang et al., 2 Feb 2026, Yan et al., 5 Jun 2026).

A second motif is explicit research-plan structure. FlashResearch optimizes a tree-structured research plan under time and compute budgets, with planning nodes and research nodes executed through a global asynchronous task pool (Nie et al., 2 Oct 2025). Super Research uses a MECE DAG of Phases, Chapters, and Search Queries, followed by super-wide retrieval, super-deep investigation, evidence-graph construction, and report writing (Dong et al., 28 Feb 2026). DuMate-DeepResearch maintains a DAG-structured global plan with a ready frontier of executable nodes and allows reflection, re-planning, backtracking, and parallel branching (Yan et al., 5 Jun 2026).
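The idea of a ready frontier over a DAG-structured plan can be illustrated as follows. This is a generic sketch, not the DuMate-DeepResearch implementation: the plan representation (a mapping from node to prerequisite set), the function names, and the node labels are assumptions, and re-planning and backtracking are omitted.

```python
def ready_frontier(deps: dict[str, set[str]], done: set[str]) -> list[str]:
    """Return unfinished nodes whose prerequisites have all been completed."""
    return sorted(n for n, pre in deps.items() if n not in done and pre <= done)


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; nodes within one wave could run in parallel."""
    done: set[str] = set()
    waves: list[list[str]] = []
    while len(done) < len(deps):
        frontier = ready_frontier(deps, done)
        if not frontier:
            raise ValueError("plan has a cycle or an unknown prerequisite")
        waves.append(frontier)
        done.update(frontier)
    return waves


# Hypothetical plan: two independent branches feed a final report node.
plan = {
    "scope": set(),
    "market": {"scope"},
    "tech": {"scope"},
    "report": {"market", "tech"},
}
waves = execution_waves(plan)
print(waves)
```

Each wave corresponds to one opportunity for parallel branching (width), while the number of waves bounds the sequential length of the plan (depth).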

A third motif is domain-specialized decomposition. Code Researcher is a three-phase agent for large systems code: analysis through deep research over code semantics, patterns, and commit history; synthesis through filtered structured memory and patch generation; and validation through build-and-reproducer testing (Singh et al., 27 May 2025). Enterprise Deep Research uses a Master Planning Agent, four specialized search agents, an MCP-based tool ecosystem, a Visualization Agent, and a reflection mechanism for gap detection and steering (Prabhakar et al., 20 Oct 2025).

System	Structural pattern	Representative result
Code Researcher	Three-phase analysis–synthesis–validation over code and commit history	CRR = 58.0% on kBenchSyz at P@5 (Singh et al., 27 May 2025)
MiroFlow	Agent graph with optional heavy-reasoning mode and robust executor	GAIA-Test 71.3%; ensemble heavy mode 81.1% (Su et al., 26 Feb 2026)
WideSeek	Planner plus parallel sub-agents with end-to-end RL	Mean@4 Item-F1 19.73% for WideSeek-8B-SFT-RL (Huang et al., 2 Feb 2026)
DuMate-DeepResearch	Outer Research Agent plus recursive inner Search Agents and rubrics	58.03% on DeepResearch Bench; 61.95% on Bench II (Yan et al., 5 Jun 2026)

These architectures differ in control topology, but they converge on the same systems principle: width is usually implemented through decomposition, concurrency, or tool diversity, while depth is implemented through explicit iterative control, verification, or nested search loops.

3. Planning, memory, reasoning, and stopping criteria

Wide and deep agents generally externalize intermediate state rather than relying on an implicit monolithic context. In Code Researcher, the analysis phase iterates over the current memory, selects reasoning strategies such as control/data-flow tracing, anti-pattern detection, and causal commit analysis, issues actions like search_definition, search_code, and search_commits, and appends each action and its result to structured memory. For synthesis, entries are ranked by recency and semantic similarity to the crash report, and the top-$k$ entries are retained for patch generation (Singh et al., 27 May 2025).
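One way to picture the memory filtering step is a score that mixes recency with similarity to the crash report. The sketch below is an assumption throughout: the weighted-sum form, the `alpha` weight, and the token-overlap (Jaccard) similarity are swapped in for illustration and are not the scoring function used by Code Researcher.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    step: int      # when the action was taken (larger means more recent)
    action: str    # e.g. "search_code"
    result: str    # text returned by the action


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def top_k(memory: list[MemoryEntry], crash_report: str, k: int,
          alpha: float = 0.5) -> list[MemoryEntry]:
    """Keep the k entries with the highest mixed recency/similarity score."""
    if not memory:
        return []
    last = max(e.step for e in memory)

    def score(e: MemoryEntry) -> float:
        recency = e.step / last if last else 1.0
        return alpha * recency + (1 - alpha) * jaccard(e.result, crash_report)

    return sorted(memory, key=score, reverse=True)[:k]


report = "null pointer dereference in ext4 inode write"
memory = [
    MemoryEntry(1, "search_code", "null pointer dereference in ext4 inode write"),
    MemoryEntry(2, "search_commits", "refactor logging helpers"),
    MemoryEntry(3, "search_definition", "ext4 inode write path checks pointer"),
]
kept = top_k(memory, report, k=2)
print([e.action for e in kept])
```

In this toy run the unrelated commit is dropped even though it is more recent than the first entry, which is the behavior a similarity term is meant to provide.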

In Enterprise Deep Research, planning is made persistent and steerable through a versioned todo.md task plan. The Master Planning Agent decomposes the query into atomic tasks, schedules by priority, and updates the plan after each reflection cycle. Reflection examines knowledge gaps, task misalignment, and quality inconsistencies, then issues structured updates such as mark_completed, cancel_tasks, and add_tasks. Completion is tied to a formal Coverage Completeness Score, with research considered complete when the Coverage score reaches its threshold (Prabhakar et al., 20 Oct 2025).
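The plan-update cycle can be sketched as a small state object. Only the operation names mark_completed, cancel_tasks, and add_tasks come from the text; the `TodoPlan` class, the coverage formula (completed tasks divided by non-cancelled tasks), the version counter, and the threshold value used in the example are assumptions made for illustration.

```python
class TodoPlan:
    """Toy versioned task plan; each structured update bumps the version."""

    def __init__(self, tasks: list[str]) -> None:
        self.status = {task: "pending" for task in tasks}
        self.version = 1

    def mark_completed(self, tasks: list[str]) -> None:
        for task in tasks:
            self.status[task] = "completed"
        self.version += 1

    def cancel_tasks(self, tasks: list[str]) -> None:
        for task in tasks:
            self.status[task] = "cancelled"
        self.version += 1

    def add_tasks(self, tasks: list[str]) -> None:
        for task in tasks:
            self.status.setdefault(task, "pending")  # keep existing status
        self.version += 1

    def coverage(self) -> float:
        """Assumed formula: completed tasks over tasks that are not cancelled."""
        active = [s for s in self.status.values() if s != "cancelled"]
        if not active:
            return 1.0
        return sum(s == "completed" for s in active) / len(active)

    def is_complete(self, threshold: float) -> bool:
        return self.coverage() >= threshold


plan = TodoPlan(["market size", "competitors", "pricing"])
plan.mark_completed(["market size", "competitors"])
plan.cancel_tasks(["pricing"])
plan.add_tasks(["regulation"])          # reflection found a knowledge gap
before = plan.is_complete(threshold=0.9)  # 0.9 is an illustrative value
plan.mark_completed(["regulation"])
after = plan.is_complete(threshold=0.9)
print(plan.version, before, after)
```

The example shows why cancellation and addition both matter for a coverage-based stop rule: cancelling a task raises coverage, while a newly detected gap lowers it until the gap is closed.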

DuMate-DeepResearch makes reasoning criteria explicit through rubrics. A persistent rubric encodes topic-level criteria, while an ephemeral rubric identifies current evidence gaps and is injected into both the outer planner and inner search agents. The stopping predicate is adaptive: the loop halts when the ephemeral rubric reports zero gaps or when the iteration limit is reached (Yan et al., 5 Jun 2026). AgentCPM-Report introduces a related but writing-centered control loop, the Writing As Reasoning Policy, which alternates Evidence-Based Drafting with Reasoning-Driven Deepening by searching for evidence section by section, writing grounded paragraphs, and expanding the outline only where the draft still contains logical gaps (Li et al., 6 Feb 2026).
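The adaptive stopping predicate described above (halt on zero gaps or on the iteration limit) can be sketched as a short loop. The function names, the set-based rubric, and the toy search function that closes one gap per iteration are hypothetical and serve only to make the control flow concrete.

```python
from typing import Callable


def research_loop(find_gaps: Callable[[list[str]], list[str]],
                  search: Callable[[list[str]], list[str]],
                  max_iters: int) -> tuple[list[str], int, str]:
    """Search until the gap check reports nothing missing or the budget ends."""
    evidence: list[str] = []
    for iteration in range(1, max_iters + 1):
        gaps = find_gaps(evidence)         # ephemeral rubric: what is missing now
        evidence.extend(search(gaps))
        if not find_gaps(evidence):
            return evidence, iteration, "no_gaps"
    return evidence, max_iters, "iteration_limit"


REQUIRED = {"efficacy", "safety", "cost"}  # stands in for a persistent rubric


def toy_gaps(evidence: list[str]) -> list[str]:
    return sorted(REQUIRED - set(evidence))


def toy_search(gaps: list[str]) -> list[str]:
    return gaps[:1]                        # pretend each round closes one gap


done = research_loop(toy_gaps, toy_search, max_iters=5)
cut_off = research_loop(toy_gaps, toy_search, max_iters=2)
print(done[1:], cut_off[1:])
```

With a budget of five iterations the loop stops early once all three criteria are covered; with a budget of two it stops on the limit and leaves one gap open.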

FlashResearch adds runtime pruning and speculative execution. Breadth is chosen by a policy that determines how many child subqueries to open, depth by a policy that tests whether another planning layer is worth the expected utility gain, and orchestration by a policy that can prune low-value subtrees and immediately reassign freed compute threads (Nie et al., 2 Oct 2025). This suggests a general design pattern: wide-and-deep agents tend to replace a linear chain-of-thought with an explicit search state, an explicit revision mechanism, and an explicit stopping rule.
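The interplay of the three policies can be sketched as a recursive plan expansion in which each policy is a plain callable. This is a synchronous toy under stated assumptions: the policy signatures, the subquery naming scheme, and the example rules are invented for illustration, and the asynchronous task pool and speculative execution of FlashResearch are not modeled.

```python
from typing import Callable

Policy = Callable[[str, int], object]


def plan_tree(query: str, breadth_policy: Policy, depth_policy: Policy,
              prune_policy: Policy, depth: int = 0) -> dict:
    """Expand a research plan; each policy sees the query and its depth."""
    node = {"query": query, "children": []}
    if not depth_policy(query, depth):          # is another layer worth it?
        return node
    for i in range(breadth_policy(query, depth)):  # how many subqueries to open
        child_query = f"{query}/{i}"
        if prune_policy(child_query, depth + 1):   # drop low-value subtrees
            continue
        node["children"].append(
            plan_tree(child_query, breadth_policy, depth_policy,
                      prune_policy, depth + 1))
    return node


def count_nodes(node: dict) -> int:
    return 1 + sum(count_nodes(child) for child in node["children"])


# Hypothetical rules: 3 children at the root, 2 below; stop after 2 layers;
# prune every first child on the second layer.
tree = plan_tree(
    "root",
    breadth_policy=lambda q, d: 3 if d == 0 else 2,
    depth_policy=lambda q, d: d < 2,
    prune_policy=lambda q, d: d == 2 and q.endswith("/0"),
)
print(count_nodes(tree))
```

Separating the three decisions into independent callables mirrors the design pattern noted above: the search state is explicit, and both expansion and pruning are controlled by replaceable rules rather than by a single monolithic prompt.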

4. Training and inference-time optimization

Some wide-and-deep systems are primarily frameworks, while others are trained agent foundation models. MiroFlow is mainly an orchestration framework: it adds optional heavy-reasoning via an ensemble policy or a generator–verifier loop, and it hardens execution with message normalization, retry–fallback wrappers, and fault isolation (Su et al., 26 Feb 2026). Cognitive Kernel-Pro similarly emphasizes framework-level robustness through modular sub-agents, reflection under four rubrics—Non-Empty, Reasonable, Successful, Reliable—and majority-vote inference over repeated full-agent runs (Fang et al., 1 Aug 2025).
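Majority-vote inference over repeated runs reduces to counting normalized answers. The sketch below is a generic illustration, not Cognitive Kernel-Pro's code: the normalization (strip and lowercase) and the handling of empty answers are assumptions, with the empty-answer filter loosely echoing a Non-Empty style check.

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the most frequent non-empty answer after light normalization."""
    counts = Counter(a.strip().lower() for a in answers if a and a.strip())
    if not counts:
        return None                      # every run produced an empty answer
    return counts.most_common(1)[0][0]


# Hypothetical final answers from four independent full-agent runs.
final = majority_vote(["Paris", "paris ", "Lyon", ""])
print(final)
```

Voting of this kind trades extra inference cost for robustness, since a single failed or derailed run no longer determines the final answer.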

Other systems internalize wide-and-deep behavior through supervised and reinforcement learning. WideSeek linearizes planner and sub-agent trajectories into a unified sequence and optimizes end-to-end with Group Relative Policy Optimization (GRPO), using a reward that combines Item-F1 with a hallucination or format/tool-calling penalty (Huang et al., 2 Feb 2026). SciResearcher trains a main agent over tool-augmented scientific trajectories with a two-stage recipe: supervised fine-tuning on successful teacher trajectories followed by RL with GRPO, where the episodic reward is 1 for a correct final answer and 0 otherwise, and only the main agent is updated while sub-agents remain frozen tools (Zheng et al., 2 May 2026).
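The group-relative part of GRPO can be illustrated with the commonly used normalization: each trajectory's reward is compared with the mean and standard deviation of its sampling group. This is a minimal sketch of that standard formulation with a binary outcome reward; the `eps` constant and the use of the population standard deviation are assumptions, and the systems above may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of trajectories for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for one question: 1 = correct final answer, 0 = not.
advantages = group_relative_advantages([1, 0, 0, 1])
uninformative = group_relative_advantages([0, 0, 0, 0])
print(advantages, uninformative)
```

The second call shows a practical consequence of binary episodic rewards: when every trajectory in a group fails (or every one succeeds), all advantages are zero and the group contributes no learning signal.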

Fathom-DeepResearch focuses on search-policy shaping. Its Fathom-Search-4B model is trained first with GRPO and then with RAPO, which adds curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers to stabilize long-horizon multi-turn RL. It further introduces a step-level reward that classifies tool calls by cognitive behavior and marginal utility, explicitly exposing breadth, verification depth, and total horizon as steerable quantities through counts of unique search, redundant search, exploration, verification, and redundant query actions (Singh et al., 28 Sep 2025). AgentCPM-Report adopts a three-stage curriculum—cold-start SFT, atomic-skill RL, and holistic pipeline RL—over the five actions Initialize, Search, Write, Expand, and Terminate, aligning outline revision, retrieval, drafting, and stopping with report-level quality objectives (Li et al., 6 Feb 2026).

S1-DeepResearch broadens the training problem itself. Rather than focusing only on search-centric QA, it constructs graph-grounded trajectories spanning closed-ended QA and open-ended exploration, then verifies them with capability-specific verifiers for reasoning, citations, constraints, file outputs, and skill usage. A plausible implication is that the training distribution, not only the agent controller, determines whether an agent learns to be “deep research” rather than merely “deep search” (Dong et al., 13 Jun 2026).

5. Evaluation ecosystem and empirical profile

The evaluation landscape now distinguishes wide-and-deep capability from ordinary browsing or QA. DeepWideSearch is the first benchmark explicitly designed to stress both axes jointly. It contains 220 tasks across 15 domains, created by converting deep-search datasets into wider table-filling tasks and wide-search datasets into deeper multi-hop tasks. Despite strong underlying models, the best reported system achieves only Success Rate Avg@4 = 2.39% and Pass@4 = 3.64%, with Column-F1 Avg@4 = 42.01% and Core Entity Accuracy (CE Acc.) Avg@4 = 70.91%, showing that achieving depth and width simultaneously remains difficult (Lan et al., 23 Oct 2025).
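The difference between Avg@4 and Pass@4 is easiest to see in a small computation. The sketch assumes the usual reading of these metrics (Avg@k averages success over k runs per task, Pass@k counts a task as solved if any of its k runs succeeds); the benchmark's exact aggregation may differ, and the outcome data below are invented.

```python
def avg_at_k(outcomes: list[list[bool]]) -> float:
    """Mean per-task success rate over the k runs of each task."""
    return sum(sum(runs) / len(runs) for runs in outcomes) / len(outcomes)


def pass_at_k(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks solved by at least one of the k runs."""
    return sum(any(runs) for runs in outcomes) / len(outcomes)


# Hypothetical results: 4 tasks, 4 runs each.
outcomes = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
avg4 = avg_at_k(outcomes)
pass4 = pass_at_k(outcomes)
print(avg4, pass4)
```

Pass@k is never lower than Avg@k, so a small gap between the two, as in the figures reported above, indicates that repeated sampling rarely rescues a task the system usually fails.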

WideSeekBench targets General Broad Information Seeking rather than long-form report synthesis. It contains 5,156 tasks, with 4,436 for training and 720 for testing, balanced across 18 domains, ten information-volume bins ranging from 4 to 4096 cells, and seven constraint-composition types. On this benchmark, the full WideSeek-8B-SFT-RL system reaches Mean@4 Item-F1 19.73%, above both the base model and partial training variants (Huang et al., 2 Feb 2026).
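An item-level F1 score over table cells can be sketched as set overlap between predicted and gold cells. This is a simplified illustration: representing a cell as an (entity, attribute, value) triple and using exact matching are assumptions, and the benchmark's own matching rules are not reproduced here.

```python
Cell = tuple[str, str, str]  # (entity, attribute, value)


def item_f1(predicted: set[Cell], gold: set[Cell]) -> float:
    """Harmonic mean of cell-level precision and recall under exact matching."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 2 x 2 gold table and a partial, partly wrong prediction.
gold = {
    ("A", "year", "2001"), ("A", "city", "Oslo"),
    ("B", "year", "1999"), ("B", "city", "Rome"),
}
predicted = {
    ("A", "year", "2001"), ("B", "city", "Rome"), ("B", "year", "2000"),
}
score = item_f1(predicted, gold)
print(round(score, 4))
```

Because recall is computed against every gold cell, the score penalizes both missing rows and wrong values, which is why it becomes harder to keep high as the information volume grows.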

Benchmarks for long-form report generation emphasize citation-grounded synthesis. LiveResearchBench contains 100 expert-curated tasks spanning daily life, enterprise, and academia, and evaluates reports with DeepEval across Presentation & Organization, Factual & Logical Consistency, Coverage, Analysis Depth, Citation Association, and Citation Accuracy (Wang et al., 16 Oct 2025). Super Research is explicitly a ceiling-level stress test: it contains 300 expert-written questions over 10 domains, with tasks requiring roughly 100–140 retrieval steps and approximately 600–1,200 pages of evidence, and audits reports using graph-anchored metrics for Coverage, Logical Consistency, Report Utility, Objectivity, and Citation Health (Dong et al., 28 Feb 2026).

Within this ecosystem, several systems report strong results on their target settings. MiroFlow reports state-of-the-art performance across GAIA, BrowseComp-EN/ZH, HLE, xBench-DeepSearch, and FutureX, including 71.3% on GAIA-Test with GPT-5 and 81.1% under ensemble heavy mode (Su et al., 26 Feb 2026). DuMate-DeepResearch reports the best overall score on DeepResearch Bench at 58.03% and on DeepResearch Bench II at 61.95%, ranking first in information recall and analysis on the latter (Yan et al., 5 Jun 2026). These scores, however, are not directly interchangeable across benchmarks, because the underlying tasks range from table completion to scientific reasoning to user-centric long-form report generation.

6. Domains, failure modes, and open research questions

Wide-and-deep agents have become domain-specific research systems rather than a single generic web-browsing pattern. In systems code, Code Researcher targets crash mitigation in large codebases and commit histories. On 200 reproducible kBenchSyz Linux-kernel crashes at P@5, it attains CRR = 58.0%, compared with 31.5% for SWE-agent, while reading 29.13 unique files per crash on average versus 1.91 for SWE-agent; removing search_commits reduces CRR from 48.0% to 38.0% and buggy-file recall from 0.51 to 0.33, underscoring the value of historical context (Singh et al., 27 May 2025).

In enterprise analytics, Enterprise Deep Research combines public and enterprise-private sources through specialized search agents, MCP-based tools, reflection, and live steering. It reports an overall score of 49.86 on DeepResearch Bench and, on DeepConsult, a win rate of 71.57%, a tie rate of 19.12%, a loss rate of 9.31%, and an average score of 6.82, alongside enterprise deployment figures such as SQL generation accuracy above 95% and 99.9% uptime (Prabhakar et al., 20 Oct 2025). In drug asset scouting, the Bioptic Agent treats search as a self-learning directive tree with multilingual investigator agents, a validator, a deduplication agent, and a coach agent; on its completeness benchmark it achieves 79.7% F1, outperforming Claude Opus 4.6, Gemini 3 Pro + Deep Research, OpenAI GPT-5.2 Pro, Perplexity Deep Research, and Exa Websets (Vinogradova et al., 16 Feb 2026). In frontier science, SciResearcher-8B reaches 19.46% on HLE-Bio/Chem-Gold, with 13–15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature (Zheng et al., 2 May 2026).

The main misconception addressed by this literature is that deep research can be reduced to retrieval scale alone. DeepWideSearch shows that broad collection without sufficient reflection, retrieval completeness, and context management produces failure modes such as lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow (Lan et al., 23 Oct 2025). LiveResearchBench similarly reports recurring citation-discipline failures, context loss under scale, shallow synthesis, and formatting errors, and concludes that many systems function as “deep searchers” rather than “deep researchers” (Wang et al., 16 Oct 2025). S1-DeepResearch makes the same point from the training side, arguing that search-centric datasets under-cover evidence integration, knowledge synthesis, planning, file understanding, and structured report generation (Dong et al., 13 Jun 2026).

An open question concerns how to allocate width and depth. W&D finds that scaling width through parallel tool calling can improve BrowseComp accuracy while reducing turns, latency, and cost, and that a hand-tuned Descending scheduler outperforms the Automatic scheduler; FlashResearch likewise shows that adaptive breadth/depth planning and real-time pruning can deliver up to a 5x speedup while maintaining comparable quality (Lin et al., 7 Feb 2026, Nie et al., 2 Oct 2025). This suggests that optimal wide-and-deep behavior depends not only on model capability, but also on scheduler design, memory control, and the criteria used for pruning, stopping, and verification.
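A width schedule can be sketched as a function from turn index to the number of parallel tool calls. The text does not describe how the Descending scheduler is defined, so the sketch below is only one reading of the name: the linear decay from a wide first turn to a narrow last turn, and every parameter shown, are assumptions.

```python
def descending_widths(num_turns: int, start_width: int, end_width: int = 1) -> list[int]:
    """Per-turn tool-call widths that shrink linearly from start to end."""
    if num_turns < 1 or end_width < 1 or start_width < end_width:
        raise ValueError("need num_turns >= 1 and start_width >= end_width >= 1")
    if num_turns == 1:
        return [start_width]
    step = (start_width - end_width) / (num_turns - 1)
    return [round(start_width - step * turn) for turn in range(num_turns)]


# Hypothetical budget: 4 turns, starting with 4 parallel calls.
widths = descending_widths(num_turns=4, start_width=4)
total_calls = sum(widths)
print(widths, total_calls)
```

The intuition behind such a schedule is to explore broadly while little is known and to narrow toward targeted verification later; whether that matches the scheduler evaluated in W&D cannot be confirmed from the text.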

Taken together, the literature defines the wide and deep research agent as a research-oriented control regime rather than a single model family: wide in its acquisition of sources, tools, branches, or sub-agents; deep in its use of iterative reasoning, verification, and synthesis; and increasingly evaluated by benchmarks that require both properties simultaneously.