Deep Research Failure Taxonomy (DEFT)

Updated 4 December 2025
  • The paper presents DEFT, a rigorously developed taxonomy classifying DRA failures across reasoning, retrieval, and generative stages using grounded theory and analysis of ∼1,000 logs.
  • DEFT is defined by three core dimensions and 14 axial modes, with precise metrics and inter-annotator reliability scores that standardize error assessment.
  • The taxonomy offers actionable insights for DRA improvement, emphasizing adaptive planning, integrated retrieval controls, and enhanced generative verifications.

The Deep rEsearch Failure Taxonomy (DEFT) provides a rigorous, grounded-theory–based classification of the failure modes encountered by Deep Research Agents (DRAs) in the automated production of analyst-level scientific reports. Developed via collaborative human–LLM annotation spanning ∼1,000 execution logs and nine contemporary DRAs, DEFT delivers the first systematic taxonomy detailing the kinds, frequencies, and causal roots of errors arising from reasoning, retrieval, and generative processes in research-agent workflows. Distinct from conventional deep learning (DL) system fault taxonomies, DEFT targets failures unique to multi-stage, evidence-intensive research tasks, standardizing assessment and underpinning future DRA architecture and evaluation research (Zhang et al., 1 Dec 2025).

1. Construction Methodology and Theoretical Foundations

DEFT was constructed through a multistage, mixed human–LLM codification pipeline anchored in grounded theory. The initial data comprised ∼1,000 DRA execution logs in both English and Chinese. Five LLMs (Claude Opus 4.1, Gemini 2.5 Pro, Grok 4, DeepSeek-V3.1, Qwen3-Max-Preview) independently generated failure reports and labeled conceptual failure candidates.

The coding process proceeded as follows:

  • Open Coding: Execution logs were partitioned into groups; each group received two rounds of LLM-driven concept extraction, with seed concepts from prior literature. Near-duplicate concepts were merged at a cosine-similarity threshold $\theta_{\text{sim}} = 0.6$, and codes with frequencies below a threshold $\tau_{\text{freq}}$ were pruned, yielding a codebook $C = \{(c_i, d_i) \mid i = 1, \dots, N\}$ of $N = 51$ discrete failure concepts (a deduplication sketch follows this list).
  • Axial Coding: These 51 codes were grouped into coherent “axial” categories through deductive/inductive analysis of semantics, causality, context, and protocol steps. Three coding rounds were conducted, with 24–54 records per round and triple annotation by human experts. Inter-annotator reliability was quantified by Krippendorff’s alpha, $\alpha = 1 - D_o/D_e$, with results between 0.74 and 0.92 per core category (see Section 3).
  • Selective Coding: The 14 resultant axial categories were distilled into three core dimensions (Reasoning, Retrieval, Generation) via theoretical saturation (confirmed by annotating 36 records from two held-out systems).
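
The deduplication and pruning steps of open coding are straightforward to express in code. The sketch below is a minimal illustration, not the paper's implementation: `embed` stands in for any sentence-embedding model, and the default `tau_freq` value is an assumption (the paper applies a frequency threshold but its value is not reported here).

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_near_duplicates(concepts, embed, theta_sim=0.6):
    """Greedily fold concepts into clusters whose representative embedding
    matches at cosine similarity >= theta_sim; returns (concept, count) pairs."""
    merged = []  # (representative concept, its embedding, frequency)
    for concept in concepts:
        vec = embed(concept)  # placeholder for any sentence-embedding model
        for i, (rep, rep_vec, count) in enumerate(merged):
            if cosine(vec, rep_vec) >= theta_sim:
                merged[i] = (rep, rep_vec, count + 1)
                break
        else:  # no cluster matched: start a new one
            merged.append((concept, vec, 1))
    return [(rep, count) for rep, _, count in merged]

def prune_low_frequency(codebook, tau_freq=2):
    """Drop codes below the frequency threshold (tau_freq value assumed)."""
    return [(code, count) for code, count in codebook if count >= tau_freq]
```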

For success scoring, a monotonic mapping from per-category error counts to a positive score was defined:

$$S_i = |D| \cdot \cos\!\left( \frac{E_i}{|D|} \cdot \frac{\pi}{2} \right)$$

where $|D|$ is the total number of evaluated instances and $E_i$ the count of errors in category $i$. This formula accentuates differences at moderate-to-high error rates and ensures interpretable evaluation outcomes (Zhang et al., 1 Dec 2025).
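
In code, the success mapping is a direct transcription of the formula:

```python
import math

def success_score(total_instances: int, errors_in_category: int) -> float:
    """S_i = |D| * cos((E_i / |D|) * (pi / 2)).

    Returns |D| when a category exhibits no errors and decays to 0 when
    every instance exhibits the error; the cosine shape penalizes
    moderate-to-high error rates more steeply than low ones.
    """
    ratio = errors_in_category / total_instances
    return total_instances * math.cos(ratio * math.pi / 2)

# Example: 1,000 evaluated instances with 190 errors in one category
# yields roughly 1000 * cos(0.095 * pi) ≈ 956.
```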

2. Full Taxonomy: Core and Axial Failure Modes

DEFT delineates failure across three Level 1 (core) dimensions, each with fine-grained Level 2 (axial) categories. The taxonomy, full definitions, representative examples, and observed prevalence by error proportion are summarized below.

| Level 1 (Core) | Level 2 (Axial) | Proportion (%) |
|---|---|---|
| Reasoning | Failure to Understand Requirements (FUR) | 10.55 |
| Reasoning | Lack of Analytical Depth (LAD) | 11.09 |
| Reasoning | Limited Analytical Scope (LAS) | 0.90 |
| Reasoning | Rigid Planning Strategy (RPS) | 5.60 |
| Retrieval | Insufficient External Information Acquisition (IIA) | 16.30 |
| Retrieval | Information Handling Deficiency (IHD) | 2.26 |
| Retrieval | Information Integration Failure (IIF) | 2.91 |
| Retrieval | Information Representation Misalignment (IRM) | 2.91 |
| Retrieval | Verification Mechanism Failure (VMF) | 8.72 |
| Generation | Redundant Content Piling (RCP) | 2.51 |
| Generation | Structural Organization Dysfunction (SOD) | 2.26 |
| Generation | Content Specification Deviation (CSD) | 10.73 |
| Generation | Deficient Analytical Rigor (DAR) | 4.31 |
| Generation | Strategic Content Fabrication (SCF) | 18.95 |

Definitions and examples are as follows (a machine-readable encoding of the taxonomy is sketched after this list):

  • Reasoning
    • Failure to Understand Requirements: Misaligning output with user intent, e.g., substituting methodology for substantive results.
    • Lack of Analytical Depth: Producing superficial or list-like analysis that omits structural/conceptual detail.
    • Limited Analytical Scope: Addressing only one aspect of multi-dimensional tasks.
    • Rigid Planning Strategy: Sustaining an initial approach despite observed infeasibility or feedback.
  • Retrieval
    • Insufficient External Information Acquisition: Overusing internal knowledge, neglecting updated or field-specific sources.
    • Information Handling Deficiency: Failing to properly extract, prioritize, or reuse retrieved information.
    • Information Integration Failure: Contradictions/omissions in combining multiple retrieved inputs.
    • Information Representation Misalignment: Inadequate differentiation between high- and low-authority sources.
    • Verification Mechanism Failure: Neglecting cross-checks, omitting peer validation steps.
  • Generation
    • Redundant Content Piling: Verbose repetition, undermining report clarity.
    • Structural Organization Dysfunction: Fragmented or non-cohesive report structure.
    • Content Specification Deviation: Deliverable misalignment (e.g., narrative text instead of required tables).
    • Deficient Analytical Rigor: Overconfident or unqualified assertions, lacking uncertainty analysis.
    • Strategic Content Fabrication: Invention of plausible yet unsupported data or narratives—the single highest-frequency failure mode.
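
For annotation or diagnostic tooling, the two-level taxonomy maps naturally onto an enumeration. The sketch below is our own encoding, using the abbreviations from Table 1; the identifiers are illustrative, not from the paper.

```python
from enum import Enum

class CoreDimension(Enum):
    REASONING = "Reasoning"
    RETRIEVAL = "Retrieval"
    GENERATION = "Generation"

class AxialMode(Enum):
    # Reasoning
    FUR = ("Failure to Understand Requirements", CoreDimension.REASONING)
    LAD = ("Lack of Analytical Depth", CoreDimension.REASONING)
    LAS = ("Limited Analytical Scope", CoreDimension.REASONING)
    RPS = ("Rigid Planning Strategy", CoreDimension.REASONING)
    # Retrieval
    IIA = ("Insufficient External Information Acquisition", CoreDimension.RETRIEVAL)
    IHD = ("Information Handling Deficiency", CoreDimension.RETRIEVAL)
    IIF = ("Information Integration Failure", CoreDimension.RETRIEVAL)
    IRM = ("Information Representation Misalignment", CoreDimension.RETRIEVAL)
    VMF = ("Verification Mechanism Failure", CoreDimension.RETRIEVAL)
    # Generation
    RCP = ("Redundant Content Piling", CoreDimension.GENERATION)
    SOD = ("Structural Organization Dysfunction", CoreDimension.GENERATION)
    CSD = ("Content Specification Deviation", CoreDimension.GENERATION)
    DAR = ("Deficient Analytical Rigor", CoreDimension.GENERATION)
    SCF = ("Strategic Content Fabrication", CoreDimension.GENERATION)

    def __init__(self, label: str, dimension: CoreDimension):
        self.label = label
        self.dimension = dimension
```

For example, `AxialMode.SCF.dimension` resolves to `CoreDimension.GENERATION`, so per-dimension aggregates fall out of a single pass over labeled records.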

3. Statistical Metrics, Inter-Annotator Reliability, and Visualization

DEFT incorporates quantitative grounding via error proportions and annotated reliability:

  • Codebook at Level 2 finalized with 14 categories via three coding rounds.
  • Key inter-annotator reliability metrics (Krippendorff’s alpha) were (a reference implementation follows this list):
    • Reasoning: 0.74–0.80
    • Retrieval: 0.76–0.90
    • Generation: 0.90–0.92
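
The paper does not publish its reliability computation; the following is a compact reference implementation of the nominal-data form of Krippendorff's alpha, following the standard coincidence-matrix formulation of $\alpha = 1 - D_o/D_e$.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data: alpha = 1 - D_o / D_e.

    `units` is a list of lists; each inner list holds the labels the
    annotators assigned to one record (missing labels simply omitted).
    """
    # Build the coincidence matrix from all units with >= 2 labels.
    coincidence = Counter()
    n = 0
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # unpairable unit: contributes nothing
        n += m
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    d_o = sum(v for (a, b), v in coincidence.items() if a != b) / n

    # Expected disagreement from the label marginals.
    marginals = Counter()
    for (a, _b), v in coincidence.items():
        marginals[a] += v
    d_e = sum(marginals[a] * marginals[b]
              for a in marginals for b in marginals if a != b) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # perfect agreement on a single label
    return 1.0 - d_o / d_e

# Three annotators labeling records with axial-mode codes, e.g.:
# krippendorff_alpha_nominal([["SCF", "SCF", "SCF"], ["IIA", "IIA", "VMF"]])
```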

Core summary diagrams include:

  • Figure 1 (pipeline overview: open coding → optimization → axial coding → selective coding)
  • Figure 2 (taxonomy: three core categories, 14 axial modes)
  • Figure 3 (failure correlation matrix, grouping modes by Process Integrity, Content Integration, Evidentiary Rigor)
  • Table 1 (taxonomy, shown in Section 2)
  • Table 2 (detailed ICR scores) (Zhang et al., 1 Dec 2025)

4. Comparative Context: DEFT and Other Taxonomies

While DEFT uniquely targets the integrated, multi-stage report generation process of DRAs, its structure reflects conceptual lineage from previous taxonomies:

  • The “Taxonomy of Real Faults in Deep Learning Systems” groups failures at the level of model construction, tensor/input specification, training process, GPU usage, and API usage. It catalogs 92 unique leaf-type failures from 1,059 real-world artifacts, with prevalence ranging from hyperparameter tuning (86%) to model-type selection and data-pipeline errors (Humbatova et al., 2019).
  • Prior software engineering literature addresses field failures (e.g., irreproducible execution conditions, combinatorial explosion, or unknown-application/environmental conditions) but does not disaggregate the epistemic failures central to research agents (Gazzola et al., 2017).

DEFT differentiates itself by its fine granularity, multi-agent/multilingual scope, and attention to research-oriented failure, especially the high prevalence and diversity of fabrication and generative errors not anticipated in traditional DL system taxonomies.

5. Implications and Future Research Directions

Analysis using DEFT led to three principal insights with architectural implications for future DRAs:

  • Reasoning Resilience vs. Intensity: Current agents comprehend task instructions but lack the capacity to revise plans in light of retrieval/feasibility feedback. Future architectures should prioritize reasoning resilience: adaptive planning contingent on intermediate results and environmental signals.
  • Closed-Loop Retrieval: Retrieval failures are often not query construction problems but relate to the integrated management, evaluation, and representation of external evidence. Integrated retrieval controllers should be developed, tracking information states throughout the entire retrieval–synthesis loop and mandating cross-validation prior to generation.
  • Generative Constraints and Verification: The highest aggregate failure rates occur in the generative phase, most notably through specification deviation and strategic content fabrication. Embedding both pre-constraints (style/format/sanity checks) and post-verification (factual reference cross-checks, redundancy suppression) directly within synthesis pipelines is essential (see the sketch below).
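
As a purely illustrative sketch of this last point (the check functions and their signatures are hypothetical, not the paper's architecture), a synthesis pipeline could wrap its generator in explicit pre-constraint and post-verification gates:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerificationGate:
    """Illustrative pre/post gate around a report generator.

    `pre_checks` validate the synthesis spec (format, style, section plan);
    `post_checks` validate the draft against retrieved evidence (claim
    support, redundancy). All check functions here are hypothetical and
    return a list of human-readable issue descriptions (empty = pass).
    """
    pre_checks: list[Callable[[dict], list[str]]] = field(default_factory=list)
    post_checks: list[Callable[[str, list[str]], list[str]]] = field(default_factory=list)

    def run(self, spec: dict, evidence: list[str],
            generate: Callable[[dict], str]) -> str:
        # Pre-constraints: refuse to generate from an invalid spec.
        issues = [msg for check in self.pre_checks for msg in check(spec)]
        if issues:
            raise ValueError(f"pre-constraint violations: {issues}")
        draft = generate(spec)
        # Post-verification: block emission of an unverified draft;
        # a real pipeline might route back to retrieval or regeneration.
        issues = [msg for check in self.post_checks for msg in check(draft, evidence)]
        if issues:
            raise ValueError(f"post-verification failures: {issues}")
        return draft
```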

By mapping error frequencies and their correlations, DEFT enables systematic targeting of intervention points and diagnostic instrumentation. Its reproducible annotation workflow and formal success metric support robust benchmarking for both DRA development and comparative research (Zhang et al., 1 Dec 2025).

6. Significance for DRA Development and Evaluation

DEFT standardizes the identification and quantitative assessment of critical breakdowns in the research agent pipeline, filling the gap left by prior answer-focused, subjectively evaluated benchmarks. It provides an actionable diagnostic framework, supports metric-based comparison of agent architectures, and clarifies the loci of persistent failure—all prerequisites for progress toward robust, trustworthy DRAs suitable for analyst-level report generation and synthesis tasks in scientific and technical domains (Zhang et al., 1 Dec 2025).
