
DEFT: Taxonomy for Deep Research Failures

Updated 3 December 2025
  • DEFT is a structured taxonomy that classifies failure modes in deep research agents and deep learning systems across reasoning, retrieval, and generation dimensions.
  • It employs a multi-stage grounded theory approach with LLM-assisted annotations and expert validations to ensure robust, reproducible fault diagnosis.
  • DEFT enables actionable improvements via quantitative scoring, diagnostic benchmarking, and integrated best practices for enhanced system transparency and resilience.

The Deep rEsearch Failure Taxonomy (DEFT) comprises a family of empirically grounded, formally structured taxonomies for classifying, diagnosing, and remediating failure modes in deep learning–based systems—including deep research agents, deep reinforcement learning (DRL) programs, and deep neural networks at large. DEFT is systematically constructed through grounded theory, multi-source artifact analysis, and inter-annotator validation, and is deployed to expose, quantify, and guide the correction of process-level, programmatic, and behavioral errors across reasoning, retrieval, generation, and learning control processes (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025, Nikanjam et al., 2021, Humbatova et al., 2019).

1. DEFT in Deep Research Agents: Structure and Prevalence

In the context of deep research agents (DRAs)—systems designed to synthesize analyst-grade research reports—the DEFT framework delineates 14 fine-grained failure modes distributed across three core dimensions: Reasoning, Retrieval, and Generation. Each covers distinctive categories of agent limitation, error, and hallucination as cataloged through large-scale annotation of ≈1,000 DRA-generated reports (Zhang et al., 1 Dec 2025).

DEFT: Process-Level Failure Modes in Deep Research Agents

| Dimension | Failure Modes | Prevalence (%) |
|---|---|---|
| Reasoning | FUR (Failure to Understand Requirements), LAD, LAS, RPS | 28.14 |
| Retrieval | IIA (Insufficient External Information Acquisition), IHD, IIF, IRM, VMF | 33.10 |
| Generation | RCP, SOD, CSD, DAR, SCF (Strategic Content Fabrication) | 38.76 |

Reasoning failures include misinterpretation of requirements (FUR, 10.55%), analytic shallowness (LAD, 11.09%), incomplete dimension coverage, and inability to adapt plans. Retrieval failures encompass inadequate or misapplied search, poor data integration, low-authority citation, and skipped verification. Generation failures manifest in redundant content, poor structural adherence, off-specification outputs, insufficient analytical rigor, and data/method hallucination (SCF, 18.95%).

Krippendorff’s α reliability for taxonomy coding across Reasoning, Retrieval, and Generation ranges from 0.74 to 0.92, validating the taxonomy’s reproducibility. A positive taxonomy scoring metric, $S_i = |D| \cdot \cos\left((E_i/|D|) \cdot \pi/2\right)$, is defined to normalize error-count–based dimension scores.
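The scoring metric translates directly into code. The sketch below interprets $|D|$ as the number of checks in a dimension and $E_i$ as the observed error count; this reading is an assumption, since the summary does not spell out the symbols' exact definitions:

```python
import math

def deft_dimension_score(error_count: int, dimension_size: int) -> float:
    """Positive taxonomy score S_i = |D| * cos((E_i / |D|) * pi / 2).

    Assumed reading: |D| is the number of checks in the dimension and
    E_i the number of observed errors. Zero errors yield the maximum
    score |D|; E_i = |D| yields a score of (approximately) 0.
    """
    if not 0 <= error_count <= dimension_size:
        raise ValueError("error count must lie in [0, |D|]")
    return dimension_size * math.cos((error_count / dimension_size) * math.pi / 2)
```

The cosine shape penalizes the first few errors gently and later errors steeply, so dimensions with many faults are pushed sharply toward zero.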

2. Empirical Construction and Annotation Methodology

The DEFT taxonomies are derived via multi-stage grounded-theory methodology (Zhang et al., 1 Dec 2025, Humbatova et al., 2019): open coding with multiple LLMs and human annotators, clustering into axial codes, codebook optimization by semantic similarity, and selective coding into hierarchical categories. Cross-validation is achieved via double annotation and reconciliation sessions with domain experts, yielding high inter-annotator agreement (α ∼ 0.80–0.87).
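The reported agreement levels can be reproduced from dual-annotated data. Below is a minimal nominal-data implementation of Krippendorff's α (a generic sketch of the statistic, not the authors' annotation tooling):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    `units` is a list of units, each a list of labels assigned by the
    coders who rated that unit. Units with fewer than two codings are
    unpairable and skipped.
    """
    coincidence = Counter()                # ordered (label, label) pair weights
    for codes in units:
        m = len(codes)
        if m < 2:
            continue
        for c, k in permutations(codes, 2):
            coincidence[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()                        # marginal label frequencies
    for (c, _), w in coincidence.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidence.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - observed / expected
```

Perfect agreement yields α = 1.0, while chance-level coding drives α toward 0, matching the 0.74–0.92 range reported for the DEFT codebooks as substantial-to-excellent reliability.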

Artifact sources include:

  • DRA-produced reports (≈1,000)
  • GitHub commits and issues (1,981 + 1,392 after cleaning, with 1,059 DL-relevant manually labeled artifacts) (Humbatova et al., 2019)
  • Stack Overflow posts (extracted top 1,000 per framework; filtered to 477)
  • Practitioner and researcher interviews (20 semi-structured sessions)

In DRL, DEFT taxonomies are also informed by manual analysis of Stack Overflow and GitHub posts covering OpenAI Gym, Dopamine, Keras-rl, and Tensorforce (Nikanjam et al., 2021).

3. DRL-Specific DEFT: Behavioral and Programmatic Fault Taxonomies

In deep reinforcement learning, DEFT is instantiated in two complementary forms:

The behavioral taxonomy categorizes agent failure via trajectory and action-pattern signatures:

  • Catatonic Collapse: Agent produces a near-zero action vector for an extended interval; formally, $\|a_t\|_2 < \epsilon$ for $t \in [t_0, t_0+T]$ and trajectory divergence from “ghost” policies surpasses a threshold.
  • Manic Oscillation: Rapid alternation of divergent control signals; formally detected by oscillation rate $F = (1/W) \sum_{t=1}^{W} f_t$ exceeding $\Phi$.
  • Obsessive Loop: Entrapment in short, repeating suboptimal state cycles; characterized by loop score $\ell = \max_{L \leq L_{max}} \frac{\#\,\mathrm{repeats}(L)}{T}$.
  • Gradual Drift: Monotonic trajectory divergence, with linear or sublinear $\Delta x_t$ growth and high correlation $\mathrm{corr}(\Delta x_t, t)$.
  • Policy Fragmentation: Intra-episode behavioral clustering with low silhouette coefficient $\sigma_K$, $K > 1$, and high between-segment divergence $D_{frag}$.
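These signatures reduce to simple trajectory statistics. The sketch below implements illustrative detectors for the first two; the thresholds (`eps`, `min_len`, `phi`) are placeholders, not the paper's calibrated values:

```python
import numpy as np

def catatonic_collapse(actions, eps=1e-3, min_len=50):
    """Flag an interval where ||a_t||_2 < eps holds for at least
    min_len consecutive steps. `actions` is a (T, d) array of action
    vectors; thresholds are illustrative."""
    small = np.linalg.norm(actions, axis=1) < eps
    run = 0
    for flag in small:
        run = run + 1 if flag else 0
        if run >= min_len:
            return True
    return False

def manic_oscillation(actions, phi=0.5):
    """Oscillation rate F = (1/W) * sum_t f_t, where f_t marks a sign
    flip of the leading action component over a window of W steps;
    flags the trajectory when F exceeds phi (illustrative)."""
    lead = np.sign(actions[:, 0])
    flips = lead[1:] * lead[:-1] < 0
    return bool(flips.mean() > phi)
```

The "ghost policy" divergence and clustering-based fragmentation checks would require reference rollouts and a segmentation step, and are omitted here.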

The programmatic fault taxonomy groups 11 distinct code-level faults into four categories:

| Top-level Category | Fault Types | Example Manifestations (Formal/Empirical) |
|---|---|---|
| Environment interaction | Missing step, terminal flag, reset | No call to `env.step`, failing to test `done`, omitting `env.reset()` |
| Exploration | Absent/incorrect exploration strategy | Static exploitation, $\epsilon = 0.0$ (no exploration), improper decay |
| Network update | TD rule, frequency, assignment, gradients | TD update errors (e.g., forgetting $\gamma$), target update misconfigured |
| Output-layer faults | Wrong shape, activation | `Dense(1)` for multi-action, softmax on Q-values |
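These fault classes correspond to concrete omissions in an otherwise routine training loop. The minimal Q-learning sketch below, written against a hypothetical stub environment rather than a real gym API, marks in comments where each catalogued fault would arise:

```python
import random

class ToyEnv:
    """Hypothetical stub standing in for a gym-style environment."""
    def reset(self):
        self.t = 0
        return 0                                   # single-state toy problem
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10
        return 0, reward, done

def train(episodes=20, gamma=0.99, eps=1.0, eps_decay=0.95):
    env, q = ToyEnv(), {0: [0.0, 0.0]}             # one state, two actions
    for _ in range(episodes):
        state = env.reset()                        # fault class: missing env.reset()
        done = False
        while not done:                            # fault class: untested terminal flag
            if random.random() < eps:              # fault class: eps = 0.0 disables this branch
                action = random.randrange(2)
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)  # fault class: missing env.step
            # TD target; dropping gamma here is a catalogued network-update fault
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += 0.1 * (target - q[state][action])
            state = next_state
        eps *= eps_decay                           # fault class: improper/missing decay
    return q, eps
```

Running `train(eps=0.0)` demonstrates the exploration fault directly: the greedy tie-break always picks action 0, the reward for action 1 is never discovered, and the Q-table never moves.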

Detection is automated via DRLinter: Python source code is parsed to generate a host graph, then graph transformation rules search for structural or value-based invariant violations, leveraging the GROOVE tool for exhaustive pattern application.
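A much-simplified flavor of this kind of structural check can be sketched with Python's `ast` module. The real DRLinter builds a host graph and applies GROOVE transformation rules; the toy linter below is not its implementation, and only flags two of the catalogued patterns:

```python
import ast

def lint_drl_source(source: str):
    """Toy structural linter: flags a loop that steps the environment
    without ever resetting it, and an exploration rate hard-coded to
    zero (epsilon = 0.0). Illustrative only."""
    tree = ast.parse(source)
    calls = {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }
    findings = []
    if "step" in calls and "reset" not in calls:
        findings.append("missing env.reset() before stepping the environment")
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and any(isinstance(t, ast.Name) and t.id == "epsilon"
                        for t in node.targets)
                and isinstance(node.value, ast.Constant)
                and node.value.value == 0.0):
            findings.append("epsilon = 0.0: exploration disabled")
    return findings
```

Graph transformation rules generalize this idea: instead of ad hoc AST walks, each fault pattern is a declarative subgraph match that GROOVE applies exhaustively over the parsed program.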

4. Comprehensive Fault Taxonomy for Deep Learning Systems

A broader DEFT taxonomy for general deep learning (DL) systems consists of 15 validated leaf categories (Humbatova et al., 2019), hierarchically organized across Training, Model, and System branches. Categories include hyperparameter tuning, loss function selection, data preprocessing, optimizer errors, data quality, model configuration, layer-specific errors, tensor shape mismatches, device targeting (GPU), and API misuse. This taxonomy was confirmed through survey validation: mean occurrence rate $\bar{p} = 66\%$, with “Training Data Quality” being the most prevalent at 95%.

| Branch | Category Examples |
|---|---|
| Training | Hyperparameters, Loss, Validation, Preproc, Optimiser, Data Quality, Process |
| Model | Model Type, Missing/Wrong Layer, Layer Properties, Activation, Input, Tensor Shape |
| System | GPU Usage, API Usage |

Application workflows include checklist-based reviews, monitoring/logging, mutation testing with DEFT-aligned operators, and team onboarding (Humbatova et al., 2019).
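Mutation testing against this taxonomy means seeding a codebase with faults drawn from the leaf categories. The sketch below is a toy mutation operator for the "wrong/missing activation" class, implemented as a source-to-source AST rewrite; it is an illustration of the idea, not a published mutation tool:

```python
import ast

class DropActivation(ast.NodeTransformer):
    """Toy DEFT-aligned mutation operator: replace a 'relu' activation
    keyword with 'linear', mimicking the taxonomy's wrong/missing
    activation fault class."""
    def visit_Call(self, node):
        self.generic_visit(node)                   # also mutate nested calls
        for kw in node.keywords:
            if (kw.arg == "activation"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value == "relu"):
                kw.value = ast.Constant(value="linear")
        return node

def mutate(source: str) -> str:
    """Apply the operator and unparse back to source text."""
    tree = DropActivation().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

A test suite that still passes on the mutated model (e.g. `Dense(64, activation='linear')`) has failed to kill the mutant, which flags a blind spot for that fault category.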

5. Application Protocols and Best Practices

DEFT taxonomies are employed in:

  • Diagnostic Benchmarking: Annotating failure occurrences and propagating them as structured “error datasets” for subsequent classifier training and dual-learning policies (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025).
  • Algorithm and Report Improvement: Identifying and prioritizing interventions—such as inserting anti-fabrication checks (SCF), enforcing structural templates (SOD, CSD), or triggering meta-policy refinement (policy fragmentation).
  • Evaluation and Scoring: Utilizing positive taxonomy scoring ($S_i$) to target high-frequency error regimes and provide dimension-level diagnostics.
  • Continuous Improvement: Tracking high-prevalence failure modes guides adjustment of training protocols and software engineering practices.

Recommendations include integration with process-level checklists, mandatory verification steps, and AR-based visualization overlays (for DRL) to map abstract metric symptoms to perceptually interpretable cues (ghost overlays, frequency meters, drift graphs) (Olaz, 14 Jun 2025).

6. Quantitative Insights and Limitations

Empirical studies using DEFT demonstrate prevalence patterns: Strategic Content Fabrication (SCF) dominates DRA generation errors; Training Data Quality is the most frequent DL-system fault. DRLinter achieves 100% recall and precision on synthetic benchmarks, but only 75% recall in real-world settings, with undetected cases involving subtle or dynamic bugs beyond the static analysis capability (Nikanjam et al., 2021).

A plausible implication is that hybrid static-dynamic analysis and richer behavioral instrumentation are necessary to cover the residual error surface—especially those involving runtime-only or numerically unstable failures.

7. Role in Advancing Robustness and Transparency

By formalizing failure patterns at multiple system levels, DEFT transforms opaque and fragmented agent or program faults into a structured taxonomic ontology. This provides unified protocols for reporting, tracing, and remedying errors—and supports data-driven adaptation, verification, and safety practices across deep learning disciplines. The interplay between statistical prevalence, formal invariants, visualization modalities, and inter-annotator validated codebooks positions DEFT as a foundational tool for process-oriented robustness and scientific reproducibility in complex deep learning workflows (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025, Nikanjam et al., 2021, Humbatova et al., 2019).
