
DEFT: Taxonomy for Deep Research Failures

Updated 3 December 2025
  • DEFT is a structured taxonomy that classifies failure modes in deep research agents and deep learning systems across reasoning, retrieval, and generation dimensions.
  • It employs a multi-stage grounded theory approach with LLM-assisted annotations and expert validations to ensure robust, reproducible fault diagnosis.
  • DEFT enables actionable improvements via quantitative scoring, diagnostic benchmarking, and integrated best practices for enhanced system transparency and resilience.

The Deep rEsearch Failure Taxonomy (DEFT) comprises a family of empirically grounded, formally structured taxonomies for classifying, diagnosing, and remediating failure modes in deep learning–based systems—including deep research agents, deep reinforcement learning (DRL) programs, and deep neural networks at large. DEFT is systematically constructed through grounded theory, multi-source artifact analysis, and inter-annotator validation, and is deployed to expose, quantify, and guide the correction of process-level, programmatic, and behavioral errors across reasoning, retrieval, generation, and learning control processes (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025, Nikanjam et al., 2021, Humbatova et al., 2019).

1. DEFT in Deep Research Agents: Structure and Prevalence

In the context of deep research agents (DRAs)—systems designed to synthesize analyst-grade research reports—the DEFT framework delineates 14 fine-grained failure modes distributed across three core dimensions: Reasoning, Retrieval, and Generation. Each covers distinctive categories of agent limitation, error, and hallucination as cataloged through large-scale annotation of ≈1,000 DRA-generated reports (Zhang et al., 1 Dec 2025).

DEFT: Process-Level Failure Modes in Deep Research Agents

| Dimension | Failure Modes | Prevalence (%) |
|---|---|---|
| Reasoning | FUR (Failure to Understand Requirements), LAD, LAS, RPS | 28.14 |
| Retrieval | IIA (Insufficient External Information Acquisition), IHD, IIF, IRM, VMF | 33.10 |
| Generation | RCP, SOD, CSD, DAR, SCF (Strategic Content Fabrication) | 38.76 |

Reasoning failures include misinterpretation of requirements (FUR, 10.55%), analytic shallowness (LAD, 11.09%), incomplete dimension coverage, and inability to adapt plans. Retrieval failures encompass inadequate or misapplied search, poor data integration, low-authority citation, and skipped verification. Generation failures manifest in redundant content, poor structural adherence, off-specification outputs, insufficient analytical rigor, and data/method hallucination (SCF, 18.95%).

Krippendorff’s α reliability for taxonomy coding across Reasoning, Retrieval, and Generation ranges from 0.74 to 0.92, validating the taxonomy’s reproducibility. A positive taxonomy scoring metric, $S_i = |D| \cdot \cos\left((E_i/|D|) \cdot \pi/2\right)$, is defined to normalize error-count–based dimension scores.
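The scoring metric translates directly into code. The sketch below interprets $|D|$ as the number of checks in a dimension and $E_i$ as the observed error count; this reading is an assumption, since the summary does not spell out the symbols' exact definitions:

```python
import math

def deft_dimension_score(error_count: int, dimension_size: int) -> float:
    """Positive taxonomy score S_i = |D| * cos((E_i / |D|) * pi / 2).

    Assumed reading: |D| is the number of checks in the dimension and
    E_i the number of observed errors. Zero errors yield the maximum
    score |D|; E_i = |D| yields a score of (approximately) 0.
    """
    if not 0 <= error_count <= dimension_size:
        raise ValueError("error count must lie in [0, |D|]")
    return dimension_size * math.cos((error_count / dimension_size) * math.pi / 2)
```

The cosine shape penalizes the first few errors gently and later errors steeply, so dimensions with many faults are pushed sharply toward zero.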

2. Empirical Construction and Annotation Methodology

The DEFT taxonomies are derived via multi-stage grounded-theory methodology (Zhang et al., 1 Dec 2025, Humbatova et al., 2019): open coding with multiple LLMs and human annotators, clustering into axial codes, codebook optimization by semantic similarity, and selective coding into hierarchical categories. Cross-validation is achieved via double annotation and reconciliation sessions with domain experts, yielding high inter-annotator agreement (α ∼ 0.80–0.87).
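The reported agreement levels can be reproduced from dual-annotated data. Below is a minimal nominal-data implementation of Krippendorff's α (a generic sketch of the statistic, not the authors' annotation tooling):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    `units` is a list of units, each a list of labels assigned by the
    coders who rated that unit. Units with fewer than two codings are
    unpairable and skipped.
    """
    coincidence = Counter()                # ordered (label, label) pair weights
    for codes in units:
        m = len(codes)
        if m < 2:
            continue
        for c, k in permutations(codes, 2):
            coincidence[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()                        # marginal label frequencies
    for (c, _), w in coincidence.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidence.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - observed / expected
```

Perfect agreement yields α = 1.0, while chance-level coding drives α toward 0, matching the 0.74–0.92 range reported for the DEFT codebooks as substantial-to-excellent reliability.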

Artifact sources include:

  • DRA-produced reports (≈1,000)
  • GitHub commits and issues (1,981 + 1,392 after cleaning, with 1,059 DL-relevant manually labeled artifacts) (Humbatova et al., 2019)
  • Stack Overflow posts (extracted top 1,000 per framework; filtered to 477)
  • Practitioner and researcher interviews (20 semi-structured sessions)

In DRL, DEFT taxonomies are also informed by manual analysis of Stack Overflow and GitHub posts covering OpenAI Gym, Dopamine, Keras-rl, and Tensorforce (Nikanjam et al., 2021).

3. DRL-Specific DEFT: Behavioral and Programmatic Fault Taxonomies

In deep reinforcement learning, DEFT is instantiated in two complementary forms:

The behavioral taxonomy categorizes agent failure via trajectory and action-pattern signatures:

  • Catatonic Collapse: Agent produces a near-zero action vector for an extended interval; formally, $\|a_t\|_2 < \epsilon$ for $t \in [t_0, t_0+T]$ and trajectory divergence from “ghost” policies surpasses a threshold.
  • Manic Oscillation: Rapid alternation of divergent control signals; formally detected by oscillation rate $F = (1/W) \sum_{t=1}^{W} f_t$ exceeding $\Phi$.
  • Obsessive Loop: Entrapment in short, repeating suboptimal state cycles; characterized by loop score $\ell = \max_{L \leq L_{max}} \frac{\#\,\mathrm{repeats}(L)}{T}$.
  • Gradual Drift: Monotonic trajectory divergence, with linear or sublinear $\Delta x_t$ growth and high correlation $\mathrm{corr}(\Delta x_t, t)$.
  • Policy Fragmentation: Intra-episode behavioral clustering with low silhouette coefficient $\sigma_K$, $K > 1$, and high between-segment divergence $D_{frag}$.
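These signatures reduce to simple trajectory statistics. The sketch below implements illustrative detectors for the first two; the thresholds (`eps`, `min_len`, `phi`) are placeholders, not the paper's calibrated values:

```python
import numpy as np

def catatonic_collapse(actions, eps=1e-3, min_len=50):
    """Flag an interval where ||a_t||_2 < eps holds for at least
    min_len consecutive steps. `actions` is a (T, d) array of action
    vectors; thresholds are illustrative."""
    small = np.linalg.norm(actions, axis=1) < eps
    run = 0
    for flag in small:
        run = run + 1 if flag else 0
        if run >= min_len:
            return True
    return False

def manic_oscillation(actions, phi=0.5):
    """Oscillation rate F = (1/W) * sum_t f_t, where f_t marks a sign
    flip of the leading action component over a window of W steps;
    flags the trajectory when F exceeds phi (illustrative)."""
    lead = np.sign(actions[:, 0])
    flips = lead[1:] * lead[:-1] < 0
    return bool(flips.mean() > phi)
```

The "ghost policy" divergence and clustering-based fragmentation checks would require reference rollouts and a segmentation step, and are omitted here.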

The programmatic fault taxonomy groups 11 distinct code-level faults into four categories:

| Top-level Category | Fault Types | Example Manifestations (Formal/Empirical) |
|---|---|---|
| Environment interaction | Missing step, terminal flag, reset | No call to `env.step`, failing to test `done`, omitting `env.reset()` |
| Exploration | Absent/incorrect exploration strategy | Static exploitation, $\epsilon = 0.0$ (no exploration), improper decay |
| Network update | TD rule, frequency, assignment, gradients | TD update errors (e.g., forgetting $\gamma$), target update misconfigured |
| Output-layer faults | Wrong shape, activation | `Dense(1)` for multi-action, softmax on Q-values |
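These fault classes correspond to concrete omissions in an otherwise routine training loop. The minimal Q-learning sketch below, written against a hypothetical stub environment rather than a real gym API, marks in comments where each catalogued fault would arise:

```python
import random

class ToyEnv:
    """Hypothetical stub standing in for a gym-style environment."""
    def reset(self):
        self.t = 0
        return 0                                   # single-state toy problem
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10
        return 0, reward, done

def train(episodes=20, gamma=0.99, eps=1.0, eps_decay=0.95):
    env, q = ToyEnv(), {0: [0.0, 0.0]}             # one state, two actions
    for _ in range(episodes):
        state = env.reset()                        # fault class: missing env.reset()
        done = False
        while not done:                            # fault class: untested terminal flag
            if random.random() < eps:              # fault class: eps = 0.0 disables this branch
                action = random.randrange(2)
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)  # fault class: missing env.step
            # TD target; dropping gamma here is a catalogued network-update fault
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += 0.1 * (target - q[state][action])
            state = next_state
        eps *= eps_decay                           # fault class: improper/missing decay
    return q, eps
```

Running `train(eps=0.0)` demonstrates the exploration fault directly: the greedy tie-break always picks action 0, the reward for action 1 is never discovered, and the Q-table never moves.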

Detection is automated via DRLinter: Python source code is parsed to generate a host graph, then graph transformation rules search for structural or value-based invariant violations, leveraging the GROOVE tool for exhaustive pattern application.
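A much-simplified flavor of this kind of structural check can be sketched with Python's `ast` module. The real DRLinter builds a host graph and applies GROOVE transformation rules; the toy linter below is not its implementation, and only flags two of the catalogued patterns:

```python
import ast

def lint_drl_source(source: str):
    """Toy structural linter: flags a loop that steps the environment
    without ever resetting it, and an exploration rate hard-coded to
    zero (epsilon = 0.0). Illustrative only."""
    tree = ast.parse(source)
    calls = {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }
    findings = []
    if "step" in calls and "reset" not in calls:
        findings.append("missing env.reset() before stepping the environment")
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and any(isinstance(t, ast.Name) and t.id == "epsilon"
                        for t in node.targets)
                and isinstance(node.value, ast.Constant)
                and node.value.value == 0.0):
            findings.append("epsilon = 0.0: exploration disabled")
    return findings
```

Graph transformation rules generalize this idea: instead of ad hoc AST walks, each fault pattern is a declarative subgraph match that GROOVE applies exhaustively over the parsed program.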

4. Comprehensive Fault Taxonomy for Deep Learning Systems

A broader DEFT taxonomy for general deep learning (DL) systems consists of 15 validated leaf categories (Humbatova et al., 2019), hierarchically organized across Training, Model, and System branches. Categories include hyperparameter tuning, loss function selection, data preprocessing, optimizer errors, data quality, model configuration, layer-specific errors, tensor shape mismatches, device targeting (GPU), and API misuse. This taxonomy was confirmed through survey validation: mean occurrence rate $\bar{p} = 66\%$, with “Training Data Quality” being the most prevalent at 95%.

| Branch | Category Examples |
|---|---|
| Training | Hyperparameters, Loss, Validation, Preproc, Optimiser, Data Quality, Process |
| Model | Model Type, Missing/Wrong Layer, Layer Properties, Activation, Input, Tensor Shape |
| System | GPU Usage, API Usage |

Application workflows include checklist-based reviews, monitoring/logging, mutation testing with DEFT-aligned operators, and team onboarding (Humbatova et al., 2019).
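Mutation testing against this taxonomy means seeding a codebase with faults drawn from the leaf categories. The sketch below is a toy mutation operator for the "wrong/missing activation" class, implemented as a source-to-source AST rewrite; it is an illustration of the idea, not a published mutation tool:

```python
import ast

class DropActivation(ast.NodeTransformer):
    """Toy DEFT-aligned mutation operator: replace a 'relu' activation
    keyword with 'linear', mimicking the taxonomy's wrong/missing
    activation fault class."""
    def visit_Call(self, node):
        self.generic_visit(node)                   # also mutate nested calls
        for kw in node.keywords:
            if (kw.arg == "activation"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value == "relu"):
                kw.value = ast.Constant(value="linear")
        return node

def mutate(source: str) -> str:
    """Apply the operator and unparse back to source text."""
    tree = DropActivation().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

A test suite that still passes on the mutated model (e.g. `Dense(64, activation='linear')`) has failed to kill the mutant, which flags a blind spot for that fault category.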

5. Application Protocols and Best Practices

DEFT taxonomies are employed in:

  • Diagnostic Benchmarking: Annotating failure occurrences and propagating them as structured “error datasets” for subsequent classifier training and dual-learning policies (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025).
  • Algorithm and Report Improvement: Identifying and prioritizing interventions—such as inserting anti-fabrication checks (SCF), enforcing structural templates (SOD, CSD), or triggering meta-policy refinement (policy fragmentation).
  • Evaluation and Scoring: Utilizing positive taxonomy scoring ($S_i$) to target high-frequency error regimes and provide dimension-level diagnostics.
  • Continuous Improvement: Tracking high-prevalence failure modes guides adjustment of training protocols and software engineering practices.

Recommendations include integration with process-level checklists, mandatory verification steps, and AR-based visualization overlays (for DRL) to map abstract metric symptoms to perceptually interpretable cues (ghost overlays, frequency meters, drift graphs) (Olaz, 14 Jun 2025).

6. Quantitative Insights and Limitations

Empirical studies using DEFT demonstrate prevalence patterns: Strategic Content Fabrication (SCF) dominates DRA generation errors; Training Data Quality is the most frequent DL-system fault. DRLinter achieves 100% recall and precision on synthetic benchmarks, but only 75% recall in real-world settings, with undetected cases involving subtle or dynamic bugs beyond the static analysis capability (Nikanjam et al., 2021).

A plausible implication is that hybrid static-dynamic analysis and richer behavioral instrumentation are necessary to cover the residual error surface—especially those involving runtime-only or numerically unstable failures.

7. Role in Advancing Robustness and Transparency

By formalizing failure patterns at multiple system levels, DEFT transforms opaque and fragmented agent or program faults into a structured taxonomic ontology. This provides unified protocols for reporting, tracing, and remedying errors—and supports data-driven adaptation, verification, and safety practices across deep learning disciplines. The interplay between statistical prevalence, formal invariants, visualization modalities, and inter-annotator validated codebooks positions DEFT as a foundational tool for process-oriented robustness and scientific reproducibility in complex deep learning workflows (Zhang et al., 1 Dec 2025, Olaz, 14 Jun 2025, Nikanjam et al., 2021, Humbatova et al., 2019).
