Step-Level Failure Diagnosis Overview

Updated 22 June 2026

Step-level failure diagnosis is a method that decomposes multi-step traces into distinct units to detect, localize, and categorize errors with structured taxonomies.
It employs hierarchical evaluation, DAG traversal, and causal inference techniques to enhance debugging precision and actionable analysis.
This approach is applied in LLM workflows, industrial automation, software debugging, and cyber-physical systems to boost reliability and repair efficiency.

Step-level failure diagnosis is a methodological paradigm in complex systems analysis—including agentic workflows, multi-agent coordination, cyber-physical automation, and software engineering—whose objective is to detect, localize, and categorize the precise execution step(s) responsible for a failure event. Unlike coarse, end-to-end evaluation or non-local fault attribution, step-level diagnosis decomposes multi-step traces into semantically distinct units ("steps," "spans," or "nodes") and outputs a mapping from trace locations to failure labels and rationales, often under a structured error taxonomy. This enables granular root-cause analysis, actionable debugging, and targeted repair in systems where failure propagation, step dependencies, or partial recoverability are central concerns.

1. Foundational Formalisms for Step-Level Diagnosis

In typical agentic or workflow contexts, an execution trace is modeled as a rooted, ordered tree or directed acyclic graph (DAG). Each node (span, step, or agent action) is annotated by its type (e.g., LLM call, tool invocation), input–output data, and contextual structure (e.g., parent/child links) (Madvil et al., 14 May 2026, Guo et al., 26 Apr 2026). Letting $\mathcal{T} = (S, \mathsf{Ch})$ represent the tree with spans $S$ and child function $\mathsf{Ch}$, the diagnosis task is to identify a set $S_{\mathrm{fail}} \subseteq S$ of spans at which failure arises, assign error categories from a taxonomy $\mathcal{C}$, and provide per-span rationales.
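As a rough illustration of this formalism, the sketch below represents a trace as a tree of spans and returns one finding (span, category, rationale) per failing span. The class names, field names, and the `toy_judge` rule are hypothetical and chosen only for illustration; they are not taken from any of the cited frameworks.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass
class Span:
    span_id: str
    kind: str      # e.g. "llm_call" or "tool_invocation"
    inputs: str
    outputs: str
    children: List["Span"] = field(default_factory=list)  # plays the role of Ch


@dataclass
class Finding:
    span_id: str    # localization: which span failed
    category: str   # categorization: label from the taxonomy
    rationale: str  # human-readable explanation


def iter_spans(root: Span) -> Iterator[Span]:
    """Pre-order traversal over every span in the tree."""
    yield root
    for child in root.children:
        yield from iter_spans(child)


def diagnose(root: Span,
             judge: Callable[[Span], Optional[Tuple[str, str]]]) -> List[Finding]:
    """Apply a per-span judge; collect a finding for each span it flags."""
    findings = []
    for span in iter_spans(root):
        verdict = judge(span)
        if verdict is not None:
            category, rationale = verdict
            findings.append(Finding(span.span_id, category, rationale))
    return findings


def toy_judge(span: Span) -> Optional[Tuple[str, str]]:
    """Hypothetical rule: tool output must look like a JSON object."""
    if span.kind == "tool_invocation" and not span.outputs.strip().startswith("{"):
        return ("Formatting Error", "tool output is not a JSON object")
    return None


root = Span("root", "agent", "task", "answer", [
    Span("s1", "llm_call", "prompt", "plan"),
    Span("s2", "tool_invocation", "query", "not json"),
])
findings = diagnose(root, toy_judge)
```

In practice the judge would be an LLM-based rubric or a learned model rather than a string check; the point here is only the shape of the output, a mapping from trace locations to labels and rationales.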

In multi-agent systems (MAS), a trace is typically a temporally ordered sequence $\tau = (x_1, x_2, \ldots, x_T)$, with each $x_t$ capturing agent identity, action, and context (Zhu et al., 2 Jun 2026, Chen et al., 24 Apr 2026). The goal is to compute a score $s_t$ for each step and to designate the root-cause step $t^* = \arg\max_{t} s_t$.
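The selection rule can be sketched in a few lines, assuming the per-step scores have already been produced by some scoring model. The `Step` type, the example scores, and the tie-breaking choice (earliest step wins) are assumptions made for this illustration.

```python
from typing import List, NamedTuple, Sequence


class Step(NamedTuple):
    agent: str
    action: str
    context: str


def root_cause_step(scores: Sequence[float]) -> int:
    """Return the 1-based index of the highest-scoring step.

    Ties resolve to the earliest step, because max() keeps the first maximum.
    """
    if not scores:
        raise ValueError("cannot diagnose an empty trace")
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def rank_steps(scores: Sequence[float], k: int = 3) -> List[int]:
    """Return the 1-based indices of the top-k steps, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # stable sort
    return [i + 1 for i in order[:k]]


trace = [
    Step("planner", "decompose task", ""),
    Step("coder", "write function", "plan"),
    Step("tester", "run tests", "code"),
    Step("planner", "report result", "test log"),
]
scores = [0.1, 0.7, 0.7, 0.2]  # hypothetical outputs of a scoring model

t_star = root_cause_step(scores)
top3 = rank_steps(scores, k=3)
```

Returning a ranked list alongside the single arg max is a convenience for inspection when several steps receive similar scores.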

Crucially, these frameworks distinguish between:

Localization: Pinpointing the failing node(s) or step(s).
Categorization: Assigning an error type or label.
Rationale: Producing a human-interpretable or machine-actionable explanation.

These principles underpin diverse applications: agentic LLM workflows, storage stacks, industrial automation, RL training, microservice architectures, and classical software.

2. Core Methodological Frameworks

Step-level failure diagnosis is realized through a spectrum of frameworks:

Hierarchical Span-Level Evaluation: Each leaf span is judged individually by a scoring rubric (often LLM-based) and verdicts are propagated up the execution tree; failures are classified under detailed taxonomies (e.g., Formatting Error, Resource Abuse) (Madvil et al., 14 May 2026).
DAG-Structured Analysis: AgentEval models trace steps as nodes in a DAG, linking failures via data dependencies. Local failures are attributed by traversing the DAG and propagating error attributions along edges, enabling recovery of both root-cause origin and error propagation chains (Guo et al., 26 Apr 2026).
Causal Inference and Counterfactuals: Causal discovery algorithms reverse execution dependencies, apply functional causal models (with, e.g., Shapley value corrections), and perform counterfactual simulations to quantify the impact of hypothetical repairs on global outcomes (Ma et al., 10 Sep 2025).
Semantic Temporal Modeling: StepFinder embeds each step using a frozen LLM, then applies a bidirectional LSTM and multi-head attention (with agent-awareness and multi-scale differencing) to model evolution and dependency, enabling efficient and accurate ranking of root-cause steps (Zhu et al., 2 Jun 2026).
Diagnosis-and-Repair Pipelines: For agentic reasoning, pipelines such as Doctor-RAG decouple diagnosis (coverage-gated categorization and earliest step localization) from repair (targeted surgical correction with maximal prefix reuse) (Jiao et al., 1 Apr 2026).

These frameworks address the central challenge of mapping lengthy, multi-agent, or highly structured traces to actionable diagnoses under constraints of scalability, accuracy, and interpretability.
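To make the hierarchical span-level idea concrete, the following sketch judges leaves individually and propagates verdicts up the tree. The propagation rule used here (an internal node fails if any leaf beneath it fails) is an assumption made for illustration; the cited frameworks may aggregate verdicts differently. The tree and leaf verdicts are invented.

```python
from typing import Dict, List, Tuple


def propagate(tree: Dict[str, List[str]],
              node: str,
              leaf_failed: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (failed, failing_leaves) for the subtree rooted at `node`.

    `tree` maps each internal node to its ordered children; nodes that do
    not appear as keys are treated as leaves.
    """
    children = tree.get(node, [])
    if not children:
        failed = leaf_failed.get(node, False)
        return failed, ([node] if failed else [])
    failing: List[str] = []
    for child in children:
        _, below = propagate(tree, child, leaf_failed)
        failing.extend(below)
    # Assumed rule: a parent fails if any descendant leaf failed.
    return bool(failing), failing


tree = {"root": ["plan", "execute"], "execute": ["search", "format"]}
leaf_failed = {"plan": False, "search": False, "format": True}  # toy verdicts

root_verdict = propagate(tree, "root", leaf_failed)
plan_verdict = propagate(tree, "plan", leaf_failed)
```

Carrying the list of failing leaves upward keeps the localization information available at every level, so an agent-level verdict can still point back to the specific span.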

3. Step-Level Metrics, Taxonomies, and Evaluation Benchmarks

Precise metrics and taxonomies are central to robust step-level diagnosis:

Taxonomies: Frameworks introduce hierarchical, fine-grained taxonomies (e.g., AgentEval’s 21 subcategories across Planning, Execution, Integration) (Guo et al., 26 Apr 2026); Doctor-RAG uses a coverage-gated taxonomy for retrieval-augmented generation (Jiao et al., 1 Apr 2026).
Metrics: Key quantitative metrics include
- Localization Accuracy: Fraction of predicted failing steps matching ground truth.
- Category F1: F1 computed per category and aggregated as a weighted average.
- Joint Localization–Categorization: Fraction of (step, category) pairs correct (Madvil et al., 14 May 2026).
Benchmarks: Evaluation is performed on purpose-built datasets:
- TRAIL: Multi-agent and code task traces with taxonomic labels, supporting per-category analysis (Madvil et al., 14 May 2026).
- TraceElephant (BugHunter): MAS failure traces with full observability, supporting step-level and agent-level attribution via the "decisive failure step" principle (Chen et al., 24 Apr 2026).
- RootSE: Complex repository-level software trajectories, supporting decisive error step annotation (Wang et al., 26 May 2026).
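The three metrics listed above can be sketched as follows, treating predictions and ground truth as sets of (step, category) pairs. The denominators are assumptions where the definitions leave room: localization and joint accuracy are taken over predicted items, and category F1 is weighted by ground-truth support. The example data is invented.

```python
from typing import Set, Tuple

Pair = Tuple[int, str]  # (step index, error category)


def localization_accuracy(pred: Set[Pair], gold: Set[Pair]) -> float:
    """Fraction of predicted failing steps that are failing in the ground truth."""
    pred_steps = {step for step, _ in pred}
    gold_steps = {step for step, _ in gold}
    if not pred_steps:
        return 0.0
    return len(pred_steps & gold_steps) / len(pred_steps)


def joint_accuracy(pred: Set[Pair], gold: Set[Pair]) -> float:
    """Fraction of predicted (step, category) pairs that match exactly."""
    if not pred:
        return 0.0
    return len(pred & gold) / len(pred)


def weighted_category_f1(pred: Set[Pair], gold: Set[Pair]) -> float:
    """Per-category F1, averaged with weights proportional to gold support."""
    if not gold:
        return 0.0
    total = 0.0
    for cat in {c for _, c in gold}:
        p = {s for s, c in pred if c == cat}
        g = {s for s, c in gold if c == cat}
        tp = len(p & g)
        f1 = 2 * tp / (len(p) + len(g))  # equals 2PR / (P + R); g is non-empty
        total += f1 * len(g) / len(gold)
    return total


gold = {(2, "Formatting Error"), (5, "Resource Abuse")}
pred = {(2, "Formatting Error"), (5, "Formatting Error")}

loc = localization_accuracy(pred, gold)
joint = joint_accuracy(pred, gold)
f1 = weighted_category_f1(pred, gold)
```

In this example both failing steps are located correctly, but one is given the wrong category, so localization accuracy is perfect while the joint and category scores are not. This gap is what the joint metric is meant to expose.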

Controlled experiments demonstrate that use of full trace observability and structured context yields up to 76% higher localization accuracy over output-only attribution (Chen et al., 24 Apr 2026); span-level decomposition and dependency modeling yield up to 3.5× improvements in localization accuracy for code tasks (Madvil et al., 14 May 2026, Wang et al., 26 May 2026).

4. Domain-Specific Instantiations and System Architectures

The step-level failure diagnosis paradigm has been specialized for diverse domains:

Agentic LLM Workflows: Holistic, tree- or DAG-structured analysis with rubric-guided scoring, with both bottom-up (span-level) and top-down (agent-level) evaluation supporting error localization and diagnosis (Madvil et al., 14 May 2026, Guo et al., 26 Apr 2026).
Agentic Coding Systems: Investigative controllers, semantic folding modules for aggressive context reduction, and prior-phase classification enable step-level root-cause identification in long software traces (Wang et al., 26 May 2026).
Multi-Agent Coordination: Temporal and cross-agent modeling, taxonomy-based anomaly detection, backward symptom tracing, and tool-grounded verification enable robust diagnosis, often enhanced by episodic memory and counterfactual experimentation (Zhu et al., 2 Jun 2026, Li et al., 19 Apr 2026, Ma et al., 10 Sep 2025).
Retrieval-Augmented Generation: Doctor-RAG’s coverage-aware, minimal-intervention repair after explicit error localization allows efficient correction, with exact-match (EM) gains of up to 25.8% and 2–4× higher repair rates versus a full rerun (Jiao et al., 1 Apr 2026).
Industrial and Cyber-Physical Systems: Step-level SMT-based diagnosis in manufacturing traces faults to specific steps, tools, or timing deviations, achieving up to 80% accuracy on synthetic RIM datasets (Krantz et al., 2023); cross-layer correlation trees isolate root causes across stack boundaries in storage systems (Zhang et al., 2020).
Software Debugging: Feature-centric machine learning—instrumenting processes to extract scalar pairs, def-use, branch, and coverage metrics, then training decision trees—generates interpretable, condition-specific rules for step-level fault localization (Smytzek et al., 25 Feb 2025).

These variants share the core principle of localizing failures to well-defined steps, underpinned by explicit models of trace structure and information flow.

5. Practical Considerations, Limitations, and Empirical Findings

Empirical research consistently finds that method choice and trace granularity critically impact diagnosis fidelity:

Trace Observability: Full input/output and context records are essential; removal of input fields or metadata can degrade step-level accuracy by up to 76% (Chen et al., 24 Apr 2026).
Dependency Modeling: Explicit DAG or tree structure is the single most critical factor, yielding +22pp recall and +34pp root-cause accuracy over flat step labeling (Guo et al., 26 Apr 2026).
Failure Propagation: Step-level methods are sensitive to propagation effects, so frameworks distinguish primary errors from propagated ones. For instance, AgentEval uses score-based heuristics to tag steps whose failures propagated from upstream parents (Guo et al., 26 Apr 2026).
Scalability and Efficiency: Lightweight encodings (StepFinder), folding and on-demand context expansion (TrajAudit), and graph-based filtering (X-Ray, DiagFusion) address inference bottlenecks in long traces, reducing compute cost or latency by 2–5× (Zhu et al., 2 Jun 2026, Zhang et al., 2020, Zhang et al., 2023, Wang et al., 26 May 2026).
Model Capacity vs. Methodology: Step-level performance is often limited less by the underlying judge (LLM or model) than by how the evaluation is structured; error localization can improve 3–12× under more granular decomposition even with the same base model (Madvil et al., 14 May 2026).
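The distinction between primary and propagated failures over a dependency DAG can be illustrated with a deliberately simple rule: a failing step counts as propagated if at least one of its direct upstream dependencies also failed, and as primary otherwise. This rule is a simplification chosen for the sketch and is not the score-based heuristic attributed to AgentEval; the graph and failure set are invented.

```python
from typing import Dict, List, Set


def classify_failures(parents: Dict[str, List[str]],
                      failed: Set[str]) -> Dict[str, str]:
    """Label each failing node as 'primary' or 'propagated'."""
    labels = {}
    for node in failed:
        upstream = [p for p in parents.get(node, []) if p in failed]
        labels[node] = "propagated" if upstream else "primary"
    return labels


def root_causes(parents: Dict[str, List[str]],
                node: str,
                failed: Set[str]) -> Set[str]:
    """Follow failing dependencies backward from a failing node.

    Returns the primary failures reachable through chains of failing
    parents. Assumes the dependency graph is acyclic.
    """
    failing_parents = [p for p in parents.get(node, []) if p in failed]
    if not failing_parents:
        return {node}
    roots: Set[str] = set()
    for parent in failing_parents:
        roots |= root_causes(parents, parent, failed)
    return roots


# Each step maps to the steps whose outputs it consumes.
parents = {
    "retrieve": [],
    "plan": [],
    "summarize": ["retrieve"],
    "answer": ["summarize", "plan"],
}
failed = {"retrieve", "summarize", "answer"}

labels = classify_failures(parents, failed)
origin = root_causes(parents, "answer", failed)
```

Here the visible failure at `answer` is traced back through `summarize` to `retrieve`, which is the only primary failure; flat step labeling would report three failures with no indication of which one to repair first.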

Limitations arise in scenarios with ambiguous or latent failure modes, where partial traces or missing sensor data hinder localization (Krantz et al., 2023). In addition, causal-inference-based methods assume acyclic dependency graphs and can struggle with feedback loops or non-stationary confounding factors (Ma et al., 10 Sep 2025).

6. Outlook and Future Directions

Active research is extending step-level failure diagnosis in several directions:

Self-Improving Systems: Episodic memory modules accumulating verified, tool-grounded error patterns support robust cross-domain transfer and self-improving diagnosis without annotation (Li et al., 19 Apr 2026).
Causal and Counterfactual Methods: Fine-grained causal discovery (CDC-MAS) and counterfactual simulations guide step selection for targeted repair, boosting global success rates by >22 points on the Who&When benchmark (Ma et al., 10 Sep 2025).
Integration with CI/CD: Seamless integration into engineering workflows, as demonstrated by AgentEval's CI/CD pilots, has reduced root-cause identification latency by an order of magnitude (Guo et al., 26 Apr 2026).
Cross-Modal and Multi-Modal Expansion: Diagnosis fusing logs, metrics, and trace modalities (e.g., DiagFusion) achieves up to 368% localization improvement over single-modal baselines (Zhang et al., 2023).
Fine-Tuning and Auto-Remediation: Post-training failure management loops (RFT-FM) that diagnose, classify, and intervene upon step-level fault types are emerging for RL-fine-tuned LLMs (Zhang et al., 6 May 2026).
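The counterfactual idea mentioned in this list can be illustrated on a toy deterministic pipeline: replace one step at a time with a candidate repair, rerun, and record whether the final outcome becomes a success. This is a minimal sketch of single-step intervention only, not the CDC-MAS procedure; the pipeline, the repairs, and the success criterion are all invented for the example.

```python
from typing import Callable, Dict, List, Optional

StepFn = Callable[[int], int]


def run(steps: List[StepFn], x: int,
        override: Optional[Dict[int, StepFn]] = None) -> int:
    """Run the steps in order, substituting any overridden step by index."""
    override = override or {}
    for i, step in enumerate(steps):
        x = override.get(i, step)(x)
    return x


def counterfactual_effects(steps: List[StepFn],
                           repairs: Dict[int, StepFn],
                           x: int,
                           is_success: Callable[[int], bool]) -> Dict[int, bool]:
    """For each candidate repair, rerun with only that step replaced."""
    return {i: is_success(run(steps, x, {i: fix})) for i, fix in repairs.items()}


# Toy pipeline: step 1 is buggy (it multiplies by 0 where 2 was intended).
steps = [lambda v: v + 1, lambda v: v * 0, lambda v: v - 3]
repairs = {0: lambda v: v + 2, 1: lambda v: v * 2}  # hypothetical repairs


def is_success(result: int) -> bool:
    return result == 7


baseline = run(steps, 4)
effects = counterfactual_effects(steps, repairs, 4, is_success)
```

Only the repair at step 1 flips the outcome, which singles that step out as the decisive one. Real agent traces are stochastic and expensive to rerun, so practical methods need to estimate such effects rather than enumerate them.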

Further research is expected to address continuous control settings, real-time streaming trace analysis, adaptation to multi-modal and multi-lingual agent environments, and the incorporation of richer causal and structural priors. Robust step-level failure diagnosis is thus foundational for cross-component transparency, reliability, and self-repair in the next generation of complex autonomous and agentic systems.