Error-Based Diagnostic Taxonomies

Updated 24 April 2026

Error-based diagnostic taxonomies are systematic frameworks that classify observable and inferred errors by explicit criteria, in domains such as AI systems, biomedical diagnostics, and natural language tasks.
They integrate multi-label classification and hierarchical structures to map errors to actionable remediation and evaluation protocols.
They support system improvement through quantitative metrics like coverage, exclusivity, and class balance, ensuring robust and scalable quality assurance.

Error-based diagnostic taxonomies systematically categorize, formalize, and operationalize the detection, analysis, and explanation of errors and failures within AI, computational systems, biomedical diagnostics, natural language tasks, and other domains. Distinguished by their explicit focus on observed or inferred error phenomena rather than intended functionality, these taxonomies enable the rigorous identification of defect modalities, direct system-level remediation, inform evaluation protocols, and support automation for scalable quality assurance.

1. Foundational Concepts and Definitions

Error-based diagnostic taxonomies (EDTs) partition observable or inferable system failures, anomalies, or outputs into well-defined categories, typically structured to maximize mutual exclusivity, coverage, and practical usability. The “diagnostic” modifier refers to the explicit mapping from error categories to actionable methods for explanation, triage, quality improvement, or downstream remediation. EDTs can be hierarchically organized, as in technical failure cause cascades or error ontologies, or flat, as in multi-label error lists for classification purposes.
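The distinction between hierarchical and flat organization can be illustrated with a small sketch. It assumes a taxonomy stored as a nested dictionary whose leaves are empty dictionaries; the category names and the helper `leaf_paths` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical hierarchical taxonomy: inner nodes are groups, leaves are error types.
Taxonomy = Dict[str, "Taxonomy"]

example_taxonomy: Taxonomy = {
    "structural": {"incomplete": {}, "malformed": {}},
    "content": {"off_topic": {}, "factual_error": {}},
    "other": {},
}


def leaf_paths(tree: Taxonomy, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every leaf category, depth first."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


# A flat label list, as used for multi-label classification, is just the leaves.
flat_labels = ["/".join(path) for path in leaf_paths(example_taxonomy)]
```

Keeping the hierarchy as the source of truth and deriving the flat list from it means both views stay consistent when categories are added or renamed.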

In practice, EDTs are purpose-specific:

For text generation tasks, errors may be defined by information distortion, structural or linguistic failures, or answer misalignments.
For incident analysis in machine learning/AI, failures are classified by system goals, method choices, and proximate technical causes (Pittaras et al., 2022).
In biomedical diagnostics, error-based networks model patterns of misdiagnosis through directed graphs, where nodes (diseases) are connected based on empirical confusion frequencies (Li et al., 2020).

A recurring principle is to tightly couple error typologies with the actionable grain of evaluation, such that the taxonomy directly informs metric design, labeling workflows, and root-cause analysis.

2. Representative Taxonomy Architectures Across Domains

2.1 NLP and Text Generation: ErrEval for Question Generation

ErrEval implements a three-level error-based taxonomy spanning:

Structural errors (Incomplete, Not A Question)
Linguistic errors (Spell Error, Grammar Error, Vague, Unnecessary Copy)
Content-related errors (Off Topic, Factual Error, Information Not Mentioned, Off Target Answer, No Error)

ErrEval’s multi-label classification system operates in an 11-dimensional error space, mapping each generated question to a vector $el \in \{0, 1\}^{11}$ (Fu et al., 15 Jan 2026). Error diagnostics guide subsequent LLM-based evaluation, aligning automatic scoring with human judgment and substantially reducing overestimation of question quality.
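As a minimal sketch of the 11-dimensional binary encoding, the following maps a set of error labels to a 0/1 vector and back. The label names come from the list above; the index order and the helper names `encode_errors` and `decode_errors` are assumptions for illustration, not ErrEval's actual implementation.

```python
from typing import Iterable, List

# Label names are taken from the taxonomy above; the ordering is an assumption.
ERROR_LABELS = [
    "Incomplete", "Not A Question",                          # structural
    "Spell Error", "Grammar Error", "Vague", "Unnecessary Copy",  # linguistic
    "Off Topic", "Factual Error", "Information Not Mentioned",
    "Off Target Answer", "No Error",                         # content-related
]
LABEL_INDEX = {name: i for i, name in enumerate(ERROR_LABELS)}


def encode_errors(labels: Iterable[str]) -> List[int]:
    """Return a binary vector with a 1 at the position of each given label."""
    vector = [0] * len(ERROR_LABELS)
    for label in labels:
        vector[LABEL_INDEX[label]] = 1  # raises KeyError on unknown labels
    return vector


def decode_errors(vector: List[int]) -> List[str]:
    """Return the label names whose positions are set in the vector."""
    return [name for name, bit in zip(ERROR_LABELS, vector) if bit]
```

For example, a question judged both vague and off topic would be encoded with ones at the positions of "Vague" and "Off Topic" and zeros elsewhere.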

2.2 AI Incident Analysis: GMF Cascade Model

The GMF taxonomy decomposes incidents along three top-level dimensions:

AI System Goals (G)
Methods & Technologies (M)
Technical Failure Causes (F)

Each incident is triple-tagged $(g, m, f)$, where $g$, $m$, and $f$ are drawn from the goal set $G$, method set $M$, and cause set $F$, respectively. Dependency relations $R_{GM}$ (goal-method) and $R_{MF}$ (method-failure) define a directed acyclic graph mapping plausible error pathways. This enables both forward (goal→failure) and backward (failure→method, goal) tracing for root-cause analysis (Pittaras et al., 2022).
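Forward and backward tracing over the two dependency relations can be sketched as follows. This is an illustration under stated assumptions: the relations are stored as plain lists of pairs, the goal and method names are hypothetical, and only the cause names "Overfitting" and "Model Poisoning" appear in this article.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical relation contents; only the two cause names appear in the article.
R_GM = [("image classification", "convolutional network"),
        ("text ranking", "gradient boosted trees")]
R_MF = [("convolutional network", "Overfitting"),
        ("convolutional network", "Model Poisoning"),
        ("gradient boosted trees", "Overfitting")]


def index_pairs(pairs: Iterable[Tuple[str, str]]):
    """Build forward (a -> {b}) and backward (b -> {a}) lookups for a relation."""
    forward: Dict[str, Set[str]] = defaultdict(set)
    backward: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        forward[a].add(b)
        backward[b].add(a)
    return forward, backward


GM_FWD, GM_BWD = index_pairs(R_GM)
MF_FWD, MF_BWD = index_pairs(R_MF)


def forward_trace(goal: str) -> Set[Tuple[str, str]]:
    """Goal -> all (method, failure cause) pairs reachable from it."""
    return {(m, f) for m in GM_FWD.get(goal, ()) for f in MF_FWD.get(m, ())}


def backward_trace(failure: str) -> Set[Tuple[str, str]]:
    """Failure cause -> all (goal, method) pairs that can lead to it."""
    return {(g, m) for m in MF_BWD.get(failure, ()) for g in GM_BWD.get(m, ())}
```

Because edges only run from goals to methods and from methods to causes, the combined graph is acyclic by construction, and both traces terminate after two hops.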

2.3 Grammatical Error Diagnosis: Taxonomy Evaluation

Traditional grammatical error taxonomies (e.g. POL73, TUC74, BRY17, FEI23) are structured by morphosyntactic or cognitive complexity levels and are explicitly evaluated for mutual exclusivity, coverage, class balance, and usability. Quantitative assessment uses LLM-simulated overlap metrics, annotator agreement (Cohen’s $\kappa$), and model $F_1$ scores. Best-performing taxonomies in terms of diagnostic utility maximize exclusivity (FEI23: 0.848), balance (FEI23: 0.878), and coverage (BRY17: 0.979) (Zou et al., 17 Feb 2025).

3. Taxonomy Construction, Annotation, and Automation

3.1 Category Definition, Data Collection, and Hierarchization

Modern taxonomies utilize a combination of expert-driven definition (bootstrapped via iterative panel review or grounded theory), LLM-assisted error synthesis and validation, and multi-stage annotation. For example, ErrEval’s error identifier is trained with synthetic examples from LLMs, filtered via multi-model agreement at high confidence thresholds, and iteratively refined via uncertainty and consistency metrics (Fu et al., 15 Jan 2026).
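Filtering synthetic examples by multi-model agreement might look like the sketch below. It assumes that each candidate example carries one (label, confidence) vote per model, that agreement means all models predict the same label, and that 0.9 is the confidence threshold; the actual agreement rule and threshold are not given here, so all three are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Vote = Tuple[str, float]  # (predicted error label, model confidence in [0, 1])


def agreed_label(votes: List[Vote], min_conf: float = 0.9) -> Optional[str]:
    """Return the label only if every model predicts it with enough confidence."""
    if not votes:
        return None
    labels = {label for label, _ in votes}
    if len(labels) != 1:
        return None  # models disagree on the label
    if any(conf < min_conf for _, conf in votes):
        return None  # at least one model is not confident enough
    return votes[0][0]


def filter_synthetic(examples: Dict[str, List[Vote]],
                     min_conf: float = 0.9) -> Dict[str, str]:
    """Keep only examples whose label passes the agreement check."""
    kept: Dict[str, str] = {}
    for text, votes in examples.items():
        label = agreed_label(votes, min_conf)
        if label is not None:
            kept[text] = label
    return kept
```

A stricter threshold trades training-set size for label precision; examples rejected here could be routed to human review rather than discarded.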

AI incident analysis employs staged annotation: first extracting system goals, then retrieving relevant historical incidents, and finally adding method and failure-cause tags, all grounded in explicit text evidence and annotated with confidence levels (Pittaras et al., 2022).

3.2 Automation and Tool Support

A mature taxonomy can serve as the substrate for automation:

Multi-label classifiers: e.g., RoBERTa-based EI modules for question error typing (Fu et al., 15 Jan 2026)
LLM error audits: as in rubric failure diagnosis through LLM-as-judge classification (Qi et al., 1 Apr 2026)
Graph-based root-cause analysis: weighted directed graphs representing disease misdiagnosis with node centrality and edge weights enabling rapid lookups for high-risk confusions (Li et al., 2020)
Dashboarding and trend analysis: error monitoring dashboards record, summarize, and visualize error rates by class, supporting iterative system improvement (Leung et al., 15 Oct 2025)
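The graph-based lookup described above can be sketched with a weighted directed graph held in nested counters. The edge direction (true diagnosis to mistaken diagnosis), the use of weighted in-degree as a simple centrality measure, and the placeholder disease names are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_confusion_graph(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each true diagnosis was recorded as another disease.

    Edges point from the true diagnosis to the mistaken one (an assumption).
    """
    graph: Dict[str, Counter] = defaultdict(Counter)
    for true_dx, recorded_dx in pairs:
        if true_dx != recorded_dx:  # correct diagnoses add no edge
            graph[true_dx][recorded_dx] += 1
    return graph


def top_confusions(graph: Dict[str, Counter], disease: str, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent misdiagnoses for a given true disease."""
    return graph[disease].most_common(k) if disease in graph else []


def weighted_in_degree(graph: Dict[str, Counter]) -> Counter:
    """Total incoming edge weight per node: how often it is the wrong answer."""
    totals: Counter = Counter()
    for targets in graph.values():
        for node, weight in targets.items():
            totals[node] += weight
    return totals


# Placeholder records: (true diagnosis, recorded diagnosis).
records = [("disease_a", "disease_b"), ("disease_a", "disease_b"),
           ("disease_a", "disease_c"), ("disease_d", "disease_b"),
           ("disease_d", "disease_d")]
graph = build_confusion_graph(records)
```

Nodes with a high weighted in-degree are diagnoses that frequently absorb other conditions, which makes them natural candidates for targeted review.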

4. Metrics, Reliability, and Validation

Rigorous error-based taxonomy design incorporates explicit quantitative metrics:

Coverage: Fraction of error instances assignable to concrete taxonomy categories.
Exclusivity: Degree to which instances map uniquely to one category; computed via LLM overlap or annotator voting (Zou et al., 17 Feb 2025).
Class Balance: Entropy-normalized frequency distributions to avoid underpopulated long-tail categories.
Usability: Human reliability (Cohen’s $\kappa$, IRR) and classifier performance (Macro/Micro $F_1$).
Overestimation Reduction: Defined as the rate at which low-quality outputs are marked high-quality by black-box evaluators but are successfully penalized by error-informed frameworks (Fu et al., 15 Jan 2026).
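Coverage, exclusivity, and class balance can be computed from a list of per-instance label sets, as in the sketch below. The exact formulas used in the cited work are not given here, so these are assumed definitions: coverage counts instances with at least one non-catch-all label, exclusivity counts labeled instances with exactly one label, and balance is Shannon entropy divided by its maximum.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Set


def coverage(assignments: List[Set[str]], catch_all: Iterable[str] = ("Other",)) -> float:
    """Fraction of instances with at least one concrete (non-catch-all) label."""
    if not assignments:
        return 0.0
    excluded = set(catch_all)
    covered = sum(1 for labels in assignments if labels - excluded)
    return covered / len(assignments)


def exclusivity(assignments: List[Set[str]]) -> float:
    """Fraction of labeled instances that map to exactly one category."""
    labeled = [labels for labels in assignments if labels]
    if not labeled:
        return 0.0
    return sum(1 for labels in labeled if len(labels) == 1) / len(labeled)


def class_balance(assignments: List[Set[str]], categories: Sequence[str]) -> float:
    """Entropy of the label distribution, normalized to [0, 1] by log(K)."""
    wanted = set(categories)
    counts = Counter(l for labels in assignments for l in labels if l in wanted)
    total = sum(counts.values())
    if total == 0 or len(wanted) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(wanted))
```

A balance of 1.0 means every category is used equally often, while values near 0 indicate that a few categories absorb almost all instances.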

Inter-annotator consistency targets are empirically set (e.g., a minimum Cohen’s $\kappa$ for technical tasks) and inform iterative definition refinement. Automated classification metrics (e.g., LLM $F_1$ on diagnostic labels) are systematically compared to human ground truth, as in the RIFT rubric taxonomy (LLM $F_1$ up to 0.86) (Qi et al., 1 Apr 2026).
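Both kinds of comparison can be computed with standard formulas. The sketch below implements Cohen's kappa for two annotators with one label per item, and macro- and micro-averaged F1 for multi-label predictions against human ground truth. The function names and input shapes are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators (one label per item)."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_scores(gold: List[Set[str]], pred: List[Set[str]],
              labels: Sequence[str]) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for multi-label predictions."""
    per_label = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_label) / len(labels) if labels else 0.0
    micro_denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / micro_denom if micro_denom else 0.0
    return macro, micro
```

Macro F1 weights every category equally and so exposes weak performance on rare error types, whereas micro F1 is dominated by the frequent ones.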

5. Application Impacts and Empirical Efficacy

Explicit diagnostic taxonomies consistently demonstrate both operational and statistical improvements:

Evaluation alignment: ErrEval increases correlation with human QG judgments by 7.7–8.2 percentage points over vanilla CoT and reduces overestimation rates on low-quality content by 5–12 pp (Fu et al., 15 Jan 2026).
Guided system improvement: RAG error diagnosis enables pipeline-stage-specific interventions, tracking error distributions and validating error reduction through iterative re-evaluation (Leung et al., 15 Oct 2025).
AI safety and regulation: The GMF taxonomy supports regulatory auditing by surfacing clusters of failures related to “Model Poisoning,” “Overfitting,” or other actionable causes (Pittaras et al., 2022).
Dataset and model development: Task-specific error taxonomies (e.g., in text simplification (Vendeville et al., 22 May 2025)) reveal model bottlenecks unreachable via aggregate performance metrics, motivating the creation of specialized detection and correction tools.

6. Design Principles and Best Practices

Empirical evaluation of existing systems supports several broadly applicable design tenets:

Mutual exclusivity: Categories must be clearly delimited, with boundary definitions piloted and quantitatively stress-tested (e.g. LLM overlap $\leq$ 1.2) (Zou et al., 17 Feb 2025).
Comprehensive coverage: All substantive error modes must be represented, minimizing reliance on catch-all or “other” bins (target coverage $\geq$ 0.9).
Granularity calibration: Avoid both overbroad classes (reducing actionability) and overfragmentation (increasing annotation ambiguity); target entropy-normalized balance $\geq$ 0.7.
Iterative empirical validation: Employ LLM and human-in-the-loop workflows, ablation studies, and aggregate agreement benchmarks to refine label sets and definitions (Fu et al., 15 Jan 2026, Zou et al., 17 Feb 2025).
Alignment with downstream needs: Taxonomies must be coupled to evaluation goals—scoring, automated diagnosis, reporting, or process improvement.
Transparent documentation: All definitions, annotation criteria, and labeling rules are to be available for reproducibility and extension.

By adhering to these principles, error-based diagnostic taxonomies enable fine-grained, scalable, and reliable error analysis across diverse computational fields, directly supporting robust system design, evaluation, and continual improvement.