Core Metrics & Error Taxonomies

Updated 5 June 2026

Core metrics are quantitative measures assessing the gap between predicted and target values, providing a foundation for robust model evaluation.
Error taxonomies systematically classify failure types by severity, source, and stage, enabling targeted diagnostic and improvement strategies.
Together, these tools offer a multidimensional framework for performance analysis across domains like machine learning, NLP, and computer vision.

Core metrics and error taxonomies are foundational tools for evaluating, understanding, and improving automated systems across domains as varied as machine learning, natural language processing, computer vision, information retrieval, data management, and large language models (LLMs). This article surveys the principal metric forms, their mathematical underpinnings, and the structure and utility of error taxonomies, drawing on recent technical literature to outline best practices and open challenges.

1. Foundational Principles: Definitions and Structural Typologies

Core metrics are quantitative measures capturing the discrepancy between system predictions and references, targets, or formal requirements. Error taxonomies are systematic classifications of the types, loci, or severities of error that arise within a given system or application. The interplay between granular error measures and taxonomy-driven analysis underpins both model development and robust evaluation.

In regression and forecasting, Botchkarev (2018) decomposes performance metrics into four families: primary metrics; extended metrics (with post-aggregation normalization); composite metrics (ratios or combinations of primary metrics); and hybrid sets (collections of complementary metrics). A primary metric is parameterized by a point distance $\delta(y_i, \hat y_i)$, a normalization $N(y_i)$, and an aggregation $A(\cdot)$. For classification, hard-decision and score-based frameworks are unified under the expected-cost formalism, with errors further stratified by their operational, semantic, or statistical consequences (Ferrer, 2022).
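The three-part parameterization can be illustrated with a minimal Python sketch. The function name `primary_metric`, its argument names, and the choice to apply the normalization by dividing each point distance by $N(y_i)$ are assumptions made for illustration, not definitions taken from Botchkarev (2018).

```python
import math
from statistics import mean

def primary_metric(y_true, y_pred, distance, normalize=None, aggregate=mean):
    """Compose a metric from a point distance, an optional per-point
    normalization, and an aggregation (hypothetical helper)."""
    terms = []
    for y, y_hat in zip(y_true, y_pred):
        d = distance(y, y_hat)
        if normalize is not None:
            # Assumption: normalization divides the point distance by N(y).
            d = d / normalize(y)
        terms.append(d)
    return aggregate(terms)

def abs_dist(y, y_hat):
    return abs(y - y_hat)

def sq_dist(y, y_hat):
    return (y - y_hat) ** 2

# Toy data, invented for the example.
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 7.0]

mae = primary_metric(y_true, y_pred, abs_dist)
mse = primary_metric(y_true, y_pred, sq_dist)
rmse = primary_metric(y_true, y_pred, sq_dist,
                      aggregate=lambda t: math.sqrt(mean(t)))
mape = primary_metric(y_true, y_pred, abs_dist, normalize=abs)
```

Swapping only one of the three components is enough to move between familiar metrics, which is the point of the decomposition.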

Error taxonomies extend formal metric analysis by specifying classes of failure (e.g., error type, cognitive root, pipeline stage, attribute), providing an ontological structure for both reporting and mitigation (Ashury-Tahan et al., 22 Jan 2026, Bhadauria et al., 10 Apr 2026, Vendeville et al., 22 May 2025).

2. Metric Construction and Representative Examples

2.1. Regression and Forecasting

Primary error metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE):

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i-\hat y_i|,\quad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2,\quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$

Relative and scale-invariant variants, such as Mean Absolute Percentage Error (MAPE) or Mean Absolute Scaled Error (MASE), adjust for scale or permit cross-series comparison. Composite metrics, such as the Reference Index or the coefficient of determination $R^2$, combine basic metrics to summarize overall fit, variance explained, or external predictivity (Botchkarev, 2018, Naser et al., 2020).
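As a rough illustration of a scale-invariant metric and a composite one, the sketch below computes MASE and $R^2$ in plain Python. It assumes the common convention that MASE scales the forecast MAE by the in-sample MAE of a one-step naive forecast; the function names and toy data are hypothetical.

```python
def mase(y_train, y_true, y_pred):
    """MAE of the forecast divided by the in-sample MAE of a
    one-step naive forecast (assumed scaling convention)."""
    naive_mae = sum(abs(b - a) for a, b in zip(y_train, y_train[1:]))
    naive_mae /= len(y_train) - 1
    forecast_mae = sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)
    return forecast_mae / naive_mae

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Toy series, invented for the example.
mase_value = mase([1.0, 3.0, 2.0, 4.0], [5.0, 6.0], [4.0, 8.0])
r2_value = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

A MASE below 1 indicates the forecast beats the naive baseline on the training scale, which is what makes it comparable across series.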

2.2. Classification

Classification error is traditionally measured by accuracy/standard error rate (ACC/ER), balanced accuracy/error, $F_\beta$ score, Matthews correlation coefficient (MCC), and cost-sensitive generalizations (Expected Cost, EC):

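$\mathrm{ER} = 1-\mathrm{ACC} = \frac{1}{N} \sum_{i \neq j} N_{ij}$

where $N_{ij}$ is the number of samples with true class $i$ that are assigned to class $j$, and $N$ is the total number of samples.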

For evaluation of class probability scores, proper scoring rules such as cross-entropy and Brier Score are canonical (Ferrer, 2022).
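The hard-decision and score-based views can be compared on toy data with the sketch below. It assumes a confusion matrix indexed as `confusion[true][predicted]`, an expected cost computed with empirical class priors (so that a 0-1 cost matrix recovers the error rate), and the multi-class Brier score summed over classes without division by the number of classes. All names and numbers are illustrative.

```python
def error_rate(confusion):
    """Fraction of samples off the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    off_diagonal = sum(n for i, row in enumerate(confusion)
                       for j, n in enumerate(row) if i != j)
    return off_diagonal / total

def expected_cost(confusion, cost):
    """Average cost per sample; cost[i][j] is the cost of deciding j
    when the true class is i (empirical priors assumed)."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(cost[i][j] * n for i, row in enumerate(confusion)
                   for j, n in enumerate(row))
    return weighted / total

def brier_score(probs, labels):
    """Mean squared distance between predicted class probabilities
    and the one-hot encoding of the true label."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(probs)

confusion = [[8, 2], [1, 9]]
er = error_rate(confusion)
ec_01 = expected_cost(confusion, [[0, 1], [1, 0]])   # 0-1 cost
ec_asym = expected_cost(confusion, [[0, 1], [5, 0]])  # costly misses
bs = brier_score([[0.8, 0.2], [0.4, 0.6]], [0, 1])
```

With the asymmetric cost matrix the single class-1 miss dominates the score, which the plain error rate cannot express.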

2.3. Taxonomic, Hierarchical, and Semantic Structures

Taxonomy-aware evaluation incorporates metrics such as the Concept Similarity Correlation (CSC), which measures, via a rank correlation $\tau$, how monotonically semantic similarity $S_{ij}$ (from embeddings) tracks taxonomic similarity $W_{ij}$ (from tree structure):

$W_{ij} = \frac{2 \cdot \mathrm{lca\_depth}(p(c_i), p(c_j))}{|p(c_i)| + |p(c_j)|},\quad S_{ij} = \langle e(D(c_i)), e(D(c_j)) \rangle_{\mathrm{cos}}$

$\mathrm{CSC} = \tau(\{ W_{ij} \}, \{ S_{ij} \})$
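A small sketch of the CSC computation follows. It makes several assumptions: each concept's path $p(c)$ is a list of node names starting at the root, the LCA depth is the length of the shared path prefix, the embeddings are hand-written toy vectors, and $\tau$ is taken to be Kendall's tau in its simplest form (tau-a, with tied pairs counted as neither concordant nor discordant).

```python
import math
from itertools import combinations

def lca_depth(path_a, path_b):
    """Length of the common prefix of two root-to-node paths."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def taxonomic_similarity(path_a, path_b):
    return 2 * lca_depth(path_a, path_b) / (len(path_a) + len(path_b))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def kendall_tau_a(xs, ys):
    """(concordant - discordant) / number of pairs; ties count as zero."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy taxonomy paths and toy embeddings (both hypothetical).
paths = {
    "dog": ["entity", "animal", "mammal", "dog"],
    "cat": ["entity", "animal", "mammal", "cat"],
    "oak": ["entity", "plant", "tree", "oak"],
}
embeddings = {"dog": [1.0, 0.1], "cat": [0.9, 0.2], "oak": [0.0, 1.0]}

pairs = list(combinations(paths, 2))
W = [taxonomic_similarity(paths[a], paths[b]) for a, b in pairs]
S = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
csc = kendall_tau_a(W, S)
```

A positive value indicates that concept pairs closer in the tree also tend to be closer in embedding space.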

Logical adequacy (NLIV) is measured via pathwise geometric-mean entailment probabilities under a Natural Language Inference model (Wullschleger et al., 16 May 2025).

In multi-label or hierarchical classification, metrics such as Hierarchical Overlap Score (HOS) and Hierarchical Distance Score (HDS) provide robust alternatives to flat F1, awarding partial credit to near-miss predictions and penalizing catastrophic cross-branch errors (Schaper et al., 21 Jan 2026).

2.4. Automatic Speech and Text Generation

Standard metrics such as Word Error Rate (WER) and Character Error Rate (CER) are used for ASR and text generation, with new paradigms like minimum edit distance (minED) bridging between black-box embedding-based metrics and interpretable error rates:

$\mathrm{minED}_m(H, R) = \min \{ k \geq 0 : \exists H' \in E_k(H),\ m(H', R) \leq \theta \}$

This supports the alignment of error measurements with human acceptability and subjective criticality, stratified by error type and linguistic class (Bañeras-Roux et al., 5 May 2026).
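The minED definition can be sketched as a brute-force search. The example below is a simplification under stated assumptions: hypothesis and reference have the same number of words, the only edits considered are substitutions that replace a wrong hypothesis word with the aligned reference word, and WER stands in for the metric $m$. The function names are hypothetical, and any callable returning a lower-is-better score could be passed in place of `wer`.

```python
from itertools import combinations

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete from hyp
                            curr[j - 1] + 1,          # insert into hyp
                            prev[j - 1] + (h != r)))  # substitute or match
        prev = curr
    return prev[-1]

def wer(hyp, ref):
    return edit_distance(hyp, ref) / len(ref)

def min_ed(hyp, ref, metric, theta):
    """Smallest number of corrections k such that some corrected
    hypothesis scores metric(H', R) <= theta (equal lengths assumed)."""
    if len(hyp) != len(ref):
        raise ValueError("this sketch assumes equal-length sequences")
    wrong = [i for i, (h, r) in enumerate(zip(hyp, ref)) if h != r]
    for k in range(len(wrong) + 1):
        for positions in combinations(wrong, k):
            candidate = list(hyp)
            for i in positions:
                candidate[i] = ref[i]
            if metric(candidate, ref) <= theta:
                return k
    return None

ref = "the cat sat on the mat".split()
hyp = "the bat sat in a mat".split()
baseline_wer = wer(hyp, ref)
k = min_ed(hyp, ref, metric=wer, theta=0.2)
```

With an embedding-based metric in place of WER, the returned count expresses an opaque similarity threshold as a number of word-level corrections.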

3. Taxonomy Design: Structure, Metrics, and Validation

Taxonomies are operationalized along dimensions such as exclusivity (mutual disjointness of categories), coverage (proportion of data assignable to a specific class), balance (entropy/uniformity of class instance distribution), and usability (learnability for models and annotators) (Zou et al., 17 Feb 2025). Metric definitions include:

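Exclusivity: for a sample $x$ with $k$ candidate labels $\{l_1, \dots, l_k\}$ and a threshold $\tau$, exclusivity scores the extent to which at most one candidate label applies to $x$ at that threshold.

Coverage: the proportion of samples assignable to a specific class.
Balance (normalized entropy): $-\frac{1}{\log C} \sum_{c=1}^{C} p_c \log p_c$, where $p_c$ is the fraction of instances assigned to class $c$ and $C$ is the number of classes.
Usability: Macro-/micro-F1 and inter-annotator agreement $\kappa$.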
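Coverage and balance are simple enough to sketch directly. The version below assumes that unassignable samples carry a fallback label (here the hypothetical string `"other"`), that coverage is the share of samples with any other label, and that balance is the Shannon entropy of the class distribution divided by the log of the number of observed classes. These conventions are illustrative and may differ from those of Zou et al.

```python
import math
from collections import Counter

def coverage(labels, fallback="other"):
    """Share of samples assigned to a specific (non-fallback) class."""
    return sum(1 for label in labels if label != fallback) / len(labels)

def balance(labels, fallback="other"):
    """Normalized entropy of the class distribution, in [0, 1]."""
    counts = Counter(label for label in labels if label != fallback)
    if len(counts) < 2:
        return 0.0  # a single class carries no entropy
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total)
                   for n in counts.values())
    return entropy / math.log(len(counts))

# Toy annotations, invented for the example.
labels = ["spelling", "spelling", "grammar", "grammar", "other"]
cov = coverage(labels)
bal = balance(labels)
skewed = balance(["spelling"] * 9 + ["grammar"])
```

A high fallback rate lowers coverage, and a skewed class distribution lowers balance; both are signals that the category set may need refinement.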

In automatic text simplification evaluation (Vendeville et al., 22 May 2025), error taxonomies structured by fluency, alignment, information, and simplification are the basis for AUROC- and AUPRC-based metric benchmarking, with empirical error detection ceilings far below human performance.

Domain-specific taxonomies have been deployed and validated via systematic annotation, statistical agreement, and empirical coverage statistics across language error detection (Zou et al., 17 Feb 2025), LLM error analysis (Ashury-Tahan et al., 22 Jan 2026), retrieval-augmented generation (Leung et al., 15 Oct 2025), and tabular data error cataloging (Bhadauria et al., 10 Apr 2026).

4. Error Taxonomies in LLMs, Retrieval, and Data Management

4.1. LLM Error Signatures

ErrorMap and ErrorAtlas construct model-agnostic error profiles consisting of error rate, failure signature (category prevalence), coverage, taxonomy accuracy, and robustness to prompt variation. ErrorAtlas's 17-class top-level taxonomy includes logical reasoning, missing elements, computation, identification, specification, output formatting, factuality, incomplete reasoning, and domain-specific failures. These structures enable differential debugging, benchmarking, and targeted improvement (Ashury-Tahan et al., 22 Jan 2026).
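Two components of such a profile, the error rate and the failure signature, can be sketched as follows. The record format (a correctness flag plus an assigned category) and the function name are assumptions for illustration and do not reflect the actual ErrorMap or ErrorAtlas implementation.

```python
from collections import Counter

def error_profile(records):
    """records: list of (is_correct, category) pairs, where category
    is None for correct outputs. Returns (error_rate, signature)."""
    failures = [category for ok, category in records if not ok]
    error_rate = len(failures) / len(records)
    counts = Counter(failures)
    # Failure signature: prevalence of each category among failures.
    signature = {category: n / len(failures)
                 for category, n in counts.items()}
    return error_rate, signature

# Toy evaluation records, invented for the example.
records = [
    (True, None),
    (False, "computation"),
    (False, "factuality"),
    (False, "computation"),
    (True, None),
]
rate, signature = error_profile(records)
```

Two models with the same error rate can have very different signatures, which is what makes the profile useful for differential debugging.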

4.2. Retrieval-Augmented Generation

RAG systems present 16 error types across chunking, retrieval, reranking, and generation stages. RAGEC employs simple interpretable metrics (retrieval/recall, coverage, self-consistency via LLM voting) to automatically align errors with their pipeline stage and enable targeted intervention (Leung et al., 15 Oct 2025).

4.3. Tabular Data Error Catalogs

A comprehensive catalog aggregates 35 error types under three categories (missing, incorrect, and redundant), with formal cell-, tuple-, and attribute-level error rates. Statistical error indicators (outliers and bias) are explicitly formalized via distributional measures (e.g., z-scores, divergence) (Bhadauria et al., 10 Apr 2026). This framework supports systematic data profiling and preemptive quality controls in data-driven systems.
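The three granularities of error rate can be illustrated on a boolean error mask. The sketch assumes that `mask[r][c]` is `True` when the cell in row `r`, column `c` is erroneous, that a tuple counts as erroneous if any of its cells is, and that the attribute-level rate is the per-column share of erroneous cells. The z-score check is a generic outlier indicator, and the threshold used in the example is arbitrary.

```python
from statistics import mean, pstdev

def error_rates(mask):
    """Return (cell_rate, tuple_rate, attribute_rates) for a
    row-major boolean error mask."""
    n_rows, n_cols = len(mask), len(mask[0])
    cell_rate = sum(sum(row) for row in mask) / (n_rows * n_cols)
    tuple_rate = sum(1 for row in mask if any(row)) / n_rows
    attribute_rates = [sum(row[c] for row in mask) / n_rows
                       for c in range(n_cols)]
    return cell_rate, tuple_rate, attribute_rates

def zscore_outliers(values, threshold=3.0):
    """Indices whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Toy mask and column values, invented for the example.
mask = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
cell_rate, tuple_rate, attribute_rates = error_rates(mask)
outliers = zscore_outliers([10, 11, 9, 10, 50], threshold=1.5)
```

Reporting all three rates together shows whether errors are spread thinly across many rows or concentrated in a few attributes.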

5. Taxonomy-Guided Metrics in Advanced and Safety-Critical Systems

5.1. Consistency and Confidence Frameworks

Novel error taxonomies such as DECK partition LLM hallucinations along two axes: inter-sample consistency (semantic entropy, pairwise agreement) and token-level confidence (sequence/log-probability, minimum token probability, negentropy). A 2×2 grid (Drift, Entrenched, Confabulation, Knotted), operationalized by Youden’s J thresholding, captures the detectability regime of each error type. These regimes map to different uncertainty quantification (UQ) families (black-box, white-box, judge) and expose persistent “universal” blind spots in output-level UQ (Chauhan, 1 Jun 2026).
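Youden's J thresholding itself is standard and can be sketched independently of DECK. The example assumes a signal where higher values indicate a more likely error (for instance an entropy-style score) and picks the cut-off that maximizes J = TPR - FPR. The helper `grid_cell` merely thresholds the two axes; how the four resulting cells correspond to the named DECK categories is not specified here, so the sketch leaves them unnamed.

```python
def youden_threshold(scores, is_error):
    """Cut-off t maximizing J = TPR - FPR, where a sample is flagged
    as an error when score >= t."""
    n_pos = sum(is_error)
    n_neg = len(is_error) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tpr = sum(1 for s, e in zip(scores, is_error)
                  if e and s >= t) / n_pos
        fpr = sum(1 for s, e in zip(scores, is_error)
                  if not e and s >= t) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j

def grid_cell(consistency, confidence, t_consistency, t_confidence):
    """Position in a 2x2 grid as (high_consistency, high_confidence)."""
    return (consistency >= t_consistency, confidence >= t_confidence)

# Toy scores and error labels, invented for the example.
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9, 0.95]
is_error = [False, False, False, False, True, True, True, True, True]
threshold, j_stat = youden_threshold(scores, is_error)
cell = grid_cell(consistency=0.9, confidence=0.2,
                 t_consistency=0.5, t_confidence=0.5)
```

Applying one such threshold per axis yields the four cells of the grid, each of which can then be examined for how well a given UQ family detects the errors in it.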

5.2. Hierarchical and Risk-Adjusted Evaluation

Vision–LLMs, especially in medicine, call for metrics such as Hierarchical Overlap Score (HOS), Hierarchical Distance Score (HDS), and Catastrophic Abstraction Error (CAE) incidence. CAEs, which are tied to the depth of the least common ancestor in the taxonomy, quantify semantically or clinically severe cross-branch misclassifications. Risk-constrained thresholding and taxonomy-structured fine-tuning (e.g., via radial embeddings aligned to ontology topology) achieve marked reductions in catastrophic error while maintaining discriminative power (Schaper et al., 21 Jan 2026).
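The link between LCA depth and catastrophic error can be made concrete with a hedged sketch. It assumes, purely for illustration, that a misclassification counts as catastrophic when the true and predicted labels share no more than `max_lca_depth` ancestors (counting the root), and that incidence is measured over all predictions. The taxonomy, the cut-off, and the function names are hypothetical and are not the definitions used by Schaper et al.

```python
def lca_depth(path_a, path_b):
    """Number of shared ancestors along two root-to-leaf paths."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def cae_incidence(true_paths, pred_paths, max_lca_depth=1):
    """Share of predictions that are wrong AND diverge from the true
    label at or above the given depth (assumed CAE criterion)."""
    catastrophic = sum(
        1 for t, p in zip(true_paths, pred_paths)
        if t != p and lca_depth(t, p) <= max_lca_depth
    )
    return catastrophic / len(true_paths)

# Toy taxonomy paths, invented for the example.
nevus = ("lesion", "benign", "nevus")
keratosis = ("lesion", "benign", "keratosis")
melanoma = ("lesion", "malignant", "melanoma")

true_paths = [nevus, nevus, melanoma]
pred_paths = [nevus, keratosis, nevus]  # correct, near miss, cross-branch
incidence = cae_incidence(true_paths, pred_paths)
```

The near miss within the same branch is not counted, while the cross-branch confusion is, which is the distinction a flat error rate cannot make.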

6. Experimental Protocols, Limitations, and Directions

Robust metric and taxonomy validation demands multi-dataset evaluation, random degradation protocols, permutation-based statistical assessment, and inter-annotator agreement reporting. For reference-free or unsupervised evaluation, new metrics such as CSC and NLIV demonstrate strong empirical correlation with benchmark F1, with rigorous permutation testing substantiating significance (Wullschleger et al., 16 May 2025). In contrast, for text simplification, even the best automated detectors fail to capture rare or nuanced error subtypes (Vendeville et al., 22 May 2025).
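Permutation-based assessment of a metric's correlation with a benchmark score can be sketched as below. The example assumes Pearson correlation as the test statistic and a two-sided test that shuffles one variable; the data, the number of permutations, and the seed are invented for illustration and do not reproduce any reported result.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least as
    large as the observed |r| (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-system scores: a reference-free metric vs. benchmark F1.
metric_scores = [0.42, 0.51, 0.55, 0.63, 0.70, 0.74, 0.81, 0.90]
benchmark_f1 = [0.40, 0.48, 0.58, 0.60, 0.72, 0.71, 0.83, 0.88]

r_observed = pearson(metric_scores, benchmark_f1)
p_value = permutation_p_value(metric_scores, benchmark_f1)
```

Because the null distribution is built from the data itself, the test makes no distributional assumption about the scores, which suits small collections of systems or datasets.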

Principal challenges in metric selection and taxonomy design include: (i) sensitivity to outlier errors and non-leaf misplacements; (ii) the impact of domain-prior or annotation conventions; (iii) presence of hard-to-detect failure regimes; (iv) high prevalence and low detection rate of subtle or composite errors; and (v) the need for protocol- or domain-tuned error granularity.

7. Practical Recommendations and Best Practices

Metrics must be selected to match the invariances, loss structures, and deployment risks of the domain: cost matrices for classification (Ferrer, 2022), semantic/taxonomic alignment in hierarchical annotation (Schaper et al., 21 Jan 2026), or criticality-weighted edit distance in ASR (Bañeras-Roux et al., 5 May 2026).
Hybrid sets—collections of complementary metrics—should be reported for a multidimensional assessment (Botchkarev, 2018, Naser et al., 2020).
Taxonomies should be empirically validated across exclusivity, coverage, balance, and usability, with entropy analyses guiding granularity and annotated “other” rates used as signals for refinement (Zou et al., 17 Feb 2025).
Error signatures and taxonomies enrich model evaluation with actionable diagnostics, enabling both fine-grained benchmarking and error localization for iterative improvement (Ashury-Tahan et al., 22 Jan 2026, Leung et al., 15 Oct 2025).
In high-risk or safety-critical domains, enforce constraints on severe error-rates (e.g., CAE incidence), calibrate thresholds for operational risk, and align latent representation learning with the semantic structure of the error taxonomy (Schaper et al., 21 Jan 2026).
For emerging phenomena such as hallucinations or factuality in generative models, taxonomies stratified along detectability or epistemic axes (consistency/confidence) are essential for rational ensemble design and UQ benchmarking (Chauhan, 1 Jun 2026).

By unifying the plurality of established and recent metrics with well-defined error taxonomies, technical reporting shifts from surface-level task accuracy to an interpretable, actionable, and domain-tuned understanding of system behavior and failure. This enables targeted development, risk control, and principled progress across AI-driven disciplines.