Taxonomy for Evaluation Criteria

Updated 13 October 2025
  • A taxonomy for evaluation criteria is a structured framework that categorizes and defines evaluation metrics to enhance comparability, reproducibility, and transparency.
  • It guides practitioners by offering unambiguous definitions, hierarchical structures, and adaptable dimensions across domains.
  • The framework supports standardized measurement and regulatory compliance, advancing auditability and methodological innovations in evaluation practices.

A taxonomy for evaluation criteria is a structured framework for systematically categorizing and describing the properties, processes, or metrics by which systems, models, or artifacts are evaluated. Such taxonomies promote rigor, comparability, and reproducibility in scientific and engineering domains, and facilitate the synthesis, communication, and advancement of evaluation practices across diverse fields. The specific schemes, dimensions, and terminology vary widely according to application area, but leading research establishes several converging principles on the organization and utility of evaluation taxonomies.

1. Fundamental Concepts and Objectives

Evaluation criteria taxonomies are developed to address the fragmentation and ambiguity that arise when diverse laboratories, research groups, or industrial actors introduce non-standardized or context-dependent quality metrics. The overarching aims are:

  • To achieve comparability of evaluation results across studies;
  • To guide practitioners in designing robust, repeatable, and interpretable evaluation procedures;
  • To support regulatory and compliance activities by ensuring transparency of reported metrics.

Key properties of a well-constructed evaluation criteria taxonomy include unambiguous definitions, coverage of both qualitative and quantitative aspects, hierarchical structuring by frame of reference, and adaptability across tasks or domains (Belz et al., 26 Sep 2025, Unterkalmsteiner et al., 29 Feb 2024, McCormack et al., 10 Oct 2024).

2. Taxonomic Structures: Hierarchies, Dimensions, and Scoping

Several leading taxonomies organize evaluation criteria along principled axes or hierarchies:

| Taxonomy | Primary Dimensions | Description |
|---|---|---|
| QCET (Belz et al., 26 Sep 2025) | Frame of reference (output in its own right; relative to input; relative to target; external) × quality type (correctness, goodness, feature) × aspect (form/content/whole) | Explicit hierarchical mapping for NLP evaluation |
| TAI (McCormack et al., 10 Oct 2024) | Seven TAI principles: fairness, robustness, privacy, transparency, agency, well-being, accountability | Seven-principle framework tied to EU guidelines |
| SlideAudit (Zhang et al., 5 Aug 2025) | High-level design area (layout, typography, color, imagery, animation) → specific flaw category | Graphic design flaw detection |
| Cloud Eval. (Li et al., 2013) | Performance feature (property/capacity) × experimental scene (environmental/operational) | Empirical decomposition for cloud benchmarking |

Hierarchical taxonomies often assign unique node IDs to each criterion and provide named, domain-grounded definitions at each level. For example, QCET uses a triply nested structure (frame of reference → quality type → aspect) to map the numerous, sometimes ambiguous, “quality criterion names” in the NLP literature to precise meanings (Belz et al., 26 Sep 2025), thus resolving issues where “fluency” or “naturalness” as reported in different papers are non-comparable.
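
As an illustration of this kind of structure, the sketch below models a hierarchical criteria taxonomy with unique node IDs and a paper-name-to-node mapping. The fragment, its IDs, and its definitions are hypothetical placeholders in the spirit of QCET, not the taxonomy's actual content.

```python
from dataclasses import dataclass, field

@dataclass
class CriterionNode:
    """A node in a hierarchical evaluation-criteria taxonomy."""
    node_id: str                      # unique node ID, e.g. "1.1.1"
    name: str                         # canonical criterion name
    definition: str                   # domain-grounded definition
    children: list["CriterionNode"] = field(default_factory=list)

    def find(self, name: str) -> "CriterionNode | None":
        """Depth-first lookup of a criterion by canonical name."""
        if self.name == name:
            return self
        for child in self.children:
            if (hit := child.find(name)) is not None:
                return hit
        return None

# Hypothetical QCET-style fragment:
# frame of reference -> quality type -> aspect.
root = CriterionNode("1", "output in its own right",
    "Quality judged without reference to input or target", [
    CriterionNode("1.1", "goodness", "Degree of quality on a scale", [
        CriterionNode("1.1.1", "fluency (form)",
            "Well-formedness of the output's surface form"),
    ]),
])

# Mapping a paper's reported criterion name onto a unique node
# makes results comparable across papers.
paper_to_node = {"naturalness": "1.1.1"}  # illustrative mapping
print(root.find("fluency (form)").node_id)  # -> "1.1.1"
```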

In design evaluation (e.g., SlideAudit), taxonomy development follows domain theory (e.g., Gestalt principles) with expert-driven iteration to divide subjective flaws into mutually exclusive, actionable categories (Zhang et al., 5 Aug 2025).

In domains requiring multi-stakeholder or system-level assessment, taxonomy structuring may also integrate actor roles or system lifecycle mappings (e.g., AI safety frameworks mapping evaluation criteria to roles such as developer, deployer, or user, and to lifecycle stages) (Xia et al., 8 Apr 2024, McCormack et al., 10 Oct 2024).

3. Criteria, Metrics, and Measurement Approaches

Taxonomies not only name evaluation criteria but often specify measurement techniques, whether these are direct quantitative metrics, qualitative judgments, or composite indicators. Leading approaches distinguish between:

  • Internal vs. External Measurements: Internal measures are derivable from the taxonomy or artifact itself (e.g., count of categories), while external measures are observed when the taxonomy is applied to real data (Unterkalmsteiner et al., 29 Feb 2024).
  • Subjective vs. Objective: Some measures require expert or user judgment (e.g., annotator fluency ratings in NLP or design flaw recognition), others are computationally defined (e.g., classification accuracy, average response time, semantic proximity via embeddings).
  • Direct vs. Derived: Direct measures count features or apply tests directly, whereas derived measures calculate ratios, correlations, or statistics from observed data or internal properties. A small code sketch contrasting these measure types follows this list.
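
As a concrete illustration of these distinctions, the toy functions below instantiate one measure of each kind; the taxonomy representation (a dict of categories) and all names are hypothetical, chosen only to make the distinctions visible.

```python
def internal_breadth(taxonomy: dict) -> int:
    """Internal, direct measure: the number of top-level categories,
    computable from the taxonomy artifact alone."""
    return len(taxonomy)

def external_accuracy(predictions: list, gold: list) -> float:
    """External, objective measure: observable only when the taxonomy
    is applied to real data (here, label agreement with a gold standard)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def derived_coverage_ratio(used_categories: set, taxonomy: dict) -> float:
    """Derived measure: a ratio of a direct count (categories actually
    used) to an internal property (total categories)."""
    return len(used_categories) / len(taxonomy)

taxonomy = {"layout": [], "typography": [], "color": []}
print(internal_breadth(taxonomy))                                        # 3
print(external_accuracy(["layout", "color"], ["layout", "typography"]))  # 0.5
print(derived_coverage_ratio({"layout"}, taxonomy))                      # 0.33...
```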

Some frameworks, such as GANTEE, also blend discriminative metrics (effectiveness, e.g., mean rank, hit@k) with efficiency metrics (time to process) and correctness/precision metrics (accuracy, F1) (Gu et al., 2023).
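
The discriminative metrics named here have standard formulations; the following is a minimal sketch of mean rank and hit@k under the usual 1-based-rank convention, not GANTEE's exact implementation.

```python
def mean_rank(ranks: list[int]) -> float:
    """Effectiveness: average 1-based rank of the correct candidate
    (lower is better)."""
    return sum(ranks) / len(ranks)

def hit_at_k(ranks: list[int], k: int) -> float:
    """Effectiveness: fraction of queries whose correct candidate
    appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 12, 2, 7]      # 1-based ranks of the gold answer per query
print(mean_rank(ranks))       # 5.0
print(hit_at_k(ranks, k=3))   # 0.6
```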

Mathematical formulas underpin some criteria:

  • For example, robustness in taxonomy assessment is measured as

R(T) = \frac{\sum_{i=1}^{n_{groups}} \left(1 - \frac{n_{ic}}{n_{gc} \times (n_{ac} - n_{gc})}\right)}{n_{groups}}

where n_{ic} is the number of intruder characteristics, n_{gc} is the number of characteristics in group i, and n_{ac} is the total number of characteristics (Unterkalmsteiner et al., 29 Feb 2024).

  • Hierarchical precision and recall for taxonomy-aware evaluation in vision-language models are given by

hP = \frac{1}{N} \sum_{n=1}^{N} \frac{|p_n \cap t_n|}{|p_n|}, \quad hR = \frac{1}{N} \sum_{n=1}^{N} \frac{|p_n \cap t_n|}{|t_n|}

where p_n and t_n denote the paths from the root to the predicted and true node, respectively (Snæbjarnarson et al., 7 Apr 2025). Both formulas translate directly into code, as sketched below.
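
In the minimal sketch below, variable names mirror the symbols above; the group counts and root-to-leaf paths are illustrative stand-ins for real evaluation data.

```python
def robustness(groups: list[dict]) -> float:
    """R(T): for each group, 1 minus n_ic / (n_gc * (n_ac - n_gc)),
    averaged over all groups."""
    total = sum(
        1 - g["n_ic"] / (g["n_gc"] * (g["n_ac"] - g["n_gc"]))
        for g in groups
    )
    return total / len(groups)

def hierarchical_pr(pred_paths: list[set],
                    true_paths: list[set]) -> tuple[float, float]:
    """hP and hR: each element is the set of nodes on the path from
    the root to the predicted (p_n) or true (t_n) node."""
    N = len(true_paths)
    hP = sum(len(p & t) / len(p) for p, t in zip(pred_paths, true_paths)) / N
    hR = sum(len(p & t) / len(t) for p, t in zip(pred_paths, true_paths)) / N
    return hP, hR

# Illustrative values: one group of 3 characteristics out of 10 total,
# with 1 intruder; one predicted vs. true path differing at the leaf.
print(robustness([{"n_ic": 1, "n_gc": 3, "n_ac": 10}]))  # ~0.952
print(hierarchical_pr([{"root", "animal", "dog"}],
                      [{"root", "animal", "cat"}]))      # (0.67, 0.67)
```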

4. Applications and Case Studies

Evaluation taxonomies perform foundational roles across scientific and engineering domains:

  • NLP and NLG: The QCET taxonomy enables researchers to disambiguate “fluency,” “accuracy,” and related quality criteria (QCs), thereby supporting directly comparable results across independent experiment suites. Mapping a given paper’s definition of “naturalness” to the QCET node for “Native Speaker Likeness,” for example, resolves ambiguities and enables meaningful cross-paper synthesis and meta-evaluation (Belz et al., 26 Sep 2025).
  • AI Safety and Trustworthiness: The seven-principle taxonomy from TAI frameworks allows regulatory or organizational actors to select relevant, reportable, and auditable metrics spanning fairness, robustness, privacy, transparency, human agency, societal impact, and accountability. It also captures trade-offs and interdependencies between criteria—such as how improvements in fairness sometimes reduce robustness (McCormack et al., 10 Oct 2024, Xia et al., 8 Apr 2024).
  • Design Evaluation: SlideAudit’s taxonomy-driven detection of presentation slide flaws demonstrated that LLMs provided more precise remediation plans and higher flaw detection F1 scores (by 33%) when explicitly prompted using the taxonomy’s categories, underscoring the practical value of structured, standard criteria (Zhang et al., 5 Aug 2025).
  • Performance Benchmarking: Cloud evaluation taxonomies (e.g., property/capacity × environmental/operational scene) have standardized the decomposition and reporting of empirical results by defining all relevant configuration variables and what is being measured, e.g., availability, latency, scalability, variability (Li et al., 2013).
  • Mental Health AI: Psychotherapy AI risk taxonomies integrate clinical standards (e.g., DSM-5) into AI evaluation via immediate/potential risk trees, explicit cognitive model factor tracking (e.g., change in “hopelessness” by ≥1 on a 1–10 scale), and dynamic monitoring over conversational sessions (Steenstra et al., 21 May 2025).

5. Impact, Standardization, and Open Challenges

Standardized taxonomies for evaluation criteria have demonstrable impacts:

  • Repeatability and Comparability: Clear mappings of criteria to uniquely defined nodes permit reliable meta-analysis and scientific reproducibility. Nonstandard or ambiguous QCs (e.g., “fluency” in NLP) have hampered progress due to lack of comparability; taxonomy-driven naming and definition mapping provide a direct solution (Belz et al., 26 Sep 2025).
  • Design Guidance and Auditability: Taxonomies serve as checklists for new evaluation designs and as frameworks for audit and compliance, ensuring that system evaluations are exhaustive, transparent, and interpretable by third parties (McCormack et al., 10 Oct 2024).
  • Domain Adaptability and Expansion: Taxonomies must be extensible to new use-cases. For example, the taxonomy for cloud service evaluation supports smooth expansion (e.g., by appending security or cost dimensions as research demands), and AI safety frameworks integrate evolving lifecycle mapping to stakeholders (Li et al., 2013, Xia et al., 8 Apr 2024).
  • Trade-offs and Interdependencies: A central challenge for taxonomy-based evaluation is the tension among competing dimensions (e.g., trade-offs between fairness and accuracy, extensibility and stability, conciseness and comprehensiveness). A robust taxonomy explicitly marks such interdependencies (McCormack et al., 10 Oct 2024, Unterkalmsteiner et al., 29 Feb 2024).

Persistent barriers include uneven metric maturity across dimensions, lack of interdependency metrics in general use, and difficulties in quantifying certain qualitative, context-sensitive criteria (e.g., societal or environmental well-being in TAI, or personalization in natural language explanations (NLE)) (McCormack et al., 10 Oct 2024, Nejadgholi et al., 11 Jul 2025).

6. Methodological Innovations and Future Prospects

Recent research introduces methodological advances:

  • Automatic, Label-Free Scoring: RaTE leverages masked language models to produce label-free, reproducible evaluation of taxonomic structure, sidestepping the cost and inconsistency of manual evaluation and achieving strong correlation with human rankings (Gao et al., 2023). A minimal sketch of the masked-LM idea follows this list.
  • LLM-Driven Hierarchical Evaluation: LiTe segments large taxonomies into manageable subtrees, employs cross-validation and penalty mechanisms for extreme cases, and delivers robust, quantitative and qualitative assessment aligned with human expert annotation, supporting scalable, automated evaluation pipelines (Zhang et al., 2 Apr 2025).
  • Multi-facet Assessment: Modern taxonomies often split criteria across content, presentation, and user-centered properties to better capture real-world desiderata (e.g., correctness, compactness, actionability, personalization) (Nejadgholi et al., 11 Jul 2025).
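
To make the label-free idea concrete, the sketch below scores taxonomy edges with a masked language model via the Hugging Face fill-mask pipeline. The prompt template, model choice, and averaging are assumptions for illustration of the general idea only, not RaTE's exact procedure.

```python
from transformers import pipeline

# Masked-LM scorer (model choice is an assumption).
fill = pipeline("fill-mask", model="bert-base-uncased")

def edge_score(child: str, parent: str, top_k: int = 50) -> float:
    """Score a hypernym edge by the masked-LM probability that the
    parent fills the blank in a simple is-a template (assumed prompt)."""
    prompt = f"{child} is a kind of {fill.tokenizer.mask_token}."
    for cand in fill(prompt, top_k=top_k):
        if cand["token_str"].strip().lower() == parent.lower():
            return cand["score"]
    return 0.0  # parent not among the top-k predictions

def taxonomy_score(edges: list[tuple[str, str]]) -> float:
    """Label-free structure score: average edge score over all
    (child, parent) pairs in the taxonomy."""
    return sum(edge_score(c, p) for c, p in edges) / len(edges)

print(taxonomy_score([("a dog", "animal"), ("a hammer", "tool")]))
```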

A plausible implication is that evaluation criteria taxonomies will increasingly integrate automated measures with human-in-the-loop oversight, domain-theoretic grounding, dynamic extensibility, and computational traceability as sociotechnical systems grow in scale and impact.

7. Summary Table: Exemplar Taxonomy Structures

| Field/Task | Taxonomy Reference | Primary Axes/Branches |
|---|---|---|
| NLP/NLG | QCET (Belz et al., 26 Sep 2025) | Four frames (own right, input-relative, target-relative, external) × quality type × aspect |
| Trustworthy AI | TAI (McCormack et al., 10 Oct 2024) | Seven EU principles (fairness, robustness, etc.), subdivided into metric-based and qualitative |
| Slide Design | SlideAudit (Zhang et al., 5 Aug 2025) | Dimensions (layout, typography, color, imagery, animation) → 19–27 categorized flaws |
| Cloud Service Evaluation | (Li et al., 2013) | Performance feature (property/capacity) × experimental scene (environmental/operational) |
| Pedagogical LLMs | (Maurya et al., 12 Dec 2024) | Eight pedagogical dimensions (identification, location, guidance, actionability, etc.) |
| AI Trust/Safety | (Xia et al., 8 Apr 2024) | Component- and system-level attributes; safety guardrails; lifecycle mapping to stakeholders |

These architectures exemplify the centrality of taxonomy for evaluation criteria: providing a common language, facilitating reproducibility and comparability, supporting audit and governance, and serving as a foundation for responsible technological innovation.
