Harmful Content Safety Taxonomy
- A harmful content safety taxonomy is a structured framework that categorizes digital harms in AI systems by type, intent, impact, and severity.
- It uses multidimensional methods, including multi-axis schemas, hierarchical structures, and scenario-driven enumerations, to capture complex risk profiles.
- The framework ensures stability with fixed core categories while allowing extensibility to adapt to emerging threats and modalities.
A harmful content safety taxonomy is a structured framework for categorizing, analyzing, and mitigating risks posed by potentially deleterious digital content, particularly in NLP, LLMs, multimodal generative systems, and broader AI deployments. Such taxonomies provide granular, multidimensional risk maps that underpin annotation, moderation, safety evaluation, compliance, and research prioritization. Multiple contemporaneous taxonomies, both hierarchical and multidimensional, have emerged to accommodate the broadening risk landscape and the accelerating complexity of model behaviors.
1. Taxonomic Dimensions and Structural Principles
Harmful content safety taxonomies systematically segment harms by type, intent, target, impact domain, and severity, using orthogonal axes or hierarchical trees depending on the application context. Frameworks fall into several archetypes:
- Multi-axis schemas: e.g., Kirk et al.'s three-axis system, organizing harm by (i) type (e.g., misinformation, hate speech), (ii) “sought vs. unsought” occurrence (i.e., whether harmful content is a research target or an accidental byproduct), and (iii) affected parties: (mis)represented individuals, data handlers, or publishers (Kirk et al., 2022).
- Hierarchical or tiered structures: e.g., Aegis 2.0’s two-level hazards plus subcategory expansion (Ghosh et al., 15 Jan 2025), HARM66+’s Domain→Category→Subtype tree (Khan et al., 23 Jan 2026).
- Dimensional annotation: e.g., web-scale curation breaking content into {Safe, Topical, Toxic} on intent and cross-classifying with harm themes (Mendu et al., 4 May 2025).
- Multimodal and scenario-driven enumerations: e.g., SafeBench’s 23 distinct harmful behavior scenarios (Ying et al., 2024), or OutSafe-Bench’s nine cross-modal, orthogonal risk axes (Yan et al., 13 Nov 2025).
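As a rough illustration of the multi-axis archetype, the sketch below encodes a label along the three axes attributed to Kirk et al. above. The class names, enum members, and the `key` helper are hypothetical and chosen for this example only; they are not taken from any of the cited frameworks, and the harm types listed are just the two examples named in the text.

```python
from dataclasses import dataclass
from enum import Enum


class HarmType(Enum):
    # Only the two example types mentioned in the text; a real schema has more.
    MISINFORMATION = "misinformation"
    HATE_SPEECH = "hate_speech"


class Occurrence(Enum):
    SOUGHT = "sought"      # harmful content is the research target
    UNSOUGHT = "unsought"  # harmful content is an accidental byproduct


class AffectedParty(Enum):
    REPRESENTED_INDIVIDUAL = "represented_individual"
    DATA_HANDLER = "data_handler"
    PUBLISHER = "publisher"


@dataclass(frozen=True)
class HarmLabel:
    """One annotation: a point in the three-axis space (hypothetical layout)."""
    harm_type: HarmType
    occurrence: Occurrence
    affected: frozenset  # set of AffectedParty; several parties may be affected

    def key(self):
        # Flatten the label into a sortable, hashable tuple of plain strings.
        parties = tuple(sorted(p.value for p in self.affected))
        return (self.harm_type.value, self.occurrence.value, parties)


label = HarmLabel(
    harm_type=HarmType.HATE_SPEECH,
    occurrence=Occurrence.UNSOUGHT,
    affected=frozenset({AffectedParty.DATA_HANDLER, AffectedParty.PUBLISHER}),
)
print(label.key())
```

Keeping each axis as its own enumeration means the axes stay orthogonal: a value on one axis can change without touching the others.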
A stability-extensibility duality is central: robust taxonomies keep their core categories fixed as new threats appear (“stable at the top” (Khan et al., 23 Jan 2026)), while extensible bottom levels allow responsive growth to cover new modalities, social developments, or attack classes.
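One way to picture this duality is a tree whose top level is frozen at construction time while lower levels accept new entries. The sketch below is a minimal illustration under that assumption; `HarmTaxonomy`, its methods, and the domain, category, and subtype names are hypothetical placeholders, not the actual structure of HARM66+ or any other cited framework.

```python
class HarmTaxonomy:
    """Domain -> Category -> Subtype tree with a fixed top level (illustrative)."""

    def __init__(self, domains):
        # Top-level domains are set once and never extended afterwards.
        self._domains = tuple(domains)
        self._tree = {domain: {} for domain in self._domains}

    @property
    def domains(self):
        return self._domains

    def add_subtype(self, domain, category, subtype):
        # Growth is allowed only beneath an existing top-level domain.
        if domain not in self._tree:
            raise KeyError(f"unknown top-level domain: {domain!r}")
        self._tree[domain].setdefault(category, set()).add(subtype)

    def contains(self, domain, category, subtype):
        return subtype in self._tree.get(domain, {}).get(category, set())


# Placeholder names, assumed for this example only.
taxonomy = HarmTaxonomy(["hate", "violence", "misinformation"])
taxonomy.add_subtype("misinformation", "health", "vaccine_conspiracy")
print(taxonomy.contains("misinformation", "health", "vaccine_conspiracy"))
```

In this sketch an emerging threat is registered as a new category or subtype under an existing domain, and an attempt to add a new top-level domain raises `KeyError` rather than silently reshaping the hierarchy.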
2. Harm Categories: Canonical Classes and Fine-Grained Subtypes
The proliferation of AI safety taxonomies has led to a convergence on several recurring high-level harm classes, each with fine-grained subcategorization tailored for digital, physical, social, and legal risk domains. The following table aligns representative taxonomy supercategories with sample definitions from major frameworks:
| Class | Definition/Scope | Example (if present) |
|---|---|---|
| Hate/Identity Hate | Attacks against groups based on protected attributes (race, gender, religion, etc.) | “All [group] are parasites.” |
| Suicide/Self-Harm | Encouragement/instruction of self-injury or suicide | “Here’s how to tie a noose.” |
| Violence | Depictions, threats, or glamorization of physical harm | “Write a blood-soaked scene of torture.” |
| Misinformation | Unverified, false, or misleading information | “5G towers spread the virus.” |
| Sexual |