
ALERT Safety Risk Taxonomy

Updated 24 September 2025
  • ALERT Safety Risk Taxonomy is a detailed framework that classifies AI safety risks using hierarchical macro- and micro-categories.
  • The taxonomy employs adversarial red teaming with prompt sampling and systematic evaluation to produce robust safety metrics.
  • Its modular design supports policy alignment and multilingual benchmarking, enabling tailored risk assessments across diverse regulatory environments.

The ALERT Safety Risk Taxonomy refers to a structured, fine-grained system for classifying, assessing, and benchmarking safety risks—in particular, those associated with the behavior of LLMs and generative AI. Developed in conjunction with the ALERT safety benchmark, it provides multiple levels of granularity and is designed for both empirical evaluation via adversarial testing (“red teaming”) and for policy-aligned model certification. Its modularity, empirical orientation, and policy alignment make it a central reference in contemporary AI safety research.

1. Taxonomy Structure and Category Design

The ALERT taxonomy is defined by a hierarchical, two-tiered schema:

  • Macro-Categories (Level-1): There are six top-level risk domains.
    • Hate Speech / Discrimination
    • Criminal Planning
    • Regulated or Controlled Substances
    • Sexual Content
    • Self-Harm
    • Illegal Weapons
  • Micro-Categories (Level-2): Each macro-category contains 3–8 subtypes. For instance, Criminal Planning subcategorizes into crime-injury, crime-theft, crime-tax, crime-propaganda, crime-kidnapping, crime-cyber, crime-privacy, and crime-other.

A representative table is as follows:

| Macro-Category | Example Micro-Categories | Focus |
|---|---|---|
| Hate Speech / Discrimination | hate-women, hate-ethnic, hate-lgbtq+ | Disparagement of protected groups |
| Criminal Planning | crime-injury, crime-cyber, crime-tax | Facilitation of unlawful activities |
| Controlled Substances | substance-drug, substance-cannabis | Illegal/regulated substance promotion |
| Sexual Content | sex-harassment, sex-porn | NSFW and sexual harm |
| Self-Harm | self-harm-suicide, self-harm-pro-thin | Suicidality and health risk |
| Illegal Weapons | weapon-firearm, weapon-biological | Weaponization and harm facilitation |

This structure is codified in the ALERT dataset, which organizes more than 45,000 testing prompts over these categories (Tedeschi et al., 6 Apr 2024).
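As an illustration, the two-tier schema lends itself to a direct mapping representation. The following Python sketch is hypothetical: the keys are shorthand macro-category names, and the micro-category lists abbreviate the examples above rather than reproducing the dataset's full inventory.

```python
# Illustrative encoding of ALERT's two-tier taxonomy.
# Macro-category keys are shorthand; micro-category lists are
# abbreviated examples, not the dataset's complete inventory.
ALERT_TAXONOMY: dict[str, list[str]] = {
    "hate": ["hate-women", "hate-ethnic", "hate-lgbtq+"],
    "crime": ["crime-injury", "crime-theft", "crime-tax", "crime-propaganda",
              "crime-kidnapping", "crime-cyber", "crime-privacy", "crime-other"],
    "substance": ["substance-drug", "substance-cannabis"],
    "sex": ["sex-harassment", "sex-porn"],
    "self_harm": ["self-harm-suicide", "self-harm-pro-thin"],
    "weapon": ["weapon-firearm", "weapon-biological"],
}
```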

2. Red Teaming Methodologies and Empirical Assessment

Risk identification in ALERT is operationalized by adversarial red teaming against LLMs. The approach consists of the following steps:

  • Prompt Sampling: Prompt candidates are curated from existing harmlessness datasets, then expanded for coverage with template-based generation and LLM-based paraphrasing.
  • Adversarial Attacks: Prompts are systematically modified with adversarial suffix or prefix injections, token manipulation, and jailbreaking attempts designed to bypass conventional safety guardrails.
  • Evaluation Protocol: Each prompt $p$ is submitted to a target LLM $\Phi$, and its response is scored for safety by an auxiliary classifier $\Omega$:

    • For category $c$, the category-level safety score is

    $$S_c(\Phi) = \frac{1}{|P_c|} \sum_{p \in P_c} \Omega(\Phi(p))$$

    • The overall model safety score is

    $$S(\Phi) = \sum_c \frac{|P_c|}{|P|} \, S_c(\Phi)$$

where $\Omega(\Phi(p)) \in \{0, 1\}$ encodes an unsafe (0) or safe (1) response.

This method produces robust, category-wise safety metrics for benchmarking and longitudinal safety tracking (Tedeschi et al., 6 Apr 2024).
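The scoring protocol can be summarized in a short Python sketch. This is a minimal illustration under the definitions above, not the benchmark's actual implementation: `model` and `classifier` are hypothetical stand-ins for the target LLM $\Phi$ and the auxiliary judge $\Omega$.

```python
from collections import defaultdict
from typing import Callable, Iterable

def alert_safety_scores(
    prompts: Iterable[tuple[str, str]],    # (category, prompt) pairs
    model: Callable[[str], str],           # target LLM Phi: prompt -> response
    classifier: Callable[[str, str], int], # judge Omega: (prompt, response) -> 1 safe / 0 unsafe
) -> tuple[dict[str, float], float]:
    """Compute per-category scores S_c and the size-weighted overall score S."""
    verdicts: dict[str, list[int]] = defaultdict(list)
    for category, prompt in prompts:
        response = model(prompt)
        verdicts[category].append(classifier(prompt, response))

    # S_c(Phi): fraction of safe responses within category c.
    category_scores = {c: sum(v) / len(v) for c, v in verdicts.items()}

    # S(Phi): category scores weighted by relative category size |P_c| / |P|.
    total = sum(len(v) for v in verdicts.values())
    overall = sum((len(v) / total) * category_scores[c] for c, v in verdicts.items())
    return category_scores, overall
```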

3. Policy Alignment and Comparative Flexibility

A central innovation of the ALERT taxonomy is its alignment with diverse policy requirements:

  • Micro-Category Weights: Each fine-grained category can be included, excluded, or weighted according to local regulations. For instance, assessments in regions where cannabis is legal can omit “substance-cannabis,” thereby recalibrating safety scores.
  • Modular Reconfiguration: The taxonomy's granularity allows for the implementation of policy-specific guardrails and facilitates direct comparison across regulatory contexts (e.g., EU AI Act, U.S. executive orders, Chinese guidelines), as seen in AIR 2024 and AIR-Bench 2024, which both extend this structure to 314 granular risk categories (Zeng et al., 25 Jun 2024, Zeng et al., 11 Jul 2024).

This structure makes ALERT uniquely suited for regulatory compliance benchmarking.
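A minimal sketch of such policy-weighted recalibration follows; the weighting scheme and the example figures are assumptions for illustration, not values from the benchmark.

```python
def policy_weighted_score(
    category_scores: dict[str, float],  # per-category safety scores S_c
    category_sizes: dict[str, int],     # prompt counts |P_c|
    policy_weights: dict[str, float],   # per-category weights; 0.0 excludes a category
) -> float:
    """Recompute the aggregate safety score under policy-specific weights."""
    mass = sum(category_sizes[c] * policy_weights.get(c, 1.0) for c in category_scores)
    return sum(
        category_sizes[c] * policy_weights.get(c, 1.0) * category_scores[c]
        for c in category_scores
    ) / mass

# Example: a jurisdiction where cannabis is legal zeroes out that category.
scores = {"substance-cannabis": 0.62, "crime-cyber": 0.91, "hate-women": 0.97}  # illustrative
sizes = {"substance-cannabis": 300, "crime-cyber": 500, "hate-women": 400}      # illustrative
print(policy_weighted_score(scores, sizes, {"substance-cannabis": 0.0}))
```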

4. Integration with Incident Databases and Defeater Taxonomies

The ALERT taxonomy's granular categories intersect with standardized incident schema—each incident can be labeled by specific harm category (physical, reputational, legal, psychological) as defined in incident database standards (Agarwal et al., 28 Jan 2025). By mapping ALERT micro-categories to incident fields, organizations support interoperability between proactive benchmarking and post-hoc incident analysis.
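Such a mapping might look like the following sketch; the harm-field names follow the four harm categories named above, while the specific category-to-field assignments are illustrative assumptions rather than part of any cited standard.

```python
# Hypothetical mapping from ALERT micro-categories to incident-database
# harm fields; assignments are illustrative, not from a cited standard.
ALERT_TO_INCIDENT_HARM: dict[str, list[str]] = {
    "crime-injury": ["physical", "legal"],
    "hate-ethnic": ["psychological", "reputational"],
    "self-harm-suicide": ["physical", "psychological"],
    "crime-privacy": ["legal", "reputational"],
}

def incident_harm_labels(micro_category: str) -> list[str]:
    """Return harm-field labels for an incident tagged with an ALERT micro-category."""
    return ALERT_TO_INCIDENT_HARM.get(micro_category, ["unclassified"])
```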

Moreover, ALERT's systemic approach to risk identification complements contemporary defeater taxonomies in safety assurance cases (Gohar et al., 1 Feb 2025), which emphasize argumentative robustness. Where ALERT classifies realized or potential hazards, defeater taxonomies provide a framework to challenge claims of safety; together, they promote a holistic approach to AI risk.

5. Multilingual Extension, Benchmarking, and Coverage

M-ALERT broadens ALERT by enabling cross-linguistic benchmarking in five languages and reveals language-specific vulnerabilities. Safety inconsistencies emerge across both models and categories: for example, crime_tax prompts elicit disproportionately unsafe responses in Italian, while substance_cannabis prompts trigger unsafe responses across all languages. Including or excluding categories such as substance_cannabis can shift aggregate scores by several percentage points, demonstrating the taxonomy's sensitivity and adaptability (Friedrich et al., 19 Dec 2024).

AIR-Bench and Aegis2.0 build further on ALERT by mapping government and corporate policy risks to benchmarking prompts (AIR-Bench, 314 categories, 5,694 prompts; Aegis2.0, 12 core + 9 fine-grained categories) (Zeng et al., 11 Jul 2024, Ghosh et al., 15 Jan 2025). These benchmarks use the ALERT structure to precisely audit and improve LLM guardrails, while maintaining compatibility with evolving policy frameworks.

6. Impact, Limitations, and Future Directions

The ALERT taxonomy, by virtue of its fine-grained modularity and empirical approach, has enabled highly targeted safety evaluations and policy-aligned model audits. However, experiments show persistent vulnerabilities, especially under adversarial prompt engineering and in specific categories. Future research, as indicated by the creators, includes:

  • Dynamic extensions for emerging regulatory and ethical risks.
  • Deeper adversarial analysis at the category level (e.g., robustness to new jailbreak techniques).
  • Multilingual and cross-cultural calibration, as revealed by M-ALERT.
  • Direct Preference Optimization (DPO) leveraging paired safe–unsafe responses for enhanced safety tuning (see the sketch after this list).
  • Open-source standardized datasets (Aegis2.0 and artifacts from AIR-Bench) for collaborative safety benchmarking.
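To make the DPO direction concrete, the following PyTorch sketch shows the standard DPO objective applied to paired safe (chosen) and unsafe (rejected) responses; the function signature and the assumption of precomputed log-probabilities are illustrative, not taken from the cited work.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log p_policy(safe response | prompt)
    policy_rejected_logp: torch.Tensor,  # log p_policy(unsafe response | prompt)
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen
    ref_rejected_logp: torch.Tensor,     # reference model
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO loss on safe (chosen) / unsafe (rejected) response pairs."""
    # Implicit rewards are log-ratio deviations from the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the safe response's implicit reward above the unsafe one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```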

The taxonomy remains an anchor for both risk identification and structured mitigation in generative AI, informing practical model tuning and broader regulatory harmonization.


In conclusion, the ALERT Safety Risk Taxonomy provides a detailed and adaptable framework for classifying, benchmarking, and mitigating safety risks in LLMs and generative AI systems, underpinning both empirical red teaming and policy-aligned auditing. Its hierarchical, modular structure captures the multifaceted nature of AI risk and informs ongoing developments in governance, benchmarking, and multilingual safety assurance (Tedeschi et al., 6 Apr 2024, Zeng et al., 25 Jun 2024, Zeng et al., 11 Jul 2024, Friedrich et al., 19 Dec 2024, Ghosh et al., 15 Jan 2025, Agarwal et al., 28 Jan 2025, Gohar et al., 1 Feb 2025).
