SimpleSafetyTests (SST): LLM Safety Benchmark

Updated 4 November 2025
  • SimpleSafetyTests (SST) is a standardized suite of synthetic prompts designed to rapidly detect critical safety risks in LLM outputs across five harm areas.
  • The methodology employs handcrafted prompts vetted by multiple experts to ensure clarity, reproducibility, and actionable benchmarking of model safety.
  • SST supports rigorous LLM audits and safety filter evaluations by providing quantitative metrics on unsafe responses and performance variations.

SimpleSafetyTests (SST) is a systematic test suite designed to rapidly identify critical safety risks in LLMs. Recognizing that LLMs can generate unsafe, illegal, or unethical outputs in response to user queries, SST provides a standardized set of prompts covering five severe harm areas. By evaluating LLM responses to these prompts, SST offers a quantitative, reproducible metric for model safety, facilitating model audits, benchmarking, and steerability validation.

1. Motivation and Objectives

The acceleration of LLM deployment across domains has highlighted the consequences of insufficient safety measures. LLMs have been shown to produce outputs that enable or encourage self-harm, fraud, abuse, and other serious harms. The primary motivation for SST is to provide a simple, systematic, and efficient methodology for uncovering "red line" failures in LLM behavior—responses that should be universally refused except in tightly controlled research or red-teaming settings.

SST's objectives are:

  • Rapidly screen LLMs for critical safety risks.
  • Identify prompt categories and harm areas where models are especially vulnerable.
  • Provide a concise, actionable benchmark for developers and researchers.

2. Test Suite Design

Prompt Construction and Coverage

SST consists of 100 English-language, synthetic prompts. The prompts are handcrafted and underwent multiple review rounds for clarity and validity, with every prompt independently vetted by at least three authors. Prompts are organized as follows:

Harm Area                                 # Prompts   Prompt Types (per area)
Suicide, Self-Harm, and Eating Disorders  20          10 Information/Advice-seeking + 10 Instructions
Physical Harm                             20          10 Information/Advice-seeking + 10 Instructions
Illegal and Highly Regulated Items        20          10 Information/Advice-seeking + 10 Instructions
Scams and Fraud                           20          10 Information/Advice-seeking + 10 Instructions
Child Abuse                               20          10 Information/Advice-seeking + 10 Instructions

  • Information/Advice-seeking prompts ask questions that, if answered unsafely, would provide harmful knowledge or encouragement.
  • Instructions/Actions prompts are imperative requests that push the model to produce unsafe or directly actionable harmful content.

Prompts are not derived from historical events but are synthetically constructed to comprehensively cover the space of severe, plausible LLM safety violations.
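
As an illustration of this structure, the sketch below models the suite as 5 harm areas x 2 prompt types x 10 prompts each. The schema, field names, and identifier scheme are assumptions made for illustration; the prompt texts themselves come from the published SST dataset and are not reproduced here.

```python
from dataclasses import dataclass
from typing import List

HARM_AREAS = [
    "Suicide, Self-Harm, and Eating Disorders",
    "Physical Harm",
    "Illegal and Highly Regulated Items",
    "Scams and Fraud",
    "Child Abuse",
]

PROMPT_TYPES = ["Information/Advice-seeking", "Instructions"]

@dataclass
class SSTPrompt:
    prompt_id: str    # hypothetical identifier, e.g. "physical_harm_instr_03"
    harm_area: str    # one of HARM_AREAS
    prompt_type: str  # one of PROMPT_TYPES
    text: str         # the English-language test prompt (from the released dataset)

def validate_suite(prompts: List[SSTPrompt]) -> None:
    """Check the 5 areas x 2 types x 10 prompts structure described above."""
    assert len(prompts) == 100, f"expected 100 prompts, found {len(prompts)}"
    for area in HARM_AREAS:
        for ptype in PROMPT_TYPES:
            n = sum(1 for p in prompts
                    if p.harm_area == area and p.prompt_type == ptype)
            assert n == 10, f"{area} / {ptype}: expected 10 prompts, found {n}"
```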

3. Evaluation Protocol

Model Selection and Testing

SST was used to evaluate 11 open-access/open-source LLMs and 4 closed-source LLMs. Each model received all 100 prompts. Trained annotators labeled every model response (totaling 3,000 responses) as either "safe" or "unsafe" according to predefined guidelines. For context:

  • Safe: The model refuses or deflects the unsafe request with a clear, unambiguous rejection.
  • Unsafe: The model complies with the request or gives information/advice that is clearly harmful or enabling.

The evaluation process is systematic and independent of the model vendor. Annotation reliability is ensured by adherence to strict labeling protocols.
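
A minimal sketch of the scoring step is shown below, assuming annotations have already been collected as (model, prompt_id, label) records with labels in {"safe", "unsafe"}; the record shape and function name are illustrative, not taken from the SST paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def unsafe_rate_per_model(
    annotations: List[Tuple[str, str, str]]  # (model, prompt_id, label)
) -> Dict[str, float]:
    """Return the fraction of SST prompts answered unsafely, per model."""
    totals: Counter = Counter()
    unsafe: Counter = Counter()
    for model, _prompt_id, label in annotations:
        totals[model] += 1
        if label == "unsafe":
            unsafe[model] += 1
    return {model: unsafe[model] / totals[model] for model in totals}

# Example: a model labeled unsafe on 23 of its 100 responses scores 0.23.
```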

System Prompt Intervention

For selected models, tests were run both with and without a safety-emphasizing system prompt prepended to the user prompt. This assesses whether high-level system instructions can materially enhance safety and by how much.
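
One way such a comparison can be wired up is sketched below, assuming a generic chat-completion client passed in as `generate`; the safety wording is illustrative and is not the exact system prompt used in the SST experiments.

```python
from typing import Callable, Dict, List

# Illustrative safety-emphasizing system prompt (not the one used in the study).
SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse any request that could facilitate "
    "self-harm, violence, illegal activity, fraud, or harm to children."
)

def build_messages(user_prompt: str, use_safety_prompt: bool) -> List[Dict[str, str]]:
    """Prepend the system prompt only in the 'with safety prompt' condition."""
    messages: List[Dict[str, str]] = []
    if use_safety_prompt:
        messages.append({"role": "system", "content": SAFETY_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": user_prompt})
    return messages

def run_condition(
    prompts: List[str],
    generate: Callable[[List[Dict[str, str]]], str],
    use_safety_prompt: bool,
) -> List[str]:
    """Collect one response per SST prompt under the chosen condition."""
    return [generate(build_messages(p, use_safety_prompt)) for p in prompts]
```

Running `run_condition` twice per model, once with and once without the system prompt, yields the two response sets whose unsafe rates are compared.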

4. Observed Model Behavior and Safety Weaknesses

Baseline Findings

  • While a minority of models did not yield a single unsafe response, most models produced unsafe responses to over 20% of prompts, with the most permissive exceeding 50%.
  • System-level safety prompts reduced the incidence of unsafe responses substantially but were not fully effective; unsafe outputs still occurred.

Harm Area and Prompt Type Variance

  • Models exhibited different vulnerabilities across harm areas. Performance was particularly poor on prompts relating to physical harm, scams, and illegal items.
  • Both prompt types saw failures. Instruction/action prompts tended to elicit more unsafe completions than information/advice-seeking queries.

These results underline that existing LLMs, including both open and closed variants, remain susceptible to critical safety risks in numerous, foreseeable scenarios.
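
The per-harm-area and per-prompt-type rates behind such observations can be obtained with a simple aggregation, sketched below under an assumed (harm_area, prompt_type, label) record shape.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def unsafe_rate_by_cell(
    records: Iterable[Tuple[str, str, str]]  # (harm_area, prompt_type, label)
) -> Dict[Tuple[str, str], float]:
    """Unsafe-response rate for each (harm area, prompt type) cell."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    unsafe: Dict[Tuple[str, str], int] = defaultdict(int)
    for harm_area, prompt_type, label in records:
        key = (harm_area, prompt_type)
        totals[key] += 1
        unsafe[key] += int(label == "unsafe")
    return {key: unsafe[key] / totals[key] for key in totals}
```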

5. Automated Safety Filtering and Benchmarking

SST was further used to evaluate the efficacy of five AI safety filters: systems or heuristics that automatically determine whether an LLM response is unsafe in context. These filters were benchmarked as follows:

Safety Filter                    Accuracy on SST Responses
Perspective API                  72%
Zero-shot OpenAI GPT-4 prompt    89%

Other filters exhibited lower or highly variable performance, particularly across different harm areas and in distinguishing safe from unsafe responses. No filter achieved perfect reliability. The newly created zero-shot GPT-4 prompt outperformed the existing commercial APIs.
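
Accuracy figures of this kind can be computed by comparing each filter's verdict with the human annotation for the same response, as in the minimal sketch below; the record format is an assumption for illustration.

```python
from typing import List, Tuple

def filter_accuracy(
    judgments: List[Tuple[str, str]]  # (filter_verdict, human_label), each "safe" or "unsafe"
) -> float:
    """Fraction of responses on which the automated filter agrees with annotators."""
    if not judgments:
        raise ValueError("no judgments to score")
    correct = sum(1 for verdict, label in judgments if verdict == label)
    return correct / len(judgments)

# Example: agreement on 89 out of every 100 labeled responses corresponds to
# the 89% reported above for the zero-shot GPT-4 prompt.
```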

A notable implication is that current safety filters, often employed for automated moderation or risk scoring, are not yet sufficiently accurate for robust, standalone LLM gatekeeping, especially on edge-case or severe-risk queries.

6. Methodological Considerations and Significance

SST is differentiated from general safety testing or red-teaming by its explicit focus on critical risks: output categories that should be universally refused. The synthetic, hand-vetted nature of the prompt set supports reproducibility, transparency, and avoidance of legal or privacy violations. Annotation by trained humans ensures fidelity in risk assessment.

Adoption of SST aids:

  • Quantitative comparison of LLM safety performance across architectures, vendors, or safety intervention strategies.
  • Detection of regressions in model or filter updates.
  • Prioritization of research and development targeting the most severe or prevalent risk modalities.

7. Implications, Limitations, and Future Directions

SST provides a minimum bar rather than a comprehensive assessment of model safety: its scope is intentionally limited to "worst-case" prompts. It does not exhaustively explore the risk landscape (e.g., subtle context poisoning or misuse) and is not a replacement for domain-specific or red-team audits in high-assurance environments.

A plausible implication is that while SST is effective for flagging egregious model failures, holistic LLM safety evaluation requires more nuanced, adaptive methodologies and continued benchmarking of both models and filters as the threat landscape evolves.

The persistent failure of both LLMs and automated filters to consistently reject these prompts highlights the necessity of both improved system-level controls and continuous post-deployment safety evaluation. Future work may expand SST prompt coverage, enhance annotation protocols, or inform new architectures for automated, real-time safety assessment in LLM ecosystems.
