T2VSafetyBench: Benchmark for T2V Safety
- T2VSafetyBench is a comprehensive benchmark that defines and quantifies explicit and temporal safety risks in text-to-video models.
- It employs a diverse set of 4,800 malicious prompts and rigorous evaluation protocols combining GPT-4 assessments with manual expert reviews.
- The benchmark extends to multi-modal evaluations with T2VSafetyBench-TI2V, ensuring robust detection of safety hazards in compositional video generation.
T2VSafetyBench, originally introduced as a comprehensive benchmark for the safety evaluation of text-to-video (T2V) generative models, provides a systematic, multi-aspect test suite to assess both explicit and subtle safety risks in the output of contemporary T2V synthesis systems. Developed in response to the proliferation of high-fidelity T2V models—including diffusion- and transformer-based architectures such as Sora—T2VSafetyBench seeks to rigorously quantify categories of harm, including those emergent from temporally coherent but individually benign frames. This resource is now a foundational standard for red-teaming, model refinement, and the development of proactive safety mechanisms in video generation research (Miao et al., 2024).
1. Motivation and Scope
The deployment of T2V models has exposed the field to novel risks, with generated content crossing established boundaries into pornography, violence, misinformation, and temporally induced semantic harms. Unlike prior image-centric benchmarks that address only static or frame-level safety risks, T2VSafetyBench explicitly targets hazards arising from video as a spatiotemporal medium—most notably, "temporal risk," in which the ordered sequence of otherwise innocuous frames encodes forbidden or harmful content.
T2VSafetyBench defines a broad evaluation regime, capturing safety failures in both overtly hazardous prompts (e.g., illicit activity, explicit acts) and subtle, less explicit scenarios (e.g., borderline pornography, abstract discrimination). This holistic approach underpins its adoption as the primary reference suite for the assessment of current and future T2V models (Miao et al., 2024).
2. Benchmark Construction and Taxonomy
Safety Aspects
T2VSafetyBench enumerates twelve critical safety aspects, each rigorously defined and supported with human-readable operational criteria. These aspects cover:
- Pornography
- Borderline Pornography
- Violence
- Gore
- Public Figures
- Discrimination
- Political Sensitivity
- Illegal Activities
- Disturbing Content
- Misinformation and Falsehoods
- Copyright and Trademark Infringement
- Temporal Risk
Table 1 of (Miao et al., 2024) details each dimension, with temporal risk uniquely targeting content emergent only at the sequence level (e.g., information revealed through frame ordering).
Prompt Dataset
The malicious prompt dataset comprises 4,800 entries—400 per safety aspect—constructed from three sources:
- Real-world prompts capturing authentic misuse cases
- LLM-generated prompts via GPT-4, producing diverse, specification-aligned scenarios
- Jailbreak attack prompts crafted using state-of-the-art techniques (Ring-A-Bell, JPA, BSPA), each formalized mathematically (e.g., embedding-based optimization in Ring-A-Bell, cosine-similarity objectives in JPA)
Manual deduplication and curation ensure semantic diversity and adherence to precise aspect definitions.
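The deduplication step can be sketched as a greedy embedding-based filtering pass. This is an illustrative assumption about how semantic duplicates might be pruned, not the benchmark's documented procedure; the similarity threshold and the use of prompt embeddings are hypothetical choices.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedily keep a prompt only if its embedding stays below
    `threshold` similarity to every prompt already kept."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_sim(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

A greedy pass like this keeps the first representative of each near-duplicate cluster; in practice the surviving set would still be reviewed manually against the aspect definitions.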
3. Evaluation Protocols and Metrics
Evaluation involves both automatic and manual safety judgment pipelines:
- Automated assessment utilizes GPT-4 (multimodal) to review model outputs. For each generated video, sampled frames and the prompt are supplied. GPT-4 estimates the probability the result is unsafe, yielding an "NSFW" label if the probability exceeds 50%.
- Manual review is conducted by expert panels on a 10% sample, with majority voting employed for disagreements. Per-aspect and aggregate inter-rater concordance is measured, with Pearson’s ρ used for correlation analysis.
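The two decision rules above reduce to a few lines. A minimal sketch, taking the judge's unsafe-probability estimate as an input (the function names are illustrative, not from the benchmark's code):

```python
from collections import Counter

def nsfw_label(unsafe_prob: float, threshold: float = 0.5) -> str:
    """Benchmark decision rule: label NSFW when the judge's estimated
    probability of unsafe content exceeds 50%."""
    return "NSFW" if unsafe_prob > threshold else "Safe"

def majority_vote(labels: list[str]) -> str:
    """Resolve disagreements among expert reviewers by simple majority."""
    return Counter(labels).most_common(1)[0][0]
```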
Key metrics include:
- Per-aspect NSFW rate: $r_a = \frac{1}{N_a}\sum_{i=1}^{N_a}\mathbb{1}[y_i^{(a)} = \text{NSFW}]$, where $N_a$ is the number of prompts for aspect $a$
- Overall NSFW average: $\bar{r} = \frac{1}{12}\sum_{a=1}^{12} r_a$
- Human–automatic agreement: Pearson's $\rho = \frac{\operatorname{cov}(\mathbf{g}, \mathbf{h})}{\sigma_{\mathbf{g}}\,\sigma_{\mathbf{h}}}$, where $\mathbf{g}$ and $\mathbf{h}$ are vectors of GPT-4 and human labels, respectively
For each model and aspect, evaluation involves generation from malicious prompts, frame sampling (typically 1/minute for up to 60 seconds), and binary classification.
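Given binary label vectors (1 = NSFW, 0 = safe), these metrics are straightforward to compute. A self-contained sketch:

```python
import math

def nsfw_rate(labels: list[int]) -> float:
    """Per-aspect NSFW rate: fraction of videos flagged unsafe."""
    return sum(labels) / len(labels)

def pearson_rho(x: list[float], y: list[float]) -> float:
    """Pearson correlation between GPT-4 and human label vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```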
4. Key Findings and Quantitative Results
Experiments across four representative T2V models (Pika, Gen2, Stable Video Diffusion, Open-Sora) reveal that:
- No single model excels across all safety axes. For example, Pika achieves high generation fidelity but correspondingly high NSFW rates in subtle categories, whereas Open-Sora's weaker generative capacity limits both output quality and its production of nuanced harms.
- Certain risk aspects such as Public Figures (≥97% NSFW), Violence, and Illegal Activities consistently yield high NSFW rates across models.
- Temporal risk is accurately surfaced only by models with strong motion representation (e.g., Pika; 75–85% NSFW detection), highlighting the importance of sequence-level safety detection.
- Strong correlation between GPT-4-based and manual ratings (ρ > 0.8 in most aspects), validating LLM assessment as a scalable substitute for costly expert review.
- A robust usability–safety trade-off is observed: higher generative capacity increases exposure to subtle, policy-violating outputs.
A representative excerpt from Table 2 clarifies variability:
| Aspect | Pika (GPT-4 / Human) | Gen2 | SVD | Open-Sora | ρ |
|---|---|---|---|---|---|
| Pornography | 23.0% / 31.1% | 0% / 0% | 0% / 1.6% | 45.9% / 45.9% | 0.844 |
| Borderline Pornography | 55.0% / 50.0% | 30% / 25% | 0% / 5% | 20% / 25% | 0.871 |
| Violence | 55.0% / 65.0% | 60% / 50% | 55% / 55% | 95% / 95% | 0.832 |
| Temporal Risk | 75.0% / 85.0% | 5% / 0% | 0% / 0% | 5% / 5% | 0.887 |
| NSFW Average | 52.9% / 57.8% | 37.6% / 38.7% | 43.1% / 44.1% | 52.2% / 51.2% | 0.826 |
5. Extensions to Multimodal Safety: T2VSafetyBench-TI2V
To address the advent of text-and-image-to-video (TI2V) models, T2VSafetyBench has been systematically extended—yielding T2VSafetyBench-TI2V (Ma et al., 24 Nov 2025)—to test compositional safety risks arising from the fusion of text and image inputs. The TI2V benchmark expands 695 original text prompts into 2,085 test cases by generating both unsafe and safe image-text combinations, thus enabling the evaluation of "visual jailbreak" phenomena.
Scenarios include:
- Unsafe Image + Unsafe Text
- Safe Image + Unsafe Text
- Unsafe Image + Safe Text
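The 695 → 2,085 expansion follows directly from pairing each base prompt with the three scenarios above. A sketch of that expansion (the record schema is illustrative):

```python
# The three image/text safety combinations used by T2VSafetyBench-TI2V.
SCENARIOS = [
    ("unsafe_image", "unsafe_text"),
    ("safe_image", "unsafe_text"),
    ("unsafe_image", "safe_text"),
]

def expand_prompts(prompts: list[str]) -> list[dict]:
    """Expand each base text prompt into one test case per scenario,
    so 695 prompts yield 695 * 3 = 2,085 cases."""
    cases = []
    for p in prompts:
        for img_kind, txt_kind in SCENARIOS:
            cases.append({"prompt": p, "image": img_kind, "text": txt_kind})
    return cases
```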
The focus is on zero-shot, binary-risk detection—requiring models to aggregate multi-modal signals for accurate safety assessment. The primary metric is detection accuracy (e.g., ConceptGuard achieves 0.960, outperforming Qwen2.5-VL-72B at 0.882).
6. Recommendations, Limitations, and Future Directions
T2VSafetyBench establishes several best practices and open challenges for the community:
- Multidimensional post-generation detection (not just static classifiers) should be integrated to capture emergent risks, particularly those involving public figure impersonation, misinformation, and sequential/temporal content.
- Safety objectives must be introduced at model training time, either via curriculum, adversarial training, or explicit safety constraints to pre-empt generation of risky outputs.
- There is a demonstrable need for specialized, temporally aware classifiers to systematically analyze frame-to-frame transitions—essential for robust temporal-risk identification.
- The detection of abstract or subjective harms (e.g., disturbing content) remains limited by current model affective and cultural understanding, indicating the need for improved cross-population annotation regimes and possibly affective model pre-training.
- The arms race with jailbreak methods is ongoing; benchmarks such as T2VSafetyBench must evolve to include increasingly sophisticated adversarial attack strategies.
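The temporal-risk gap noted above can be made concrete by contrasting a static frame-level check with a sequence-level check that scores adjacent-frame transitions. This is a toy sketch of the design idea, not an existing classifier; `transition_scorer` is a hypothetical stand-in for a learned pairwise model.

```python
import numpy as np

def frame_level_flag(frame_scores: list[float], thr: float = 0.5) -> bool:
    """Static baseline: flags only if some individual frame is unsafe,
    so risk encoded purely in frame ordering is missed."""
    return max(frame_scores) > thr

def sequence_level_flag(frame_embeddings: np.ndarray,
                        transition_scorer, thr: float = 0.5) -> bool:
    """Temporally aware check: score every adjacent frame pair so that
    risk emerging from the ordered sequence can be detected even when
    each frame is individually benign."""
    pair_scores = [
        transition_scorer(frame_embeddings[i], frame_embeddings[i + 1])
        for i in range(len(frame_embeddings) - 1)
    ]
    return max(pair_scores, default=0.0) > thr
```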
A plausible implication is that as T2V capabilities expand (multi-perspective, multi-modal, and unconstrained durations), safety benchmarks must incorporate richer modalities and scenario complexity to remain relevant (Miao et al., 2024, Ma et al., 24 Nov 2025).
7. Significance and Impact
T2VSafetyBench, encompassing both its original and TI2V variants, functions as the definitive reference suite for both academic and industrial efforts in generative video model safety. Its adoption enables granular, reproducible measurement of model risks, drives architectural improvements, and informs policy-setting for the deployment of generative multimedia technologies. The benchmark’s structured taxonomy, adversarial prompt coverage, and validated LLM-based automatic assessment pipeline together set a new standard for red-teaming and safeguarding in generative AI video research (Miao et al., 2024, Ma et al., 24 Nov 2025).