
T2VSafetyBench: Benchmark for T2V Safety

Updated 10 April 2026
  • T2VSafetyBench is a comprehensive benchmark that defines and quantifies explicit and temporal safety risks in text-to-video models.
  • It employs a diverse set of 4,800 malicious prompts and rigorous evaluation protocols combining GPT-4 assessments with manual expert reviews.
  • The benchmark extends to multi-modal evaluations with T2VSafetyBench-TI2V, ensuring robust detection of safety hazards in compositional video generation.

T2VSafetyBench, originally introduced as a comprehensive benchmark for the safety evaluation of text-to-video (T2V) generative models, provides a systematic, multi-aspect test suite to assess both explicit and subtle safety risks in the output of contemporary T2V synthesis systems. Developed in response to the proliferation of high-fidelity T2V models—including diffusion- and transformer-based architectures such as Sora—T2VSafetyBench seeks to rigorously quantify categories of harm, including those emergent from temporally coherent but individually benign frames. This resource is now a foundational standard for red-teaming, model refinement, and the development of proactive safety mechanisms in video generation research (Miao et al., 2024).

1. Motivation and Scope

The deployment of T2V models has exposed the field to novel risks, with generated content traversing established boundaries of pornography, violence, misinformation, and temporally induced semantic harms. Unlike prior image-centric benchmarks that address only static or frame-level safety risks, T2VSafetyBench explicitly targets hazards arising from video as a spatiotemporal medium—most notably, "temporal risk," in which the ordered sequence of otherwise innocuous frames encodes forbidden or harmful content.

T2VSafetyBench defines a broad evaluation regime, capturing safety failures in both overtly hazardous prompts (e.g., illicit activity, explicit acts) and subtle, less explicit scenarios (e.g., borderline pornography, abstract discrimination). This holistic approach underpins its adoption as the primary reference suite for the assessment of current and future T2V models (Miao et al., 2024).

2. Benchmark Construction and Taxonomy

Safety Aspects

T2VSafetyBench enumerates twelve critical safety aspects, each rigorously defined and supported with human-readable operational criteria. These aspects cover:

  1. Pornography
  2. Borderline Pornography
  3. Violence
  4. Gore
  5. Public Figures
  6. Discrimination
  7. Political Sensitivity
  8. Illegal Activities
  9. Disturbing Content
  10. Misinformation and Falsehoods
  11. Copyright and Trademark Infringement
  12. Temporal Risk

Table 1 of (Miao et al., 2024) details each dimension, with temporal risk uniquely targeting content emergent only at the sequence level (e.g., information revealed through frame ordering).

Prompt Dataset

The malicious prompt dataset comprises 4,800 entries—400 per safety aspect—constructed from three sources:

  • Real-world prompts capturing authentic misuse cases
  • LLM-generated prompts via GPT-4, producing diverse, specification-aligned scenarios
  • Jailbreak attack prompts crafted using state-of-the-art techniques (Ring-A-Bell, JPA, BSPA), each formalized mathematically (e.g., embedding-based optimization in RAB, cosine similarity in JPA)

Manual deduplication and curation ensure semantic diversity and adherence to precise aspect definitions.
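The prompt-set structure described above (12 aspects × 400 prompts = 4,800 entries, pooled from three sources and then deduplicated) can be sketched as follows. The source tags, field names, and exact-text deduplication rule are illustrative assumptions, not the authors' pipeline:

```python
# Illustrative sketch of the T2VSafetyBench prompt-set structure:
# 12 safety aspects x 400 prompts each, pooled from three sources.
# Dedup-by-normalized-text is a simplifying stand-in for the manual
# curation described in the paper.

ASPECTS = [
    "Pornography", "Borderline Pornography", "Violence", "Gore",
    "Public Figures", "Discrimination", "Political Sensitivity",
    "Illegal Activities", "Disturbing Content",
    "Misinformation and Falsehoods",
    "Copyright and Trademark Infringement", "Temporal Risk",
]
PROMPTS_PER_ASPECT = 400
SOURCES = ("real_world", "gpt4_generated", "jailbreak")

def build_prompt_set(candidates):
    """candidates: dict mapping aspect -> list of (source, text) pairs."""
    dataset = {}
    for aspect in ASPECTS:
        seen, kept = set(), []
        for source, text in candidates.get(aspect, []):
            key = text.strip().lower()  # naive dedup stand-in
            if key in seen:
                continue
            seen.add(key)
            kept.append({"aspect": aspect, "source": source, "text": text})
            if len(kept) == PROMPTS_PER_ASPECT:
                break
        dataset[aspect] = kept
    return dataset
```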

3. Evaluation Protocols and Metrics

Evaluation involves both automatic and manual safety judgment pipelines:

  • Automated assessment utilizes GPT-4 (multimodal) to review model outputs. For each generated video, sampled frames and the prompt are supplied. GPT-4 estimates the probability the result is unsafe, yielding an "NSFW" label if the probability exceeds 50%.
  • Manual review is conducted by expert panels on a 10% sample, with majority voting employed for disagreements. Per-aspect and aggregate inter-rater concordance is measured, with Pearson’s ρ used for correlation analysis.

Key metrics include:

  • Per-aspect NSFW rate:

\mathrm{NSFW}_{m,k} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{y_{m,k,i} > 0.5\}

  • Overall NSFW average:

\mathrm{NSFW}_m^{\mathrm{avg}} = \frac{1}{12}\sum_{k=1}^{12}\mathrm{NSFW}_{m,k}

  • Human-automatic agreement (Pearson’s ρ):

\rho_k = \frac{\mathrm{Cov}(G_k, H_k)}{\sigma(G_k)\,\sigma(H_k)}

where G_k and H_k are vectors of GPT-4 and human labels, respectively.

For each model and aspect, evaluation involves generation from malicious prompts, frame sampling (typically 1/minute for up to 60 seconds), and binary classification.
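The three metrics above can be computed directly from per-video unsafe probabilities. A minimal sketch, assuming y holds GPT-4's unsafe-probability estimates in [0, 1] (variable and function names are ours):

```python
# Per-aspect NSFW rate, 12-aspect average, and Pearson correlation
# between GPT-4 and human labels, matching the formulas above.
from statistics import mean, pstdev

def nsfw_rate(probs, threshold=0.5):
    """Fraction of videos judged unsafe: (1/N) * sum of 1{y > 0.5}."""
    return sum(p > threshold for p in probs) / len(probs)

def nsfw_average(per_aspect_probs):
    """Mean NSFW rate over the safety aspects (12 in the benchmark)."""
    return mean(nsfw_rate(p) for p in per_aspect_probs.values())

def pearson_rho(g, h):
    """Pearson's rho between GPT-4 labels g and human labels h."""
    mg, mh = mean(g), mean(h)
    cov = mean((a - mg) * (b - mh) for a, b in zip(g, h))
    return cov / (pstdev(g) * pstdev(h))
```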

4. Key Findings and Quantitative Results

Experiments across four representative T2V models (Pika, Gen2, Stable Video Diffusion, Open-Sora) reveal that:

  • No single model excels across all safety axes. For example, Pika achieves high generation fidelity but correspondingly high NSFW rates in subtle categories, while Open-Sora, which is weaker generatively overall, surfaces fewer of the nuanced harms.
  • Certain risk aspects such as Public Figures (≥97% NSFW), Violence, and Illegal Activities consistently yield high NSFW rates across models.
  • Temporal risk is accurately surfaced only by models with strong motion representation (e.g., Pika; 75–85% NSFW detection), highlighting the importance of sequence-level safety detection.
  • Strong correlation between GPT-4-based and manual ratings (ρ > 0.8 in most aspects), validating LLM assessment as a scalable substitute for costly expert review.
  • A robust usability–safety trade-off is observed: higher generative capacity increases exposure to subtle, policy-violating outputs.

A representative excerpt from Table 2 clarifies variability:

| Aspect | Pika (GPT-4 / Human) | Gen2 | SVD | Open-Sora | ρ |
|---|---|---|---|---|---|
| Pornography | 23.0% / 31.1% | 0% / 0% | 0% / 1.6% | 45.9% / 45.9% | 0.844 |
| Borderline Pornography | 55.0% / 50.0% | 30% / 25% | 0% / 5% | 20% / 25% | 0.871 |
| Violence | 55.0% / 65.0% | 60% / 50% | 55% / 55% | 95% / 95% | 0.832 |
| Temporal Risk | 75.0% / 85.0% | 5% / 0% | 0% / 0% | 5% / 5% | 0.887 |
| NSFW Average | 52.9% / 57.8% | 37.6% / 38.7% | 43.1% / 44.1% | 52.2% / 51.2% | 0.826 |

5. Extensions to Multimodal Safety: T2VSafetyBench-TI2V

To address the advent of text-and-image-to-video (TI2V) models, T2VSafetyBench has been systematically extended—yielding T2VSafetyBench-TI2V (Ma et al., 24 Nov 2025)—to test compositional safety risks arising from the fusion of text and image inputs. The TI2V benchmark expands 695 original text prompts into 2,085 test cases by generating both unsafe and safe image-text combinations, thus enabling the evaluation of "visual jailbreak" phenomena.

Scenarios include:

  • Unsafe Image + Unsafe Text (I_U + T_U)
  • Safe Image + Unsafe Text (I_S + T_U)
  • Unsafe Image + Safe Text (I_U + T_S)

The focus is on zero-shot, binary-risk detection—requiring models to aggregate multi-modal signals for accurate safety assessment. The primary metric is detection accuracy (e.g., ConceptGuard achieves 0.960, outperforming Qwen2.5-VL-72B at 0.882).
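The expansion from 695 text prompts to 2,085 test cases follows from pairing each prompt with the three image/text safety combinations above (695 × 3 = 2,085). A hedged sketch; the field names and pairing function are illustrative assumptions, not the authors' code:

```python
# Expand each text prompt into three TI2V test cases, one per
# image/text safety combination from the scenario list above.

SCENARIOS = [
    ("unsafe", "unsafe"),  # I_U + T_U
    ("safe",   "unsafe"),  # I_S + T_U
    ("unsafe", "safe"),    # I_U + T_S
]

def expand_to_ti2v(prompts):
    cases = []
    for prompt in prompts:
        for image_safety, text_safety in SCENARIOS:
            cases.append({
                "prompt": prompt,
                "image_safety": image_safety,
                "text_safety": text_safety,
            })
    return cases
```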

6. Recommendations, Limitations, and Future Directions

T2VSafetyBench establishes several best practices and open challenges for the community:

  • Multidimensional post-generation detection (not just static classifiers) should be integrated to capture emergent risks, particularly those involving public figure impersonation, misinformation, and sequential/temporal content.
  • Safety objectives must be introduced at model training time, either via curriculum, adversarial training, or explicit safety constraints to pre-empt generation of risky outputs.
  • There is a demonstrable need for specialized, temporally aware classifiers to systematically analyze frame-to-frame transitions—essential for robust temporal-risk identification.
  • The detection of abstract or subjective harms (e.g., disturbing content) remains limited by current model affective and cultural understanding, indicating the need for improved cross-population annotation regimes and possibly affective model pre-training.
  • The arms race with jailbreak methods is ongoing; benchmarks such as T2VSafetyBench must evolve to include increasingly sophisticated adversarial attack strategies.

A plausible implication is that as T2V capabilities expand (multi-perspective, multi-modal, and unconstrained durations), safety benchmarks must incorporate richer modalities and scenario complexity to remain relevant (Miao et al., 2024, Ma et al., 24 Nov 2025).

7. Significance and Impact

T2VSafetyBench, encompassing both its original and TI2V variants, functions as the definitive reference suite for both academic and industrial efforts in generative video model safety. Its adoption enables granular, reproducible measurement of model risks, drives architectural improvements, and informs policy-setting for the deployment of generative multimedia technologies. The benchmark’s structured taxonomy, adversarial prompt coverage, and validated LLM-based automatic assessment pipeline together set a new standard for red-teaming and safeguarding in generative AI video research (Miao et al., 2024, Ma et al., 24 Nov 2025).
