Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

Published 19 Apr 2025 in cs.CV | (2504.14290v1)

Abstract: Ensuring the safety of generated content remains a fundamental challenge for Text-to-Image (T2I) generation. Existing studies either fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. To address these issues, we propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into the general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs. In SC-DPO, we introduce a safety cost model to accurately quantify harmful levels for images, and train it effectively using the proposed contrastive learning and cost anchoring objectives. To apply SC-DPO for effective T2I safety alignment, we constructed SCP-10K, a safety-constrained preference dataset containing rich harmful concepts, which blends safety-constrained preference pairs under both harmful and clean instructions, further mitigating the trade-off between safety and sample quality. Additionally, we propose a Dynamic Focusing Mechanism (DFM) for SC-DPO, promoting the model's learning of difficult preference pair samples. Extensive experiments demonstrate that SC-DPO outperforms existing methods, effectively defending against various NSFW content while maintaining optimal sample quality and human preference alignment. Additionally, SC-DPO exhibits resilience against adversarial prompts designed to generate harmful content.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces Safety-Constrained Direct Preference Optimization, integrating safety constraints into T2I models to mitigate NSFW content.
It employs a novel combined reward function and a safety cost model using a frozen CLIP encoder and learnable adapter to balance image quality with safety.
Experimental evaluations show lower inappropriate probability scores and robust resistance to adversarial attacks compared to baseline methods.

Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization

Introduction

The increasing sophistication of Text-to-Image (T2I) generative models has been accompanied by a growing concern about the safety and ethics of generated content, especially with respect to NSFW (Not Safe For Work) material. Traditional methods for ensuring safety in T2I generation, such as filtering-based or concept erasure techniques, often suffer from limitations including ineffective handling of adversarially crafted prompts and a detrimental impact on image quality. This paper introduces Safety-Constrained Direct Preference Optimization (SC-DPO), a novel approach that incorporates safety cost constraints into human preference alignment, aiming to generate human-preferred images while minimizing safety risks.

Figure 1: Illustration of Core Motivation. The proposed SC-DPO aims to incorporate safety cost constraints into human preference alignment, encouraging the diffusion model to maximize the likelihood of generating human-preferred samples while minimizing the safety cost of the generated outputs, thereby enhancing the safety alignment capability.

Safety-Constrained Direct Preference Optimization

Framework and Objectives

SC-DPO builds upon the Direct Preference Optimization (DPO) paradigm traditionally used in LLMs and adapts it to T2I diffusion models. The central idea is to integrate safety-related constraints into the broader human preference alignment process. A combined reward function is leveraged, defined as:

$R_{\lambda}(T,I) = R(T,I) - \lambda \cdot C(I)$

where $R(T,I)$ is a preference reward function capturing semantic alignment and image quality, while $C(I)$ is the safety cost function assessing harmfulness. The parameter $\lambda$ controls the trade-off between safety and generation quality.

Safety Cost Model

A critical component of SC-DPO is the Safety Cost Model, which quantifies the harmfulness of images. This model is trained using a combination of contrastive learning and cost anchoring objectives. The model incorporates a pre-trained image encoder with a learnable adapter layer to predict safety costs for image inputs.

Figure 2: Structure and Training of the Safety Cost Model. We use a frozen CLIP (ViT-H/14) visual encoder and a learnable adapter to predict the safety cost of the samples.

Dataset Construction

To support the training of SC-DPO, a novel dataset, SCP-10K, is constructed. This dataset comprises safety-constrained preference pairs that highlight a variety of harmful concepts. The data creation process involves leveraging predefined harmful terms to identify and filter harmful prompts from DiffusionDB, followed by a safety-aware inpainting process for generating content-consistent safe images.

Figure 3: Creation Process of Safety-Constrained Preference Pairs. We use predefined harmful concepts to filter harmful prompts from DiffusionDB and generate harmful images with SDXL.

Experimental Evaluation

Safety Alignment Performance

The SC-DPO framework was evaluated against existing methods across benchmarks for NSFW content mitigation. Remarkably, SC-DPO achieved lower inappropriate probability (IP) scores across multiple datasets, including I2P-Sexual and NSFW-56K, compared to state-of-the-art baselines such as UCE and SafetyDPO. SC-DPO effectively minimized the generation of sensitive body parts and sustained superior safety alignment performance across various NSFW content categories.

Figure 4: Visualization of Generated Images under Different Harmful Prompts. We compared SC-DPO with the original SD-v1.5, UCE, and SafetyDPO, demonstrating SC-DPO's superior safety alignment performance.

Resistance to Adversarial Attacks

Additionally, SC-DPO showed resilience against state-of-the-art adversarial attacks like SneakyPrompt and MMA, maintaining a low inappropriate content probability compared to other methods. This robustness underscores SC-DPO's effectiveness in adversarial scenarios.

Figure 5: Resistance Against Adversarial Attacks. We employed two advanced text-based T2I adversarial attacks, SneakyPrompt and MMA, and reported the inappropriate content probability under these attack methods.

Conclusion

The paper presents SC-DPO as an innovative scalable framework that successfully integrates safety constraints into human preference optimization for T2I models, achieving a robust balance between safety and generation quality. By introducing effective safety alignment in challenging scenarios and extending resistance to adversarial conditions, SC-DPO represents a significant advancement in ethical AI deployment in creative systems. Future work may focus on applying SC-DPO across a broader range of model architectures and enhancing support for diverse harmful concept categories.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Sign Up to Generate All Videos Subscribe on YouTube

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Sign Up to Generate

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Top Community Prompts

Explain it Like I'm 14

Practical Applications

Conceptual Simplification

Sign Up to Activate View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

Collections

Sign up for free to add this paper to one or more collections.