SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image and Video Generation
The paper introduces SAFREE, a method designed to enhance the safety of text-to-image (T2I) and text-to-video (T2V) generation models without modifying the underlying model weights or requiring additional training. Its core contribution is a framework that filters unsafe content by operating directly in the text embedding and visual latent spaces. The following essay explores the methodology, empirical results, and implications of this approach.
Methodology Overview
SAFREE operates through a series of steps targeting unsafe concept removal:
- Identification of the Toxic Subspace: The method first identifies a subspace of the text embedding space that corresponds to undesired or toxic concepts, then measures each prompt token's projection onto this subspace to pinpoint the tokens likely to invoke those concepts.
- Adaptive Token Filtering: Once the relevant tokens are detected, SAFREE projects their embeddings onto the orthogonal complement of the toxic subspace, removing the unsafe component while keeping the result within the original embedding space to preserve the prompt's semantics (see the first sketch after this list).
- Self-Validating Filtering: This mechanism adaptively decides for how many denoising steps the filtered embedding is applied, modulating the filtering strength to the input so that concept removal is balanced against content fidelity (second sketch below).
- Latent Re-Attention: An adaptive mechanism in the diffusion model's latent space additionally filters content at the level of spatial regions. This step operates in the frequency domain and preserves low-frequency features, which carry global structure, to maintain output quality (third sketch below).
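To make the first two steps concrete, here is a minimal PyTorch sketch of subspace detection and orthogonal-projection filtering. The function names, the norm-ratio detection score, and the threshold `tau` are illustrative assumptions, not SAFREE's published implementation; only the underlying idea, projecting token embeddings onto (and away from) the span of unsafe-concept embeddings, comes from the paper's description.

```python
import torch

def toxic_projector(concept_embs: torch.Tensor) -> torch.Tensor:
    """Projection matrix onto the subspace spanned by unsafe-concept embeddings.

    concept_embs: (k, d) matrix whose rows embed k unsafe concept phrases.
    """
    # P = C^T (C C^T)^{-1} C projects any d-vector onto span(rows of C).
    C = concept_embs
    return C.T @ torch.linalg.inv(C @ C.T) @ C            # (d, d)

def detect_toxic_tokens(token_embs: torch.Tensor, P: torch.Tensor, tau: float):
    """Flag tokens whose embeddings lie close to the toxic subspace."""
    proj = token_embs @ P                                  # in-subspace component
    scores = proj.norm(dim=-1) / token_embs.norm(dim=-1).clamp_min(1e-8)
    return scores > tau                                    # (n,) boolean mask

def filter_prompt(token_embs: torch.Tensor, P: torch.Tensor, tau: float = 0.5):
    """Replace flagged tokens with their orthogonal-complement projection,
    removing the toxic component while staying in the embedding space."""
    mask = detect_toxic_tokens(token_embs, P, tau)
    eye = torch.eye(P.shape[0], dtype=P.dtype)
    safe = token_embs @ (eye - P)                          # orthogonal projection
    return torch.where(mask[:, None], safe, token_embs)
```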
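The self-validating step admits an equally small sketch. The linear schedule below, driven by how far filtering moved the prompt embedding, is one plausible reading assumed for illustration; the paper's exact self-validation rule may differ.

```python
import torch
import torch.nn.functional as F

def num_filtered_steps(orig_emb: torch.Tensor, filtered_emb: torch.Tensor,
                       total_steps: int = 50) -> int:
    """More deviation after filtering suggests a more toxic prompt, so the
    filtered conditioning is applied for more denoising steps."""
    sim = F.cosine_similarity(orig_emb.flatten(), filtered_emb.flatten(), dim=0)
    strength = (1.0 - sim).clamp(0.0, 1.0)   # ~0 for benign, larger for toxic
    return int(total_steps * strength.item())
```

During sampling, the first `num_filtered_steps` iterations would then condition on the filtered embedding and the rest on the original, so prompts that the filter barely touches are generated essentially unchanged.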
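Finally, a rough sketch of the frequency-domain intuition behind latent re-attention: blend the latents produced under the original and filtered conditioning, letting low frequencies (global structure and overall quality) lean on the original while the filtered latent dominates elsewhere. The circular mask, `radius`, and `alpha` are illustrative choices, not the paper's actual re-attention mechanism.

```python
import torch

def fourier_blend(z_orig: torch.Tensor, z_safe: torch.Tensor,
                  radius: float = 0.25, alpha: float = 0.8) -> torch.Tensor:
    """z_orig, z_safe: (C, H, W) latent feature maps from the original and
    filtered conditioning, blended in the Fourier domain."""
    F_orig = torch.fft.fftshift(torch.fft.fft2(z_orig), dim=(-2, -1))
    F_safe = torch.fft.fftshift(torch.fft.fft2(z_safe), dim=(-2, -1))
    _, H, W = z_orig.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    low = (yy**2 + xx**2).sqrt() < radius     # centered low-frequency disk
    # Keep (1 - alpha) of the safe latent even in the low band so unsafe
    # low-frequency content is still attenuated.
    F_mix = torch.where(low, alpha * F_orig + (1 - alpha) * F_safe, F_safe)
    return torch.fft.ifft2(torch.fft.ifftshift(F_mix, dim=(-2, -1))).real
```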
Because SAFREE requires no training, it can be applied across models and tasks, from advanced T2I models like SDXL to T2V models such as ZeroScopeT2V and CogVideoX.
Empirical Evaluation
SAFREE demonstrates impressive performance across several benchmarks:
- Attack Success Rate: The method consistently achieves a lower attack success rate (ASR) than existing training-free methods, substantially reducing the generation of unsafe content from adversarial prompts on benchmarks such as I2P and P4D.
- Artist-Style Removal: SAFREE removes a targeted artist's style without degrading other artistic attributes, indicating precise, well-localized concept removal.
- Content Quality: Despite the filtering, SAFREE maintains generation quality comparable to models without safety constraints, as measured by FID and CLIP score.
Implications and Speculations
The implications of SAFREE extend across both practical and theoretical domains. Practically, it offers an adaptable, efficient safeguard for T2I and T2V models, which is crucial as the deployment of generative AI expands into sensitive or public-facing applications. Theoretically, it underscores the potential of leveraging text embedding adjustments to manipulate and control model outputs without extensive retraining or model modification.
Looking forward, research might focus on better identifying implicitly toxic content or on extending the method's adaptivity to more complex inputs. SAFREE's approach of filtering via orthogonal projections within embedding spaces could also inform work on AI interpretability and control, offering insight into how models internally represent and process complex input semantics.
In conclusion, SAFREE offers a training-free, dynamically adaptive pathway for making generative AI systems safer without compromising the integrity or quality of benign outputs. As generative models continue to develop, such measures will be integral to deploying them within ethical and societal frameworks.