Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models

Published 31 Jan 2025 in cs.CV, cs.CR, and cs.LG | (2501.18877v1)

Abstract: Text-to-image diffusion models show remarkable generation performance following text prompts, but risk generating Not Safe For Work (NSFW) contents from unsafe prompts. Existing approaches, such as prompt filtering or concept unlearning, fail to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe contents generation, while reproducing the original safe embeddings. DES also neutralizes the nudity embedding, extracted using prompt "nudity", by aligning it with neutral embedding to enhance robustness against adversarial attacks. These methods ensure both robust defense and high-quality image generation. Additionally, DES can be adopted in a plug-and-play manner and requires zero inference overhead, facilitating its deployment. Extensive experiments on diverse attack types, including black-box and white-box scenarios, demonstrate DES's state-of-the-art performance in both defense capability and benign image generation quality. Our model is available at https://github.com/aei13/DES.

Authors (2)

Summary

The paper introduces DES, a novel mechanism that distorts unsafe embeddings to align them with safe ones, improving defense against multimodal attacks.
It details a training algorithm combining alignment and preservation losses to fine-tune text encoders for enhanced diffusion model security.
Empirical results demonstrate DES's effectiveness in neutralizing adversarial prompts while preserving safe content generation.

Introduction

The paper introduces Distorting Embedding Space (DES), a defense mechanism for text-to-image diffusion models such as Stable Diffusion and DALL·E, which can be misused to generate unsafe content. Conventional defenses such as prompt filtering struggle against adversarial prompts, including Multimodal Attacks (MMA) that exploit both textual and visual modalities. DES instead operates directly in the text encoder's embedding space rather than on the prompt text.

Distorting Embedding Space (DES) Mechanism

Unsafe Prompt Neutralization

DES addresses unsafe content generation by distorting the embeddings of unsafe prompts toward a controlled, safe region of the embedding space. This is formalized as an alignment loss, in which each unsafe embedding $e_{u, j}$ is aligned to its nearest safe embedding $e_{c, j}$. The alignment loss is defined as:

$\mathcal{L}_u = \frac{1}{M} \sum_{j=1}^M \left( 1 - \frac{e_{u, j} \cdot e_{c, j}}{\|e_{u, j}\| \|e_{c, j}\|} \right),$

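where $M$ is the number of unsafe embeddings and $e_{c, j}$ is the closest safe embedding to $e_{u, j}$ within the set of safe embeddings $\mathcal{C}_s$. Each term is one minus a cosine similarity, so the loss reaches zero when every unsafe embedding points in the same direction as its safe target.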
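The alignment loss above can be illustrated with a minimal pure-Python sketch. It assumes that "closest" means highest cosine similarity, which the summary does not state explicitly; the helper names (`cosine_similarity`, `closest_safe`, `alignment_loss`) and the toy 2-D vectors are hypothetical and not taken from the paper or its repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_safe(unsafe, safe_set):
    """Pick the safe embedding most similar to `unsafe`.

    Assumption: closeness is measured by cosine similarity.
    """
    return max(safe_set, key=lambda s: cosine_similarity(unsafe, s))

def alignment_loss(unsafe_batch, safe_set):
    """Mean of (1 - cosine similarity) between each unsafe embedding and its target."""
    total = 0.0
    for e_u in unsafe_batch:
        e_c = closest_safe(e_u, safe_set)
        total += 1.0 - cosine_similarity(e_u, e_c)
    return total / len(unsafe_batch)

# Toy 2-D embeddings (hypothetical values, for illustration only).
safe_set = [[1.0, 0.0], [0.0, 1.0]]
unsafe_batch = [[1.0, 1.0], [2.0, 0.0]]

loss = alignment_loss(unsafe_batch, safe_set)
print(loss)  # approximately 0.146: the first vector is 45 degrees off its target, the second is aligned
```

Because the loss depends only on direction, the second unsafe vector contributes zero even though its norm differs from that of its target.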

Preservation of Safe Prompts

At the same time, DES keeps the embeddings of safe prompts close to what the original encoder produces, so benign generation is not degraded. This is achieved through a preservation loss that minimizes the cosine distance between each safe embedding from the original encoder and its counterpart from the fine-tuned encoder:

$\mathcal{L}_s = \frac{1}{N} \sum_{i=1}^N \left( 1 - \frac{e_{s, i}^{\text{current}} \cdot e_{s, i}^{\text{original}}}{\|e_{s, i}^{\text{current}}\| \|e_{s, i}^{\text{original}}\|} \right).$
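As a small illustration of the preservation loss, the sketch below compares "current" safe embeddings against their "original" counterparts pair by pair. The function names and the toy vectors are hypothetical; a real setup would obtain both sets from the original and fine-tuned text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def preservation_loss(current_batch, original_batch):
    """Mean of (1 - cosine similarity) over paired current/original safe embeddings."""
    if len(current_batch) != len(original_batch):
        raise ValueError("current and original batches must have the same length")
    total = sum(
        1.0 - cosine_similarity(cur, orig)
        for cur, orig in zip(current_batch, original_batch)
    )
    return total / len(current_batch)

# Toy 2-D embeddings (hypothetical values, for illustration only).
original = [[1.0, 0.0], [0.0, 2.0]]
unchanged = [[1.0, 0.0], [0.0, 2.0]]  # encoder output has not moved
drifted = [[0.0, 1.0], [0.0, 2.0]]    # first embedding rotated by 90 degrees

loss_unchanged = preservation_loss(unchanged, original)
loss_drifted = preservation_loss(drifted, original)
print(loss_unchanged, loss_drifted)  # 0.0 and 0.5
```

The loss is zero while the safe embeddings stay where the original encoder put them and grows as fine-tuning rotates them away.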

Adversarial Robustness

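DES is designed to harden the embedding space against adversarial prompts, including multimodal attacks that manipulate both textual and visual inputs. According to the abstract, DES also neutralizes the embedding of the prompt "nudity" by aligning it with a neutral embedding, which improves robustness, and the method is evaluated under both black-box and white-box attack scenarios.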

Training Algorithm

The training process for implementing DES involves fine-tuning a pretrained text encoder $f_{\text{original}}$ using a combination of alignment and preservation losses. The algorithm iteratively updates the model to neutralize unsafe embeddings while preserving the integrity of safe ones. The combined loss function is weighted by hyperparameters $\lambda_u$ and $\lambda_s$ to balance between alignment and preservation.
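Read literally, this weighting suggests a combined objective of the following form. This is a sketch that assumes a plain weighted sum; the summary does not state the exact expression:

$\mathcal{L} = \lambda_u \mathcal{L}_u + \lambda_s \mathcal{L}_s,$

where a larger $\lambda_u$ pushes harder on neutralizing unsafe embeddings and a larger $\lambda_s$ favors keeping safe embeddings unchanged.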

The detailed algorithm is as follows:

1. Initialize the current model $f_{\text{current}}$ as a copy of the pretrained model $f_{\text{original}}$.
2. Compute the original safe embeddings with $f_{\text{original}}$ and keep them fixed.
3. For each epoch, execute:
   - Sample mini-batches of safe and unsafe prompts.
   - Compute embeddings for these mini-batches with $f_{\text{current}}$.
   - Identify the closest safe embedding for each unsafe prompt.
   - Calculate both alignment and preservation losses.
   - Update the model using gradient descent.

The iterative process results in a fine-tuned text encoder that effectively implements the DES methodology.
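The loop described above can be sketched end to end on a toy problem. Everything here is an illustrative assumption rather than the paper's implementation: the "text encoder" is a single linear map `W` initialized to the identity, prompts are hypothetical 3-D feature vectors, each epoch uses the full batch instead of sampled mini-batches, targets are taken from the frozen original safe embeddings, and gradients are derived by hand so the example runs with the standard library alone.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def grad_cos_loss(a, b):
    """Gradient of (1 - cos(a, b)) with respect to a, with b held constant."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    c = dot(a, b) / (na * nb)
    return [c * ai / (na * na) - bi / (na * nb) for ai, bi in zip(a, b)]

def encode(W, x):
    """Toy linear stand-in for a text encoder: embedding = W @ x."""
    return [dot(row, x) for row in W]

def total_loss(W, safe_x, unsafe_x, safe_orig, lam_u, lam_s):
    # 1 - max similarity equals 1 - similarity to the closest safe embedding.
    l_u = sum(1 - max(cos_sim(encode(W, x), s) for s in safe_orig)
              for x in unsafe_x) / len(unsafe_x)
    l_s = sum(1 - cos_sim(encode(W, x), o)
              for x, o in zip(safe_x, safe_orig)) / len(safe_x)
    return lam_u * l_u + lam_s * l_s

def train(safe_x, unsafe_x, dim, epochs=300, lr=0.1, lam_u=1.0, lam_s=1.0):
    # The current encoder starts as a copy of the original (identity in this toy).
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    # Original safe embeddings are computed once and then frozen.
    safe_orig = [encode(W, x) for x in safe_x]
    history = []
    for _ in range(epochs):
        history.append(total_loss(W, safe_x, unsafe_x, safe_orig, lam_u, lam_s))
        grad_W = [[0.0] * dim for _ in range(dim)]
        # Alignment term: pull each unsafe embedding toward its closest safe one.
        for x in unsafe_x:
            e_u = encode(W, x)
            e_c = max(safe_orig, key=lambda s: cos_sim(e_u, s))
            g = grad_cos_loss(e_u, e_c)
            for i in range(dim):
                for j in range(dim):
                    grad_W[i][j] += lam_u * g[i] * x[j] / len(unsafe_x)
        # Preservation term: keep safe embeddings near their original values.
        for x, e_o in zip(safe_x, safe_orig):
            g = grad_cos_loss(encode(W, x), e_o)
            for i in range(dim):
                for j in range(dim):
                    grad_W[i][j] += lam_s * g[i] * x[j] / len(safe_x)
        # Plain gradient descent update on the shared encoder weights.
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad_W[i][j]
    history.append(total_loss(W, safe_x, unsafe_x, safe_orig, lam_u, lam_s))
    return W, safe_orig, history

# Hypothetical prompt features: two safe prompts and one unsafe prompt.
safe_x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unsafe_x = [[0.2, 0.1, 1.0]]
W, safe_orig, history = train(safe_x, unsafe_x, dim=3)
print(history[0], history[-1])  # the combined loss should drop substantially
```

Because the unsafe and safe prompts share the same weights, the alignment update alone would also drag the safe embeddings; the preservation term is what counteracts that drift. In practice an autodiff framework and a real text encoder would replace the hand-written gradients and the linear map.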

Conclusion

Distorting Embedding Space offers a defense against adversarial attacks on text-to-image diffusion models. By distorting the embedding space rather than filtering prompts at the surface, DES neutralizes unsafe content generation while preserving the utility of safe prompts. The reported empirical results support its efficacy against MMAs, suggesting that DES is a promising way to improve the security and robustness of diffusion models. Future work could extend DES to evolving attack strategies and explore its integration into broader AI safety frameworks.