- The paper introduces DES, a novel mechanism that distorts unsafe embeddings to align them with safe ones, improving defense against multimodal attacks.
- It details a training algorithm combining alignment and preservation losses to fine-tune text encoders for enhanced diffusion model security.
- Empirical results demonstrate DES's effectiveness in neutralizing adversarial prompts while preserving safe content generation.
Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models
Introduction
The paper introduces a novel defense mechanism, Distortion Embedding Space (DES), designed to address vulnerabilities in text-to-image diffusion models such as Stable Diffusion and DALL·E, which have been misused to generate unsafe content. Traditional methods like prompt filtering struggle against Multimodal Attacks (MMA) that exploit both textual and visual modalities. DES instead operates directly in the embedding space, which the paper presents as an improvement over conventional prompt filtering techniques.
Distortion Embedding Space (DES) Mechanism
Unsafe Prompt Neutralization
DES addresses unsafe prompts by distorting their embeddings into a controlled, safe region of the embedding space. This process is formalized through an alignment loss, in which each unsafe embedding $e_{u,j}$ is aligned to its nearest safe embedding $e_{c,j}$. The alignment loss is defined as:
$$L_u = \frac{1}{M} \sum_{j=1}^{M} \left( 1 - \frac{e_{u,j} \cdot e_{c,j}}{\|e_{u,j}\| \, \|e_{c,j}\|} \right),$$
where $e_{c,j}$ is the closest safe embedding found within the set of safe embeddings $C_s$.
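The alignment loss can be illustrated with a small NumPy sketch. It assumes that "closest" means highest cosine similarity within the safe set, which the summary does not state explicitly; the function names and the toy 2-D embeddings are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (M, d) and rows of b (K, d)."""
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a_unit @ b_unit.T


def alignment_loss(unsafe: np.ndarray, safe_set: np.ndarray) -> float:
    """Mean cosine distance from each unsafe embedding to its closest safe embedding.

    Assumption: the closest safe embedding is the one with the highest
    cosine similarity among the rows of safe_set (the set C_s).
    """
    sims = cosine_similarity_matrix(unsafe, safe_set)  # shape (M, K)
    nearest = sims.max(axis=1)                          # cosine similarity to e_{c,j}
    return float(np.mean(1.0 - nearest))


# Toy 2-D embeddings, hypothetical values for illustration only.
safe_set = np.array([[1.0, 0.0], [0.0, 1.0]])
unsafe = np.array([[2.0, 0.0], [1.0, 1.0]])
print(alignment_loss(unsafe, safe_set))  # roughly 0.146
```

The first unsafe vector already points in a safe direction and contributes zero; the second sits at 45 degrees from both safe vectors and contributes $1 - 1/\sqrt{2}$, so the mean is about 0.146.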
Preservation of Safe Prompts
Simultaneously, DES ensures that safe embeddings maintain their structure and functionality. This is achieved through a preservation loss function that minimizes the distance between original and fine-tuned safe embeddings:
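$$L_s = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{e_{s,i}^{\text{current}} \cdot e_{s,i}^{\text{original}}}{\|e_{s,i}^{\text{current}}\| \, \|e_{s,i}^{\text{original}}\|} \right).$$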
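A minimal NumPy sketch of the preservation loss follows. It assumes the current and original safe embeddings are stored row by row in matching order; the function name and the toy values are hypothetical.

```python
import numpy as np


def preservation_loss(current: np.ndarray, original: np.ndarray, eps: float = 1e-12) -> float:
    """Mean cosine distance between row-matched current and original safe embeddings."""
    dots = np.sum(current * original, axis=1)
    norms = np.linalg.norm(current, axis=1) * np.linalg.norm(original, axis=1)
    return float(np.mean(1.0 - dots / np.maximum(norms, eps)))


# Toy 2-D embeddings, hypothetical values for illustration only.
original = np.array([[1.0, 0.0], [0.0, 2.0]])
unchanged = original * 3.0                    # rescaled copies: same directions
rotated = np.array([[0.0, 1.0], [0.0, 2.0]])  # first row rotated by 90 degrees

print(preservation_loss(unchanged, original))  # 0.0
print(preservation_loss(rotated, original))    # 0.5
```

Because the loss is built on cosine similarity, it penalizes changes in the direction of a safe embedding and ignores changes in its magnitude, as the rescaled example shows.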
Adversarial Robustness
The DES framework is engineered to secure the embedding space against adversaries manipulating both the textual and visual input. It notably enhances the system's resilience to white-box attacks by focusing on embedding space transformations.
Training Algorithm
The training process for implementing DES involves fine-tuning a pretrained text encoder $f_{\text{original}}$ using a combination of alignment and preservation losses. The algorithm iteratively updates the model to neutralize unsafe embeddings while preserving the integrity of safe ones. The combined loss function is weighted by hyperparameters $\lambda_u$ and $\lambda_s$ to balance alignment and preservation.
The detailed algorithm is as follows:
- Initialize the current model $f_{\text{current}}$ as a copy of the pretrained model.
- Compute original safe embeddings.
- For each epoch, execute:
  - Sample mini-batches of safe and unsafe prompts.
  - Compute embeddings for these mini-batches.
  - Identify the closest safe embedding for each unsafe prompt.
  - Calculate both alignment and preservation losses.
  - Update the model using gradient descent.
The iterative process results in a fine-tuned text encoder that effectively implements the DES methodology.
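The steps above can be sketched as a short PyTorch loop. This is an illustration under several assumptions, not the paper's implementation: PyTorch is my choice of framework, the combined loss is taken to be the plain weighted sum $\lambda_u L_u + \lambda_s L_s$, the closest safe embeddings are looked up among the frozen original safe embeddings, one mini-batch pair is drawn per epoch for brevity, and a linear layer over random feature vectors stands in for a real text encoder and tokenized prompts. The name `des_finetune` and all default hyperparameters are hypothetical.

```python
import copy

import torch
import torch.nn.functional as F


def des_finetune(f_original, safe_x, unsafe_x, lambda_u=1.0, lambda_s=1.0,
                 epochs=5, batch_size=4, lr=1e-2):
    """Fine-tune a copy of f_original with alignment and preservation losses."""
    f_current = copy.deepcopy(f_original)      # initialize as a copy of the pretrained model
    with torch.no_grad():
        e_s_original = f_original(safe_x)      # original safe embeddings, kept fixed
    optimizer = torch.optim.SGD(f_current.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        # Sample one mini-batch of safe and one of unsafe inputs (simplification).
        s_idx = torch.randperm(safe_x.shape[0])[:batch_size]
        u_idx = torch.randperm(unsafe_x.shape[0])[:batch_size]
        e_s = f_current(safe_x[s_idx])
        e_u = f_current(unsafe_x[u_idx])
        # Closest safe embedding per unsafe embedding, by cosine similarity.
        # Assumption: targets come from the frozen original safe embeddings.
        sims = F.normalize(e_u, dim=1) @ F.normalize(e_s_original, dim=1).T
        e_c = e_s_original[sims.argmax(dim=1)]
        loss_u = (1 - F.cosine_similarity(e_u, e_c, dim=1)).mean()
        loss_s = (1 - F.cosine_similarity(e_s, e_s_original[s_idx], dim=1)).mean()
        loss = lambda_u * loss_u + lambda_s * loss_s   # assumed weighted sum
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                               # gradient descent update
        history.append(loss.item())
    return f_current, history


torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)   # toy stand-in for a pretrained text encoder
safe_x = torch.randn(16, 8)       # hypothetical features for safe prompts
unsafe_x = torch.randn(16, 8)     # hypothetical features for unsafe prompts
tuned, history = des_finetune(encoder, safe_x, unsafe_x)
print(history)
```

Keeping the original safe embeddings outside the autograd graph means gradients flow only through the current encoder's outputs, so the unsafe embeddings are pulled toward fixed safe targets while the safe embeddings are anchored to their original directions.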
Conclusion
The introduction of the Distortion Embedding Space presents a significant advancement in defending against adversarial attacks targeting text-to-image diffusion models. By focusing on embedding space distortions rather than superficial filtering strategies, DES offers a robust defense framework that effectively neutralizes unsafe content generation while preserving the utility of safe prompts. The empirical results underline its efficacy against MMAs, suggesting that DES is a promising solution for enhancing the security and robustness of diffusion models against adversarial exploitation. Future work could extend DES to address evolving attack strategies and probe its integration into broader AI safety frameworks.