
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models (2404.02928v4)

Published 2 Apr 2024 in cs.CR and cs.AI

Abstract: Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, misleading, or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require the presence of a target model, and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align with the malicious concepts semantically in the text embedding space. We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space. We perform extensive experiments with open-source T2I models (e.g., stable-diffusion-v1-4) and closed-source online services (e.g., DALLE2, Midjourney) with black-box safety checkers. Results show that (1) JPA bypasses both text and image safety checkers, (2) preserves high semantic alignment with the target prompt, and (3) runs much faster than previous methods and can be executed in a fully automated manner. These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.

Citations (18)

Summary

  • The paper introduces Jailbreaking Prompt Attack, a method leveraging semantic-guided prompt pairs to breach text-to-image model defenses.
  • It employs a black-box framework with semantic loss optimization to generate adversarial NSFW images while maintaining high text-image relevance.
  • The findings reveal significantly improved attack success and relevance scores, underscoring critical vulnerabilities in current generative AI defenses.

Understanding the Jailbreaking Prompt Attack: A New Adversarial Framework for T2I Models

Introduction to Jailbreaking Prompt Attack (JPA)

The paper introduces the Jailbreaking Prompt Attack (JPA), an adversarial attack method targeting Text-to-Image (T2I) models, particularly those equipped with security defense mechanisms. The method generates semantically rich Not-Safe-for-Work (NSFW) images, exposing vulnerabilities in current T2I defenses. By leveraging the Classifier-Free Guidance (CFG) feature common to these models, JPA manipulates the CLIP text embedding space to conduct attacks without any post-processing, setting it apart from existing attack strategies.
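
As background, classifier-free guidance combines an unconditional and a prompt-conditioned noise prediction at every denoising step, so whatever concept the prompt embedding encodes gets amplified in the generated image. The sketch below shows the standard CFG combination; the function and variable names are illustrative and not taken from the paper's code.

```python
import torch

def cfg_noise_prediction(eps_uncond: torch.Tensor,
                         eps_cond: torch.Tensor,
                         guidance_scale: float = 7.5) -> torch.Tensor:
    """Standard classifier-free guidance: push the prediction further along the
    direction that the prompt-conditioned output points in, relative to the
    unconditional output.

    eps_uncond: noise predicted with an empty prompt
    eps_cond:   noise predicted with the (possibly adversarial) prompt
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Because the guided prediction tracks the prompt embedding so closely, any malicious concept that can be smuggled into that embedding is rendered in the image, which is precisely the surface JPA exploits.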

Core Methodology

JPA leverages a black-box attack framework characterized by the following:

  • Prompt Pairs and Semantic Guidance: Using prompt pairs that represent negative (unsafe) and positive (safe) concepts and averaging the difference of their embeddings, JPA derives a concept direction that guides generation toward content closely aligned with the input prompt yet containing unsafe elements (see the sketch after this list).
  • Semantic Loss Optimization: The framework optimizes the generation of adversarial prompts by maintaining semantic similarity, ensuring that the generated images remain closely related to the original intent of the input prompt.
  • Sensitive-word Exclusion: JPA incorporates a mechanism to exclude sensitive words, thus bypassing prompt filters typically employed by defense mechanisms in T2I models.
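
The following is a minimal sketch of how the prompt-pair concept direction and semantic loss could be implemented. It is an approximation under stated assumptions rather than the authors' code: random tensors stand in for CLIP text embeddings, the helper names (`concept_direction`, `semantic_loss`) are invented for illustration, and the paper's soft assignment with gradient masking is reduced to a plain softmax relaxation of the prefix tokens.

```python
import torch
import torch.nn.functional as F

def concept_direction(unsafe_embeds: torch.Tensor,
                      safe_embeds: torch.Tensor) -> torch.Tensor:
    """Average difference between embeddings of unsafe/safe antonym prompts,
    approximating the target (NSFW) concept direction in text-embedding space."""
    return unsafe_embeds.mean(dim=0) - safe_embeds.mean(dim=0)

def semantic_loss(attacked_embed: torch.Tensor,
                  clean_embed: torch.Tensor,
                  direction: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    """Pull the attacked (prefixed) prompt embedding toward the clean prompt
    shifted along the concept direction, preserving the original semantics."""
    target = clean_embed + alpha * direction
    return 1.0 - F.cosine_similarity(attacked_embed, target, dim=-1).mean()

# --- toy usage with random tensors standing in for CLIP text features ---
d, vocab = 768, 49408                       # CLIP-like embedding dim / vocab size
unsafe = torch.randn(8, d)                  # embeddings of "unsafe" prompts
safe = torch.randn(8, d)                    # embeddings of their safe antonyms
direction = concept_direction(unsafe, safe)

clean = torch.randn(1, d)                   # embedding of the benign target prompt
token_embeds = torch.randn(vocab, d)        # stand-in for the token embedding table
prefix_logits = torch.randn(4, vocab, requires_grad=True)  # 4 learnable prefix tokens

# Soft assignment: a softmax over the vocabulary keeps the prefix differentiable,
# so gradients can flow back to the discrete token choices (a crude stand-in for
# the paper's soft assignment with gradient masking).
soft_prefix = prefix_logits.softmax(dim=-1) @ token_embeds           # (4, d)
attacked = clean + soft_prefix.mean(dim=0, keepdim=True)             # crude "re-encoding"

loss = semantic_loss(attacked, clean, direction)
loss.backward()   # gradients w.r.t. prefix_logits drive the discrete prefix search
```

Optimizing this loss (gradient ascent on the cosine similarity), combined with the sensitive-word exclusion step above, yields an adversarial prefix that avoids filtered terms while semantically pointing at the blocked concept.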

Experimentation and Findings

The validation of JPA involved extensive experiments across various T2I models with defense mechanisms, demonstrating a significantly higher Attack Success Rate (ASR) and Text-Image Relevance Rate (TRR) than competing methods. Importantly, JPA maintains high TRR scores while its attacks succeed, so the generated images not only bypass defense mechanisms but also remain highly relevant to the original prompts. Furthermore, JPA can direct attacks toward specific concepts, such as "African" or "zombie," showcasing its adaptability in exploiting vulnerabilities within T2I models.
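
The paper's exact evaluation code is not reproduced here, but the two metrics can be approximated along the following lines, assuming a black-box `safety_checker` callable and the public CLIP checkpoint named below; both are illustrative assumptions rather than the authors' protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint used purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_image_relevance(images, clean_prompts) -> float:
    """Mean CLIP cosine similarity between generated images and the original
    (clean) prompts -- one plausible reading of a text-image relevance score."""
    inputs = processor(text=clean_prompts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).mean().item()

def attack_success_rate(images, safety_checker) -> float:
    """Fraction of generated images that the safety checker fails to flag;
    `safety_checker` is a hypothetical callable returning True when it blocks."""
    flagged = [bool(safety_checker(im)) for im in images]
    return 1.0 - sum(flagged) / len(flagged)
```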

Implications and Future Directions

JPA's findings underscore inherent vulnerabilities in the text processing of T2I models, suggesting that defense mechanisms remain incomplete unless the textual space itself is secured. The prompt-pair and semantic-guidance techniques in JPA offer a new way of understanding and exploiting implicit text-to-image mapping relationships, and point toward more robust defense strategies. Moving forward, research should explore how text-guided generative models can be hardened against such adversarial attacks.

Conclusion

The introduction of the Jailbreaking Prompt Attack marks a significant shift in how adversarial attacks are conducted on T2I models, offering a new perspective on exploiting the weaknesses of defense mechanisms through the textual space. By revealing the limitations of current defense strategies, JPA provides valuable insights for advancing the security and reliability of generative AI technologies.