
Harnessing LLM to Attack LLM-Guarded Text-to-Image Models (2312.07130v4)

Published 12 Dec 2023 in cs.AI

Abstract: To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that attempt to bypass these filters, but they have become ineffective as nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at this link.


Summary

  • The paper introduces the Divide-and-Conquer Attack (DACA) that splits sensitive prompts into harmless parts to bypass safety filters.
  • The study demonstrates that LLMs such as GPT-4 can effectively rephrase and recombine prompts to generate images that retain the sensitive intent.
  • The work underscores significant ethical challenges and the need for enhanced red teaming to fortify generative model defenses against adversarial attacks.

Overview of the Divide-and-Conquer Attack

This paper introduces an approach called the Divide-and-Conquer Attack (DACA), designed to overcome the safety filters of advanced text-to-image generative models such as DALL·E 3. The method uses LLMs to transform sensitive text prompts into seemingly innocuous versions that can nevertheless lead to the generation of images with sensitive content. It does so by breaking the sensitive prompt down into discrete, non-threatening components, which the LLM then reassembles into a new prompt capable of slipping past the safety mechanisms.

Strategy Behind the Attack

DACA operates by guiding existing LLMs to strategically interpret and rephrase sensitive content. The transformed prompts can bypass safety filters, which act as binary classifiers that block image generation from sensitive prompts. The Divide-and-Conquer procedure is two-fold: first, the LLM decomposes the sensitive drawing intent into harmless descriptions of individual visual components using helper prompts; then it reassembles these descriptions into a complete prompt that evades the safety filters yet recovers the sensitive content once the components are combined in the generated image.
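
To make the two steps concrete, below is a minimal sketch of how such a decompose-then-reassemble pipeline could be wired together, assuming a GPT-4 backbone reached through the OpenAI Python SDK; the helper prompts and function names are illustrative placeholders, not the paper's actual templates.

```python
# Minimal sketch of a divide-and-conquer prompt transformation.
# Assumes a GPT-4 backbone via the OpenAI Python SDK; the helper prompts
# below are illustrative, not the paper's actual templates.
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_llm(prompt: str) -> str:
    """Send a single user prompt to the backbone LLM and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def decompose(sensitive_prompt: str) -> List[str]:
    """Divide: describe each visual element of the scene in neutral wording."""
    reply = call_llm(
        "List the individual visual elements of the following scene, "
        "describing each one in neutral, harmless wording, one per line:\n"
        f"{sensitive_prompt}"
    )
    return [line.lstrip("-* ").strip() for line in reply.splitlines() if line.strip()]


def reassemble(elements: List[str]) -> str:
    """Conquer: merge the benign descriptions back into one drawing prompt."""
    bullets = "\n".join(f"- {e}" for e in elements)
    return call_llm(
        "Combine the following visual descriptions into a single coherent, "
        "detailed drawing prompt:\n" + bullets
    )


def adversarial_prompt(sensitive_prompt: str) -> str:
    return reassemble(decompose(sensitive_prompt))
```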

Execution and Effectiveness

The effectiveness of the Divide-and-Conquer Attack was tested with various state-of-the-art LLMs as backbones, with GPT-4 achieving the highest success rate in bypassing the safety filters. The adversarial prompts bypass DALL·E 3 and Midjourney at rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. Importantly, the images generated during these attacks retained a high level of semantic similarity to the original sensitive intent that the safety filters aim to block, challenging the robustness of the safety measures in current generative systems.
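
The snippet below is not the authors' evaluation code; it is a small illustration, under the assumption that each attack trial records whether the adversarial prompt evaded the filter and whether the resulting image kept the original intent, of how one-time and re-use success rates could be tallied separately.

```python
# Illustrative bookkeeping for the two success metrics, assuming each trial
# records filter evasion and intent preservation. Not the paper's code.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    bypassed_filter: bool   # did the prompt evade the safety filter?
    intent_preserved: bool  # did the image retain the sensitive intent?


def success_rate(trials: Iterable[Trial]) -> float:
    """Fraction of trials that both bypass the filter and keep the intent."""
    trials = list(trials)
    if not trials:
        return 0.0
    hits = sum(t.bypassed_filter and t.intent_preserved for t in trials)
    return hits / len(trials)


# Example: keep separate tallies for one-time and re-use attacks.
one_time = [Trial(True, True), Trial(False, False), Trial(True, False)]
re_use = [Trial(True, True), Trial(True, True), Trial(True, False)]
print(f"one-time: {success_rate(one_time):.0%}, re-use: {success_rate(re_use):.0%}")
```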

Implications and Ethical Considerations

This attack highlights a paradoxical situation: LLMs can subvert the very safety measures they are employed to reinforce. The researchers point out the security implications and stress that the ongoing interplay between attacks and defenses deserves sustained attention. The paper concludes by suggesting that the strategy could double as a red-teaming tool for rapidly identifying vulnerabilities in text-to-image models, which is crucial for aligning AI outputs with human ethical standards. The code and data for the Divide-and-Conquer Attack are publicly available to help the research community design robust defenses against such adversarial strategies.
