
Harnessing LLM to Attack LLM-Guarded Text-to-Image Models (2312.07130v4)

Published 12 Dec 2023 in cs.AI

Abstract: To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters to block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that attempt to bypass these filters, but they have become ineffective as nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically complete the intended rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at this link.


Summary

  • The paper introduces the Divide-and-Conquer Attack (DACA) that splits sensitive prompts into harmless parts to bypass safety filters.
  • The study demonstrates that LLMs such as GPT-4 can effectively rephrase and recombine prompts to generate images that retain the sensitive intent.
  • The work underscores significant ethical challenges and the need for enhanced red teaming to fortify generative model defenses against adversarial attacks.

Overview of the Divide-and-Conquer Attack

This paper introduces an approach called the Divide-and-Conquer Attack (DACA), designed to overcome the safety filters of advanced text-to-image generative models such as DALL·E 3. The method uses LLMs to transform sensitive text prompts into seemingly innocuous versions that can nevertheless lead to the generation of images with sensitive content. It does so by breaking the sensitive prompt down into discrete, non-threatening components, which the LLM then reassembles into a new prompt capable of slipping past the safety mechanisms.

Strategy Behind the Attack

DACA operates by guiding existing LLMs to strategically interpret and rephrase sensitive content. The transformed prompts can bypass safety filters, which act as binary classifiers that block image generation from sensitive prompts. The Divide-and-Conquer procedure is two-fold: first, the LLM decomposes the sensitive drawing intent into harmless descriptions of individual visual components using helper prompts; then it reassembles these descriptions into a complete prompt that evades the safety filters yet recovers the sensitive content once the components are combined in the generated image.
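
To make the two steps concrete, below is a minimal sketch of how such a decompose-then-reassemble pipeline could be wired together, assuming a GPT-4 backbone reached through the OpenAI Python SDK; the helper prompts and function names are illustrative placeholders, not the paper's actual templates.

```python
# Minimal sketch of a divide-and-conquer prompt transformation.
# Assumes a GPT-4 backbone via the OpenAI Python SDK; the helper prompts
# below are illustrative, not the paper's actual templates.
from typing import List

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_llm(prompt: str) -> str:
    """Send a single user prompt to the backbone LLM and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def decompose(sensitive_prompt: str) -> List[str]:
    """Divide: describe each visual element of the scene in neutral wording."""
    reply = call_llm(
        "List the individual visual elements of the following scene, "
        "describing each one in neutral, harmless wording, one per line:\n"
        f"{sensitive_prompt}"
    )
    return [line.lstrip("-* ").strip() for line in reply.splitlines() if line.strip()]


def reassemble(elements: List[str]) -> str:
    """Conquer: merge the benign descriptions back into one drawing prompt."""
    bullets = "\n".join(f"- {e}" for e in elements)
    return call_llm(
        "Combine the following visual descriptions into a single coherent, "
        "detailed drawing prompt:\n" + bullets
    )


def adversarial_prompt(sensitive_prompt: str) -> str:
    return reassemble(decompose(sensitive_prompt))
```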

Execution and Effectiveness

The effectiveness of the Divide-and-Conquer Attack was tested with various state-of-the-art LLMs as backbones, with GPT-4 achieving the highest success rate in bypassing the safety filters. The adversarial prompts bypass DALL·E 3 and Midjourney at rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. Importantly, the images generated during these attacks retained a high level of semantic similarity to the original sensitive intent that the safety filters aim to block, challenging the robustness of the safety measures in current generative systems.
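
The snippet below is not the authors' evaluation code; it is a small illustration, under the assumption that each attack trial records whether the adversarial prompt evaded the filter and whether the resulting image kept the original intent, of how one-time and re-use success rates could be tallied separately.

```python
# Illustrative bookkeeping for the two success metrics, assuming each trial
# records filter evasion and intent preservation. Not the paper's code.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    bypassed_filter: bool   # did the prompt evade the safety filter?
    intent_preserved: bool  # did the image retain the sensitive intent?


def success_rate(trials: Iterable[Trial]) -> float:
    """Fraction of trials that both bypass the filter and keep the intent."""
    trials = list(trials)
    if not trials:
        return 0.0
    hits = sum(t.bypassed_filter and t.intent_preserved for t in trials)
    return hits / len(trials)


# Example: keep separate tallies for one-time and re-use attacks.
one_time = [Trial(True, True), Trial(False, False), Trial(True, False)]
re_use = [Trial(True, True), Trial(True, True), Trial(True, False)]
print(f"one-time: {success_rate(one_time):.0%}, re-use: {success_rate(re_use):.0%}")
```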

Implications and Ethical Considerations

This attack highlights a paradoxical situation: LLMs can subvert the very safety measures they are employed to reinforce. The researchers point out the security implications and stress that the ongoing interplay between attacks and defenses deserves sustained attention. The paper concludes by suggesting that the strategy could double as a red-teaming tool for rapidly identifying vulnerabilities in text-to-image models, which is crucial for aligning AI outputs with human ethical standards. The code and data for the Divide-and-Conquer Attack are publicly available to help the research community design robust defenses against such adversarial strategies.
