Harnessing LLM to Attack LLM-Guarded Text-to-Image Models (2312.07130v4)
Abstract: To prevent Text-to-Image (T2I) models from generating unethical images, people deploy safety filters that block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that bypass these filters, but such attacks have become ineffective because nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of its individual visual components can yield an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automate this rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at this link.
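The abstract describes a two-stage idea: an LLM first decomposes a drawing intent into several neutral, per-component descriptions, and the rewritten prompt is then submitted to the T2I model. The following is a minimal sketch of that flow, assuming the `openai` Python client (>=1.0) and an API key in the environment; the system prompt, model names, and single rephrasing call are illustrative placeholders and do not reproduce the paper's actual DACA agents or prompts.

```python
# Minimal sketch of the decomposition idea: rephrase an intent into benign
# component descriptions with an LLM, then submit the result to a T2I model.
# NOTE: the instructions below are illustrative, not the paper's DACA prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REPHRASE_INSTRUCTIONS = (
    "Rewrite the user's drawing intent as a scene description made of several "
    "neutral sentences, one per visual component (characters, clothing, "
    "setting, actions, style), preserving the visual content."
)

def rephrase_intent(drawing_intent: str) -> str:
    """Ask an LLM to split a drawing intent into benign component descriptions."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": REPHRASE_INSTRUCTIONS},
            {"role": "user", "content": drawing_intent},
        ],
    )
    return response.choices[0].message.content

def generate_image(prompt: str) -> str:
    """Submit the rephrased prompt to a T2I endpoint and return the image URL."""
    result = client.images.generate(model="dall-e-3", prompt=prompt)
    return result.data[0].url
```

In the paper's full method, the single `rephrase_intent` call above is replaced by a pipeline of cooperating LLM agents, and the re-use attack simply resubmits a previously successful rephrased prompt.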