Perception-guided Jailbreak against Text-to-Image Models (2408.10848v4)
Abstract: In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged because of their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can evoke similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no access to a specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we identify a safe phrase that is perceptually similar to, yet semantically distinct from, the target unsafe word, and use it as a substitution. Experiments on six open-source models and commercial online services, using thousands of prompts, verify the effectiveness of PGJ.
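As a rough illustration of the substitution idea described in the abstract (a sketch, not the authors' implementation), the pipeline can be thought of as: given an unsafe target word, obtain from an LLM a safe phrase that evokes a similar human perception, then splice that phrase into the prompt. In the sketch below, the `PERCEPTION_SUBSTITUTES` table and its example entries are hypothetical placeholders standing in for the LLM query.

```python
# Hypothetical sketch of perception-guided substitution (PGJ-style).
# In the actual method, an LLM proposes a safe phrase whose human
# perception resembles the unsafe target word; here a static table
# stands in for that LLM call purely for illustration.
PERCEPTION_SUBSTITUTES = {
    # benign illustrative pairings: semantically different, perceptually similar
    "blood": "red viscous liquid",
    "gun": "black metal L-shaped object",
}

def build_attack_prompt(prompt: str, unsafe_word: str) -> str:
    """Replace the unsafe word with a perception-similar safe phrase."""
    substitute = PERCEPTION_SUBSTITUTES.get(unsafe_word)
    if substitute is None:
        raise KeyError(f"no substitute known for {unsafe_word!r}")
    return prompt.replace(unsafe_word, substitute)

print(build_attack_prompt("a man covered in blood", "blood"))
# → a man covered in red viscous liquid
```

Because the substitution is performed entirely on the prompt text, no query to the target T2I model is needed, which is what makes the approach black-box and model-free.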