Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation (2402.12100v1)
Abstract: With the prevalence of text-to-image generative models, their safety has become a critical concern. Adversarial testing techniques have been developed to probe whether such models can be prompted to produce Not-Safe-For-Work (NSFW) content. However, existing solutions face several challenges, including low success rates and inefficiency. We introduce Groot, the first automated framework that leverages tree-based semantic transformation for adversarial testing of text-to-image models. Groot employs semantic decomposition and sensitive-element drowning strategies in conjunction with LLMs to systematically refine adversarial prompts. Our comprehensive evaluation confirms the efficacy of Groot, which not only exceeds the performance of current state-of-the-art approaches but also achieves a remarkable success rate (93.66%) on leading text-to-image models such as DALL-E 3 and Midjourney.
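The abstract's pipeline — decompose a prompt into a tree of semantic elements, then "drown" the sensitive ones among benign fillers before reassembly — can be sketched roughly as follows. This is a hedged illustration only, not the paper's actual algorithm: the real framework drives decomposition and refinement with LLMs, whereas the helpers here (`decompose`, `drown`, `reassemble`) are hypothetical stand-ins operating on hand-supplied parts.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the semantic tree: a text fragment plus child fragments."""
    text: str
    children: list = field(default_factory=list)


def decompose(prompt: str, parts: list[str]) -> Node:
    """Split a prompt into semantic sub-elements.

    In Groot this step would be performed by an LLM; here the caller
    supplies the decomposition explicitly (an assumption for the sketch).
    """
    root = Node(prompt)
    root.children = [Node(p) for p in parts]
    return root


def drown(root: Node, fillers: list[str]) -> Node:
    """Interleave benign filler elements among the sub-elements,
    diluting the relative weight of any sensitive fragment."""
    new_children: list[Node] = []
    for child in root.children:
        new_children.append(child)
        new_children.extend(Node(f) for f in fillers)
    root.children = new_children
    return root


def reassemble(root: Node) -> str:
    """Flatten the transformed tree back into a single prompt string."""
    return ", ".join(c.text for c in root.children)
```

Used iteratively, the tree could be re-decomposed and re-drowned until the target model accepts the prompt, which is the "systematic refinement" loop the abstract alludes to.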
- 2023. ChatGPT. https://chat.openai.com/. (2023).
- 2023. A Comprehensive Overview of Large Language Models. arXiv (2023). https://arxiv.org/abs/2307.06435.
- 2023. Content policy | DALL·E. https://labs.openai.com/policies/content-policy. (2023).
- 2023. DALL·E 3. https://openai.com/dall-e-3. (2023).
- 2023. Groot. https://sites.google.com/view/text-to-image-testing. (2023).
- 2023. Midjourney. https://www.midjourney.com/. (2023).
- 2023. "NSFWGPT" prompt thread. Reddit. https://www.reddit.com/r/ChatGPT/comments/11vlp7j/nsfwgpt_that_nsfw_prompt/. (2023).
- 2023. Shader - Wikipedia. https://en.wikipedia.org/wiki/Shader. (2023).
- 2023. Stable Diffusion — Stability AI. https://stability.ai/stable-diffusion. (2023).
- 2023. Vertex AI. https://cloud.google.com/vertex-ai. (2023).
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. (2023). arXiv:cs.CR/2307.08715
- Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 6174–6181. https://doi.org/10.18653/v1/2020.emnlp-main.498
- Ming Jiang and Jana Diesner. 2019. A Constituency Parsing Tree based Method for Relation Extraction from. EMNLP-IJCNLP 2019 (2019), 186.
- Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (Apr. 2020), 8018–8025. https://doi.org/10.1609/aaai.v34i05.6311
- TextBugger: Generating Adversarial Text Against Real-world Applications. In Proceedings 2019 Network and Distributed System Security Symposium (NDSS 2019). Internet Society. https://doi.org/10.14722/ndss.2019.23138
- Translation with source constituency and dependency trees. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1066–1076.
- Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. (2023). arXiv:cs.CV/2305.13873
- Red-Teaming the Stable Diffusion Safety Filter. (2022). arXiv:cs.AI/2210.04610
- A Survey on Techniques in NLP. International Journal of Computer Applications 134, 8 (2016), 6–9.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
- SneakyPrompt: Jailbreaking Text-to-image Generative Models. (2023). arXiv:cs.LG/2305.12082
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023).
- JADE: A Linguistics-based Safety Evaluation Platform for LLM. arXiv preprint arXiv:2311.00286 (2023).