GuardT2I: Defending Text-to-Image Models from Adversarial Prompts (2403.01446v2)
Abstract: Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) content, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of making a binary classification, GuardT2I utilizes an LLM to conditionally transform the text-guidance embeddings within the T2I model into natural language, enabling effective adversarial prompt detection without compromising the model's inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions such as OpenAI-Moderation and Microsoft Azure Content Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.
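The generative moderation idea in the abstract can be sketched as a two-stage check on the LLM's natural-language "interpretation" of the embedding. This is a minimal illustrative sketch, not the paper's implementation: `decoded_interpretation` stands in for the output of the conditional LLM decoder, the Jaccard `similarity` is a toy stand-in for a learned sentence-similarity model, and `NSFW_WORDS` is a hypothetical verbalizer word list.

```python
# Hedged sketch of a GuardT2I-style check. Assumptions (not from the paper's
# code): the conditional LLM has already decoded the T2I text embedding into
# `decoded_interpretation`; similarity is a toy token-Jaccard stand-in for a
# sentence-similarity model; NSFW_WORDS is an illustrative verbalizer list.

def similarity(a: str, b: str) -> float:
    """Toy stand-in for a sentence-similarity score (e.g. Sentence-BERT)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

NSFW_WORDS = {"nude", "gore"}  # illustrative keyword list, not exhaustive

def moderate(user_prompt: str, decoded_interpretation: str,
             sim_threshold: float = 0.5) -> bool:
    """Return True if the prompt should be rejected.

    Stage 1: keyword check on the decoded interpretation -- adversarial
    prompts often decode into plainly inappropriate natural language even
    when the surface prompt looks innocuous.
    Stage 2: if the interpretation diverges strongly from the surface
    prompt, treat the prompt as adversarial and reject it.
    """
    if any(w in decoded_interpretation.lower().split() for w in NSFW_WORDS):
        return True
    return similarity(user_prompt, decoded_interpretation) < sim_threshold
```

A benign prompt whose interpretation stays close to its surface text passes both stages, while a gibberish adversarial prompt fails either the keyword check or the similarity check.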
- Yijun Yang
- Ruiyuan Gao
- Xiao Yang
- Jianyuan Zhong
- Qiang Xu