SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models (2404.06666v3)
Abstract: Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexually explicit scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or on suppressing improper text embeddings; these can block explicitly sexual content (e.g., nudity) but remain vulnerable to adversarial prompts -- inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate sexual content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate explicit visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts because such unsafe visual representations are obstructed from within. Extensive experiments on four datasets and large-scale user studies demonstrate SafeGen's effectiveness in mitigating sexually explicit content generation while preserving the high fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.4% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for the future development and evaluation of anti-NSFW-generation methods.
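To illustrate the weakness of the input-filtering baselines the abstract contrasts SafeGen against, consider a toy blocklist-based prompt filter. This sketch is purely illustrative (the blocklist and function names are our own, not from the paper or any deployed filter); it shows how an adversarial paraphrase built from innocent words evades keyword matching, which is exactly the gap a text-agnostic defense targets:

```python
import re

# Toy keyword blocklist, standing in for real NSFW text classifiers.
# Purely illustrative -- NOT the SafeGen method, which instead removes
# unsafe visual representations from the model itself.
BLOCKLIST = {"naked", "nude", "nsfw", "explicit"}

def is_flagged(prompt: str) -> bool:
    """Flag a prompt if any blocklisted token appears as a whole word."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return any(tok in BLOCKLIST for tok in tokens)

# A plainly explicit prompt is caught by the blocklist...
print(is_flagged("a naked person on a beach"))
# ...but an adversarial paraphrase using only innocent words is not,
# even though it may steer the model toward the same unsafe output.
print(is_flagged("an artistic study of the human form, uncovered"))
```

Because SafeGen operates on the model's visual representations rather than on the prompt text, it is insensitive to such rephrasings by construction.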
Authors: Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu