ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning (2405.19237v1)
Abstract: While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning, or rely on various forms of token remapping, which leaves them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, in which we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby enabling straightforward concept unlearning via weight pruning. Experiments across a range of tasks, including artistic-style removal, nudity suppression, object erasure, and gender debiasing, demonstrate that target concepts can be erased efficiently by pruning a tiny fraction (approximately 0.12%) of the total weights, which enables multi-concept erasure and provides robustness against various white-box and black-box adversarial attacks.
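The mechanism the abstract describes reduces to two steps: score neurons by how strongly they specialize in the target concept, then zero the weights that carry those neurons' outputs forward. Below is a minimal PyTorch sketch of that identify-then-prune recipe, assuming skilled neurons are feed-forward units whose mean activation (collected via forward hooks on the frozen model) is higher on concept prompts than on neutral reference prompts; the function names, the activation-gap scoring rule, and the layer shapes are illustrative stand-ins, not the paper's exact procedure.

```python
# Hedged sketch of skilled-neuron pruning. All identifiers here
# (skill_scores, prune_skilled_neurons, the toy layer sizes) are
# illustrative; only the overall identify-then-prune recipe and the
# ~0.12% sparsity figure come from the abstract.
import torch
import torch.nn as nn


def skill_scores(acts_concept: torch.Tensor, acts_reference: torch.Tensor) -> torch.Tensor:
    """Score each hidden neuron by its mean activation gap.

    Both inputs are (num_samples, num_neurons) tensors of post-activation
    values, e.g. recorded with forward hooks on an FFN layer while running
    concept prompts vs. neutral prompts through the frozen model.
    """
    gap = acts_concept.mean(dim=0) - acts_reference.mean(dim=0)
    return gap.clamp(min=0.0)  # keep only neurons that fire more on the concept


def prune_skilled_neurons(down_proj: nn.Linear, scores: torch.Tensor, sparsity: float) -> None:
    """Zero, in place, the down-projection columns of the top-scoring neurons."""
    k = int(round(sparsity * scores.numel()))
    if k == 0:
        return
    top = torch.topk(scores, k).indices
    with torch.no_grad():
        # nn.Linear stores weight as (out_features, in_features), so column j
        # carries everything hidden neuron j contributes downstream.
        down_proj.weight[:, top] = 0.0


if __name__ == "__main__":
    # Toy stand-ins for hook-collected activations of one FFN layer.
    hidden = 1280
    down_proj = nn.Linear(hidden, 320)
    acts_concept = torch.rand(64, hidden) + 0.5   # fires harder on the concept
    acts_reference = torch.rand(64, hidden)
    scores = skill_scores(acts_concept, acts_reference)
    prune_skilled_neurons(down_proj, scores, sparsity=0.0012)  # ~0.12% of neurons
```

In a full pipeline, scores of this kind would be gathered per feed-forward block of the diffusion UNet over a handful of denoising runs, and the resulting mask applied once to the pre-trained weights; consistent with the abstract, no fine-tuning is involved, which is what makes the approach training-free.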