Robust Concept Erasure Using Task Vectors (2404.03631v2)
Abstract: With the rapid growth of text-to-image models, a variety of techniques have been proposed to prevent undesirable image generations. Yet these methods often protect only against specific user prompts and have been shown to permit unsafe generations under other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model, rather than conditioning the erasure on the user's prompt. We first show that, compared to input-dependent erasure methods, concept erasure using Task Vectors (TV) is more robust to unexpected user inputs not seen during training. However, TV-based erasure can also degrade the core performance of the edited model, particularly when the required edit strength is unknown. To address this, we propose Diverse Inversion, a method for estimating the required strength of the TV edit. Diverse Inversion finds, within the model's input space, a large set of word embeddings, each of which induces generation of the target concept. We find that encouraging diversity in this set makes our estimate more robust to unexpected prompts. Finally, we show that Diverse Inversion enables applying the TV edit to only a subset of the model weights, enhancing erasure while better preserving the model's core functionality.
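The TV edit itself is simple weight arithmetic: fine-tune a copy of the model to amplify the target concept, take the weight difference as the task vector, and subtract a scaled version of it from the original weights. Below is a minimal sketch of this negation step in PyTorch; the helper names (`task_vector`, `erase_concept`) and the state-dict interface are illustrative assumptions, not the authors' code, and the edit strength `alpha` is exactly the quantity the paper estimates with Diverse Inversion.

```python
import copy
import torch

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """Task vector: fine-tuned minus pretrained weights.

    Here the (hypothetical) fine-tuning run is assumed to have
    amplified the concept that we want to erase.
    """
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def erase_concept(pretrained: dict, tv: dict, alpha: float,
                  edit_keys=None) -> dict:
    """Negate the task vector: theta_edited = theta - alpha * tv.

    If `edit_keys` is given, only that subset of the weights is edited,
    mirroring the paper's partial-weight TV edit; otherwise every key
    in the task vector is edited.
    """
    edited = copy.deepcopy(pretrained)
    with torch.no_grad():
        for k in (edit_keys if edit_keys is not None else tv.keys()):
            edited[k] = edited[k] - alpha * tv[k]
    return edited
```

In this picture, choosing `alpha` is the hard part: too small, and adversarial prompts can still elicit the concept; too large, and unrelated generations degrade. Diverse Inversion supplies a diverse probe set of concept-inducing word embeddings, so one can scan candidate `alpha` values (and candidate weight subsets) and keep the mildest edit under which none of the probe embeddings regenerates the concept.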