
Robust Concept Erasure Using Task Vectors (2404.03631v2)

Published 4 Apr 2024 in cs.CV

Abstract: With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that, compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs not seen during training. However, TV-based erasure can also degrade the core performance of the edited model, particularly when the required edit strength is unknown. To address this, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds, within the model's input space, a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in this set makes our estimate more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit to only a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
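The core Task Vector edit the abstract builds on can be sketched as follows. This is a minimal illustration, not the paper's implementation: a task vector is the parameter delta obtained by fine-tuning the model on the target concept, and erasure subtracts a scaled copy of that delta from the base weights. The scaling factor `alpha` stands in for the edit strength that the paper's Diverse Inversion procedure estimates; all names here are illustrative.

```python
import numpy as np

def task_vector(base, finetuned):
    # Task vector: parameter delta from fine-tuning on the target concept,
    # computed per weight tensor.
    return {k: finetuned[k] - base[k] for k in base}

def erase_concept(base, tv, alpha):
    # Unconditional erasure: subtract the scaled task vector from the base
    # weights (negation in task arithmetic). alpha is the edit strength.
    return {k: base[k] - alpha * tv[k] for k in base}

# Toy "model" with two weight tensors standing in for real layers.
base = {"w1": np.array([1.0, 2.0]), "w2": np.array([0.5])}
ft   = {"w1": np.array([1.5, 2.5]), "w2": np.array([1.5])}

tv = task_vector(base, ft)
edited = erase_concept(base, tv, alpha=1.0)
print(edited["w1"])  # [0.5 1.5]
```

With `alpha = 1.0` the edit exactly reverses the fine-tuning delta; larger values erase more aggressively at the cost of core model performance, which is why estimating `alpha` well matters. The paper's final contribution, restricting the edit to a subset of the weights, corresponds to applying `erase_concept` only to selected keys of the weight dictionary.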
