
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Published 29 Jan 2025 in cs.LG and cs.AI | (2501.18052v3)

Abstract: Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concepts and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at https://github.com/cywinski/SAeUron.

Summary

  • The paper presents a novel approach leveraging sparse autoencoders for interpretable concept unlearning in diffusion models.
  • The method identifies and ablates concept-specific features at cross-attention blocks, preserving overall model performance.
  • Experimental results on the UnlearnCanvas benchmark demonstrate state-of-the-art unlearning performance and robustness against adversarial attacks.


The paper "SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders" (2501.18052) introduces SAeUron, a novel approach for concept unlearning in text-to-image diffusion models using sparse autoencoders (SAEs). Unlike traditional fine-tuning methods, SAeUron leverages the interpretable features learned by SAEs to precisely remove unwanted concepts while preserving the model's overall performance. This approach offers a transparent way to analyze and block concepts prior to unlearning, enhancing interpretability and robustness against adversarial attacks.

Methodological Overview

SAeUron involves training sparse autoencoders on the internal activations of Stable Diffusion models, specifically targeting the cross-attention blocks at various denoising timesteps. The method identifies concept-specific features using a score function that measures the importance of each feature for a given concept. During inference, these identified features are ablated to remove the targeted concept's influence on the generated output (Figure 1). This process leverages the additive nature of SAE reconstructions and the sparsity of activated features to ensure minimal impact on the diffusion model's overall performance.

Figure 1: Concept unlearning in SAeUron. We localize and ablate SAE features corresponding to the unwanted concept (Cartoon) while preserving the overall performance of the diffusion model.
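As a rough illustration of this pipeline, the following PyTorch-style sketch shows how such an intervention could be wired into a cross-attention block via a forward hook. The `sae`, `selected_features`, and `gamma_c` names, the hook placement, and the activation layout are assumptions for illustration rather than the authors' implementation; the feature selection and the multiplier are described below.

```python
import torch

def make_unlearning_hook(sae, selected_features, gamma_c):
    """Forward hook that rewrites a block's output activations.

    Each spatial position is encoded with the SAE, the features selected for
    the unwanted concept are scaled by a negative multiplier, and the result
    is decoded back. The SAE reconstruction error is added back so that
    directions the SAE does not capture pass through unchanged.
    """
    @torch.no_grad()
    def hook(module, inputs, output):
        b, c, h, w = output.shape                      # assumed (batch, channels, height, width)
        x = output.permute(0, 2, 3, 1).reshape(-1, c)  # one d-dimensional vector per position
        z = sae.encode(x)                              # sparse feature activations
        err = x - sae.decode(z)                        # reconstruction error to preserve
        z[:, selected_features] *= gamma_c             # negative multiplier ablates the concept
        x_mod = sae.decode(z) + err
        return x_mod.reshape(b, h, w, c).permute(0, 3, 1, 2).contiguous()
    return hook

# Illustrative usage on one cross-attention block of the U-Net:
# handle = unet.up_blocks[1].attentions[1].register_forward_hook(
#     make_unlearning_hook(sae, selected_features, gamma_c))
```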

The score function is defined as:

$$\text{score}(i, t, c, \mathcal{D}) = \frac{\mu(i,t,\mathcal{D}_c)}{\sum_{j=1}^{d}\mu(j,t,\mathcal{D}_c) + \delta} - \frac{\mu(i,t,\mathcal{D}_{\neg c})}{\sum_{j=1}^{d}\mu(j,t,\mathcal{D}_{\neg c}) + \delta}$$

where $\mu(i,t,\mathcal{D})$ is the average activation of the $i$-th feature on activations from timestep $t$, $\mathcal{D}_c$ is the dataset containing the target concept, and $\mathcal{D}_{\neg c}$ is the dataset without the target concept. The ablation is achieved by scaling the selected features with a negative multiplier $\gamma_c$, normalized by the average activation on concept samples.
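A minimal sketch of this selection step, assuming the per-timestep mean activations have already been computed (the array names and the percentile threshold are illustrative, not taken from the released code):

```python
import numpy as np

def feature_scores(mu_c, mu_not_c, delta=1e-8):
    """Per-feature importance at a fixed timestep t.

    mu_c, mu_not_c: arrays of shape (d,) holding the mean activation of each
    SAE feature on samples with / without the target concept c. Each feature's
    share of the total activation mass is compared between the two datasets,
    mirroring the score function above.
    """
    frac_c = mu_c / (mu_c.sum() + delta)
    frac_not_c = mu_not_c / (mu_not_c.sum() + delta)
    return frac_c - frac_not_c

def select_features(scores, percentile=99.5):
    """Keep only features whose score exceeds a high percentile threshold
    (the exact percentile is a hyperparameter, not a value from the paper)."""
    threshold = np.percentile(scores, percentile)
    return np.where(scores > threshold)[0]
```

The selected indices, together with the multiplier $\gamma_c$, parameterize the intervention sketched earlier.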

Sparse Autoencoders for Diffusion Models

The adaptation of sparse autoencoders to Stable Diffusion models involves training them on activations from every denoising step $t$. These activations are extracted from the cross-attention blocks and form feature maps $F_t \in \mathbb{R}^{h \times w \times d}$, where $h$ and $w$ denote the height and width of the feature map, and $d$ is the dimensionality of each feature vector. The SAEs are trained in an unsupervised manner to learn a set of sparse and semantically meaningful features.
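As a small illustrative helper (the names and shapes are assumptions), the SAE treats every spatial position of every collected feature map as an independent training example:

```python
import torch

@torch.no_grad()
def flatten_feature_maps(feature_maps):
    """Turn feature maps F_t of shape (h, w, d), gathered across prompts and
    denoising timesteps, into a single (N, d) matrix of SAE training vectors."""
    return torch.cat([f.reshape(-1, f.shape[-1]) for f in feature_maps], dim=0)
```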

The encoder and decoder of the ReLU sparse autoencoder are defined as:

$$\begin{aligned} \mathbf{z} &= \text{ReLU}\left(W_{\text{enc}}(\mathbf{x} - \mathbf{b}_{\text{pre}}) + \mathbf{b}_{\text{enc}}\right) \\ \mathbf{\hat{x}} &= W_{\text{dec}}\mathbf{z} + \mathbf{b}_{\text{pre}} \end{aligned}$$

where $W_{\text{enc}} \in \mathbb{R}^{n \times d}$ and $W_{\text{dec}} \in \mathbb{R}^{d \times n}$ are the encoder and decoder weight matrices, and $\mathbf{b}_{\text{pre}} \in \mathbb{R}^{d}$ and $\mathbf{b}_{\text{enc}} \in \mathbb{R}^{n}$ are learnable bias terms. The paper also explores TopK activation functions and the BatchTopK approach to enhance sparsity and flexibility.
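A minimal PyTorch sketch of such an SAE, following the equations above (the initialization, the optional TopK variant, and all hyperparameters are illustrative assumptions rather than the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """ReLU SAE matching the equations above; the TopK variant keeps only
    the k largest pre-activations per example (k and n are hyperparameters)."""

    def __init__(self, d, n, k=None):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n, d) * d ** -0.5)
        self.W_dec = nn.Parameter(torch.randn(d, n) * n ** -0.5)
        self.b_pre = nn.Parameter(torch.zeros(d))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.k = k  # None: plain ReLU SAE; int: TopK SAE

    def encode(self, x):
        z = torch.relu((x - self.b_pre) @ self.W_enc.T + self.b_enc)
        if self.k is not None:
            # Zero out everything except the k largest activations per row.
            topk = torch.topk(z, self.k, dim=-1)
            mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
            z = z * mask
        return z

    def decode(self, z):
        return z @ self.W_dec.T + self.b_pre

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z
```

Training would minimize the reconstruction error, with an L1 penalty on $\mathbf{z}$ in the plain ReLU case; the TopK and BatchTopK variants instead enforce sparsity directly by construction.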

Experimental Validation

The SAeUron method was evaluated using the UnlearnCanvas benchmark, which includes 20 objects and 50 styles. The experiments demonstrated that SAeUron achieves state-of-the-art performance in unlearning without significantly affecting the diffusion model's overall performance. The method's robustness was also tested under adversarial attacks, showing that SAeUron effectively removes targeted concepts rather than merely masking them.

Figure 2: Feature importance scores on the validation set. Most features have importance scores close to zero, signifying that the SAE learns only a few concept-specific features. During evaluation, we find a threshold based on a percentile of the scores and block features with scores greater than it.

Implications and Future Directions

The SAeUron method provides a transparent and interpretable approach to concept unlearning in diffusion models. The use of sparse autoencoders allows for the identification and manipulation of specific features related to unwanted concepts. This approach shows potential for further research in mechanistic interpretability and the development of more robust and controllable generative models. Future work may explore extending SAeUron to other types of generative models and investigating its applicability in various real-world scenarios where concept unlearning is required. Furthermore, the ability to unlearn multiple concepts simultaneously and the demonstrated robustness against adversarial attacks highlight the practical advantages of SAeUron over traditional fine-tuning methods.

Conclusion

SAeUron introduces a novel and effective method for interpretable concept unlearning in diffusion models. By leveraging sparse autoencoders, the approach achieves state-of-the-art performance, robustness, and transparency. This work contributes to the field of machine unlearning and offers a promising direction for future research in controllable and interpretable AI.
