- The paper introduces 'Interpret then Deactivate' (ItD), a framework that leverages Sparse Autoencoders for zero-shot concept erasing in text-to-image models.
- It employs a contrast-based feature selection method to identify and selectively deactivate target concepts while preserving non-target content.
- Experiments demonstrate robust removal of 50 celebrity identities, 100 artistic styles, and NSFW content with minimal impact on overall image quality.
This paper introduces "Interpret then Deactivate" (ItD), a framework for precisely erasing unwanted concepts (specific identities, artistic styles, or NSFW content) from text-to-image (T2I) diffusion models without degrading generation of normal, unrelated concepts (Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models, 12 Mar 2025). Existing methods often suffer from "forgetting," where erasing one concept degrades others. ItD aims to overcome this by being both precise and expandable: new concepts can be erased without retraining.
Core Idea: Using Sparse Autoencoders (SAE)
The central idea is to use a Sparse Autoencoder (SAE), an unsupervised model, to interpret the semantic space of the T2I model's text encoder.
- Interpretation with SAE:
- An SAE is trained on the token embeddings extracted from the residual stream of intermediate transformer blocks within the text encoder (specifically, layer 8 was found effective empirically).
- The SAE learns to reconstruct these token embeddings e as a sparse linear combination of learned features f_ρ: e ≈ Σ_ρ z_ρ f_ρ, where z_ρ are the sparse activations with ‖z‖₀ ≤ K. Each token embedding, and thus each concept composed of tokens, can therefore be represented by a small set of active features from a large dictionary learned by the SAE.
- The paper uses a K-Sparse Autoencoder (K-SAE) variant, explicitly keeping only the top-K activations for reconstruction, and includes an auxiliary loss to prevent "dead features."
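The top-K reconstruction described above can be sketched in a few lines (a minimal NumPy illustration; the dimensions, random initialization, and K are placeholders, and the training loop and auxiliary dead-feature loss are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

class KSparseAutoencoder:
    """Illustrative K-SAE: ReLU encoder, top-K sparsification, linear decoder."""
    def __init__(self, d_model, d_hidden, k):
        self.W_enc = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.W_dec = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
        self.k = k

    def encode(self, e):
        z = np.maximum(e @ self.W_enc, 0.0)        # ReLU pre-activations
        # Keep only the top-K activations per token embedding; zero the rest.
        drop = np.argsort(z, axis=-1)[:, :-self.k]
        np.put_along_axis(z, drop, 0.0, axis=-1)
        return z

    def decode(self, z):
        # e ≈ Σ_ρ z_ρ f_ρ: the rows of W_dec play the role of the features f_ρ
        return z @ self.W_dec

sae = KSparseAutoencoder(d_model=768, d_hidden=4096, k=32)
e = rng.standard_normal((4, 768))   # a batch of token embeddings
z = sae.encode(e)                   # sparse activations, at most K nonzero per row
e_hat = sae.decode(z)               # reconstruction of the embeddings
```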
- Feature Selection for Erasure:
- To erase a target concept C_tar, ItD first identifies the set of features F_tar strongly activated by its token embeddings.
- Crucially, to avoid impacting normal concepts, it employs a contrast-based approach: it identifies the features F_{C_r} activated by a set of "retain" concepts C_retain (concepts intended to be preserved).
- The features selected for deactivation, F̂_tar, are those specific to the target concept, obtained by removing features shared with retain concepts: F̂_tar = F_tar \ ⋃_{C_r ∈ C_retain} F_{C_r}.
- For erasing multiple concepts, the final set of features to deactivate, F_erase, is the union of the specific features of each target concept. This makes the method expandable without retraining the SAE.
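The contrast-based selection amounts to a set difference over the most strongly activated SAE features. A minimal sketch (the helper names, the averaging over tokens, and the selection threshold `k_sel` are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_features(z, k_sel):
    """Indices of the k_sel most strongly activated SAE features (mean over tokens)."""
    return set(np.argsort(z.mean(axis=0))[-k_sel:].tolist())

def select_erase_features(z_target, z_retain_list, k_sel=16):
    """Features specific to the target: F_tar minus features shared with retain concepts."""
    f_tar = top_features(z_target, k_sel)
    shared = set().union(*(top_features(z_r, k_sel) for z_r in z_retain_list))
    return f_tar - shared

# Stand-in SAE activations, shape (num_tokens, d_hidden), for one target
# concept and two retain concepts.
z_tar = rng.random((8, 4096))
z_ret = [rng.random((8, 4096)) for _ in range(2)]
f_erase = select_erase_features(z_tar, z_ret)   # features unique to the target
```

Erasing an additional concept only repeats this selection and unions the result into `f_erase`, which is what makes the method expandable without retraining.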
- Deactivation during Inference:
- The trained SAE (encoder and decoder) is wrapped into a "deactivation block" inserted into the text encoder's residual stream (e.g., after layer 8).
- During image generation, when a text prompt's embedding e passes through this block:
- It's encoded into activations s by the SAE encoder.
- Activations s_ρ corresponding to the erasure features (ρ ∈ F_erase) are scaled down by a factor τ (e.g., τ = 0.1 or lower): ŝ_ρ = s_ρ · τ. Other activations remain unchanged.
- The modified activations ŝ are decoded back into a modified embedding ê by the SAE decoder.
- This modified embedding ê, now lacking the targeted concept information, is passed to the subsequent layers and the U-Net for image generation.
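The scaling-and-decoding step can be sketched as follows (the stand-in decoder matrix and feature indices are placeholders; in ItD they come from the trained SAE and the selected F_erase):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_model = 4096, 768
# Stand-in for the trained SAE decoder weights.
W_dec = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

def deactivate(s, erase_features, tau=0.1):
    """Scale the activations of erasure features by tau; leave the rest unchanged."""
    s_hat = s.copy()
    idx = sorted(erase_features)
    s_hat[:, idx] *= tau            # ŝ_ρ = s_ρ · τ for ρ ∈ F_erase
    return s_hat

s = rng.random((8, d_hidden))       # SAE activations of a prompt's token embeddings
s_hat = deactivate(s, {3, 17, 42})
e_hat = s_hat @ W_dec               # decoded embedding with the concept suppressed
```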
- Selective Deactivation using SAE as Zero-Shot Classifier:
- To further minimize impact on normal concepts potentially affected by the SAE's reconstruction process itself, ItD introduces a selective application mechanism.
- It repurposes the SAE to classify whether an input embedding e likely contains a target concept, by computing the reconstruction error ‖e − ê‖₂ of the plain SAE reconstruction (before targeted deactivation).
- If the reconstruction error is below a threshold τ′, the embedding is likely related to a target concept (the SAE reconstructs it well), and the deactivated embedding ê is used.
- If the error is above the threshold, the embedding is likely unrelated to target concepts, and the original embedding e is passed through, bypassing the deactivation step entirely.
- Experiments show a clear separation in reconstruction loss between target and non-target concepts (Figure 4), making this classification feasible.
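The selective mechanism reduces to a threshold test on the reconstruction error. A minimal sketch (the threshold value and the synthetic "reconstructions" are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_target_concept(e, e_recon, threshold):
    """Low SAE reconstruction error → embedding likely contains a target concept."""
    return float(np.linalg.norm(e - e_recon)) < threshold

e = rng.standard_normal(768)
e_recon_good = e + 0.01 * rng.standard_normal(768)  # target-like: reconstructs well
e_recon_poor = e + rng.standard_normal(768)         # unrelated: large error

# Use the deactivated embedding only when the prompt looks target-related;
# otherwise bypass the deactivation block and keep the original embedding.
use_deactivated = is_target_concept(e, e_recon_good, threshold=1.0)
```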
Implementation Details and Considerations:
- Placement: Applying the SAE to the text encoder (layer 8 residual stream) is preferred over modifying the U-Net because of its lower complexity, the dominant role of text embeddings in specifying concepts, and potentially better robustness.
- Training: SAE training is unsupervised using text prompts (e.g., celebrity names, artist names, COCO captions). It's efficient, taking under an hour on a single H100.
- Efficiency: The SAE block adds minimal inference overhead (< 1% of total generation time), mainly involving two matrix multiplications. The overhead is constant regardless of the number of concepts being erased.
- Hyperparameters: Key parameters include K in the K-SAE, the deactivation strength τ, the classification threshold τ′, the SAE hidden dimension d_hid, and the feature-selection threshold K_sel. Ablations show robustness to τ and identify layer 8 and a large d_hid (e.g., 2¹⁹) as effective choices.
- Expandability: Adding new concepts to erase only requires identifying their specific features using the pre-trained SAE and the contrastive method; no SAE retraining is needed.
Experiments and Results:
- ItD was evaluated on erasing 50 celebrities, 100 artistic styles, and nudity/explicit content from Stable Diffusion 1.4.
- It was compared against fine-tuning-based methods (ESD-x, AC, AdvUn, Receler) and inference-based methods (MACE, CPE).
- Metrics included erasure success (CLIP Score, ACC, NudeNet detection) and preservation of remaining concepts (CLIP Score, ACC, FID, KID) on various datasets (including unseen ones like DiffusionDB-10K).
- Results show ItD effectively erases target concepts while significantly outperforming baselines in preserving generation quality for remaining and unseen concepts (Tables 1, 2, 4; Figures 5, 6).
- ItD also demonstrated strong robustness against adversarial prompts designed to bypass safety filters (Table 3).
Conclusion:
The paper presents ItD, a novel method using Sparse Autoencoders to interpret T2I text encoder semantics and perform precise, expandable concept erasure. By identifying concept-specific features via a contrastive approach and selectively deactivating them using the SAE as a zero-shot classifier, ItD effectively removes unwanted content with minimal impact on the model's general capabilities, offering a practical solution for safer T2I model deployment.