- The paper introduces 'Interpret then Deactivate' (ItD), a framework that leverages Sparse Autoencoders for zero-shot concept erasing in text-to-image models.
- It employs a contrast-based feature selection method to identify and selectively deactivate target concepts while preserving non-target content.
- Experiments demonstrate robust removal of 50 celebrity identities, 100 artistic styles, and NSFW content with minimal impact on overall image quality.
This paper introduces "Interpret then Deactivate" (ItD), a framework for precisely erasing unwanted concepts (specific identities, artistic styles, or NSFW content) from text-to-image (T2I) diffusion models without degrading generation of normal, unrelated concepts (Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models, 12 Mar 2025). Existing methods often suffer from "forgetting," where erasing one concept degrades others. ItD aims to overcome this by being both precise and expandable: new concepts can be erased without retraining.
Core Idea: Using Sparse Autoencoders (SAE)
The central idea is to use a Sparse Autoencoder (SAE), an unsupervised model, to interpret the semantic space of the T2I model's text encoder.
- Interpretation with SAE:
- An SAE is trained on the token embeddings extracted from the residual stream of intermediate transformer blocks within the text encoder (specifically, layer 8 was found effective empirically).
- The SAE learns to reconstruct these token embeddings e as a sparse linear combination of learned features f_ρ: e ≈ Σ_ρ z_ρ f_ρ, where z_ρ are the sparse activations with ‖z‖₀ ≤ K. Each token embedding, and thus each concept composed of tokens, can therefore be represented by a small set of active features from a large dictionary learned by the SAE.
- The paper uses a K-Sparse Autoencoder (K-SAE) variant, explicitly keeping only the top-K activations for reconstruction, and includes an auxiliary loss to prevent "dead features."
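The top-K reconstruction described above can be sketched in a few lines (a minimal NumPy illustration; the dimensions, random initialization, and K are placeholders, and the training loop and auxiliary dead-feature loss are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

class KSparseAutoencoder:
    """Illustrative K-SAE: ReLU encoder, top-K sparsification, linear decoder."""
    def __init__(self, d_model, d_hidden, k):
        self.W_enc = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
        self.W_dec = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
        self.k = k

    def encode(self, e):
        z = np.maximum(e @ self.W_enc, 0.0)        # ReLU pre-activations
        # Keep only the top-K activations per token embedding; zero the rest.
        drop = np.argsort(z, axis=-1)[:, :-self.k]
        np.put_along_axis(z, drop, 0.0, axis=-1)
        return z

    def decode(self, z):
        # e ≈ Σ_ρ z_ρ f_ρ: the rows of W_dec play the role of the features f_ρ
        return z @ self.W_dec

sae = KSparseAutoencoder(d_model=768, d_hidden=4096, k=32)
e = rng.standard_normal((4, 768))   # a batch of token embeddings
z = sae.encode(e)                   # sparse activations, at most K nonzero per row
e_hat = sae.decode(z)               # reconstruction of the embeddings
```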
- Feature Selection for Erasure:
- To erase a target concept C_tar, ItD first identifies the set of features F_tar strongly activated by its token embeddings.
- Crucially, to avoid impacting normal concepts, it employs a contrast-based approach: it identifies the features F_{C_r} activated by a set of "retain" concepts C_retain (concepts intended to be preserved).
- The features selected for deactivation, F̂_tar, are those specific to the target concept, obtained by removing features shared with retain concepts: F̂_tar = F_tar \ ⋃_{C_r ∈ C_retain} F_{C_r}.
- For erasing multiple concepts, the final set of features to deactivate, F_erase, is the union of the specific features of each target concept. This makes the method expandable without retraining the SAE.
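The contrast-based selection amounts to a set difference over the most strongly activated SAE features. A minimal sketch (the helper names, the averaging over tokens, and the selection threshold `k_sel` are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_features(z, k_sel):
    """Indices of the k_sel most strongly activated SAE features (mean over tokens)."""
    return set(np.argsort(z.mean(axis=0))[-k_sel:].tolist())

def select_erase_features(z_target, z_retain_list, k_sel=16):
    """Features specific to the target: F_tar minus features shared with retain concepts."""
    f_tar = top_features(z_target, k_sel)
    shared = set().union(*(top_features(z_r, k_sel) for z_r in z_retain_list))
    return f_tar - shared

# Stand-in SAE activations, shape (num_tokens, d_hidden), for one target
# concept and two retain concepts.
z_tar = rng.random((8, 4096))
z_ret = [rng.random((8, 4096)) for _ in range(2)]
f_erase = select_erase_features(z_tar, z_ret)   # features unique to the target
```

Erasing an additional concept only repeats this selection and unions the result into `f_erase`, which is what makes the method expandable without retraining.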
- Deactivation during Inference:
- The trained SAE (encoder and decoder) is wrapped into a "deactivation block" inserted into the text encoder's residual stream (e.g., after layer 8).
- During image generation, when a text prompt's embedding e passes through this block:
- It's encoded into activations s by the SAE encoder.
- Activations s_ρ corresponding to the erasure features (ρ ∈ F_erase) are scaled down by a factor τ (e.g., τ = 0.1 or lower): ŝ_ρ = s_ρ · τ. Other activations remain unchanged.
- The modified activations ŝ are decoded back into a modified embedding ê by the SAE decoder.
- This modified embedding ê, now lacking the targeted concept information, is passed to the subsequent layers and the U-Net for image generation.
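The scaling-and-decoding step can be sketched as follows (the stand-in decoder matrix and feature indices are placeholders; in ItD they come from the trained SAE and the selected F_erase):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_model = 4096, 768
# Stand-in for the trained SAE decoder weights.
W_dec = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)

def deactivate(s, erase_features, tau=0.1):
    """Scale the activations of erasure features by tau; leave the rest unchanged."""
    s_hat = s.copy()
    idx = sorted(erase_features)
    s_hat[:, idx] *= tau            # ŝ_ρ = s_ρ · τ for ρ ∈ F_erase
    return s_hat

s = rng.random((8, d_hidden))       # SAE activations of a prompt's token embeddings
s_hat = deactivate(s, {3, 17, 42})
e_hat = s_hat @ W_dec               # decoded embedding with the concept suppressed
```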
- Selective Deactivation using SAE as Zero-Shot Classifier:
- To further minimize impact on normal concepts potentially affected by the SAE's reconstruction process itself, ItD introduces a selective application mechanism.
- It repurposes the SAE to classify whether an input embedding e likely contains a target concept, by computing the reconstruction error ‖e − ê‖₂ of the plain SAE reconstruction (before targeted deactivation).
- If the reconstruction error is below a threshold τ′, the embedding is likely related to a target concept (the SAE reconstructs it well), and the deactivated embedding ê is used.
- If the error is above the threshold, the embedding is likely unrelated to target concepts, and the original embedding e is passed through, bypassing the deactivation step entirely.
- Experiments show a clear separation in reconstruction loss between target and non-target concepts (Figure 4), making this classification feasible.
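The selective mechanism reduces to a threshold test on the reconstruction error. A minimal sketch (the threshold value and the synthetic "reconstructions" are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def is_target_concept(e, e_recon, threshold):
    """Low SAE reconstruction error → embedding likely contains a target concept."""
    return float(np.linalg.norm(e - e_recon)) < threshold

e = rng.standard_normal(768)
e_recon_good = e + 0.01 * rng.standard_normal(768)  # target-like: reconstructs well
e_recon_poor = e + rng.standard_normal(768)         # unrelated: large error

# Use the deactivated embedding only when the prompt looks target-related;
# otherwise bypass the deactivation block and keep the original embedding.
use_deactivated = is_target_concept(e, e_recon_good, threshold=1.0)
```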
Implementation Details and Considerations:
- Placement: Applying the SAE to the text encoder (layer 8 residual stream) is preferred over modifying the U-Net because of its lower complexity, the dominant role of text embeddings in specifying concepts, and potentially better robustness.
- Training: SAE training is unsupervised using text prompts (e.g., celebrity names, artist names, COCO captions). It's efficient, taking under an hour on a single H100.
- Efficiency: The SAE block adds minimal inference overhead (< 1% of total generation time), mainly involving two matrix multiplications. The overhead is constant regardless of the number of concepts being erased.
- Hyperparameters: Key parameters include K in the K-SAE, the deactivation strength τ, the classification threshold τ′, the SAE hidden dimension d_hid, and the feature-selection threshold K_sel. Ablations show robustness to τ and identify layer 8 and a large d_hid (e.g., 2¹⁹) as effective choices.
- Expandability: Adding new concepts to erase only requires identifying their specific features using the pre-trained SAE and the contrastive method; no SAE retraining is needed.
Experiments and Results:
- ItD was evaluated on erasing 50 celebrities, 100 artistic styles, and nudity/explicit content from Stable Diffusion 1.4.
- It was compared against fine-tuning-based methods (ESD-x, AC, AdvUn, Receler) and inference-based methods (MACE, CPE).
- Metrics included erasure success (CLIP Score, ACC, NudeNet detection) and preservation of remaining concepts (CLIP Score, ACC, FID, KID) on various datasets (including unseen ones like DiffusionDB-10K).
- Results show ItD effectively erases target concepts while significantly outperforming baselines in preserving generation quality for remaining and unseen concepts (Tables 1, 2, 4; Figures 5, 6).
- ItD also demonstrated strong robustness against adversarial prompts designed to bypass safety filters (Table 3).
Conclusion:
The paper presents ItD, a novel method using Sparse Autoencoders to interpret T2I text encoder semantics and perform precise, expandable concept erasure. By identifying concept-specific features via a contrastive approach and selectively deactivating them using the SAE as a zero-shot classifier, ItD effectively removes unwanted content with minimal impact on the model's general capabilities, offering a practical solution for safer T2I model deployment.