SAEmnesia: Supervised Unlearning in Diffusion Models
- SAEmnesia is a supervised unlearning framework that uses sparse autoencoders to enforce one-to-one concept–neuron mappings, enhancing interpretability in high-dimensional diffusion models.
- It employs a two-phase training process with unsupervised pre-training and supervised concept assignment, reducing computational overhead by 96.67% compared to baseline approaches.
- The framework achieves a 9.22% improvement in unlearning efficacy and is ideally suited for applications in content moderation, privacy compliance, and mechanistic interpretability research.
SAEmnesia is a supervised unlearning framework for text-to-image diffusion models that utilizes sparse autoencoder training to enable efficient, interpretable, and scalable concept removal in high-dimensional generative systems. The method systematically enforces one-to-one concept–neuron mappings, mitigating the challenges of distributed representations and feature entanglement inherent to dense diffusion model latents. SAEmnesia achieves strong improvements in unlearning accuracy, computational efficiency, and mechanistic interpretability compared to previous unsupervised and fine-tuning–based unlearning approaches.
1. Motivation and Theoretical Foundations
Effective concept unlearning in diffusion models requires precise localization of concept representations within the network's latent space. Prior approaches using sparse autoencoders reduce neuron polysemanticity (where single neurons represent multiple concepts), but individual concepts may still be fragmented across multiple latent features ("feature splitting"). This fragmentation necessitates combinatorial search and threshold selection during unlearning, resulting in substantial computational overhead.
SAEmnesia addresses this by introducing supervised sparse autoencoder training that uses systematic concept labeling. The supervised training promotes feature centralization, binding each target concept to a unique latent neuron. This mechanism enables interpretable, direct modification of latents for unlearning and obviates the need for exhaustive latent searches.
2. Supervised Sparse Autoencoder Architecture
SAEmnesia advances standard sparse autoencoder methodology by layering supervised and unsupervised objectives:
- Unsupervised SAE Pre-training: Latent encodings are first learned on activations extracted from the cross-attention blocks during the diffusion process.
- Supervised Phase: Concept labels are incorporated to guide the specialization of neurons. For each annotated concept in the training data, the model enforces strong association of that concept with a specific latent neuron.
The encoder–decoder operations are as follows:

$$z = \mathrm{TopK}\!\left(W_{\mathrm{enc}}\,x + b_{\mathrm{enc}}\right), \qquad \hat{x} = W_{\mathrm{dec}}\,z + b_{\mathrm{dec}},$$

where $x$ is the input activation vector, $z$ is the sparse latent code, and $\hat{x}$ is the reconstruction.
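Under standard Top-K SAE assumptions (a linear encoder and decoder with a Top-K sparsification of the latent code; the layer shapes below are purely illustrative), the encoder–decoder step can be sketched in NumPy:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Illustrative Top-K sparse autoencoder forward pass."""
    pre = W_enc @ x + b_enc              # encoder pre-activations, shape (m,)
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]           # indices of the k largest pre-activations
    z[top] = np.maximum(pre[top], 0.0)   # keep only the Top-K entries (ReLU'd)
    x_hat = W_dec @ z + b_dec            # linear decoder reconstruction
    return z, x_hat

# Toy usage: 8-dim activations, 16 latents, sparsity k = 3
rng = np.random.default_rng(0)
d, m, k = 8, 16, 3
W_enc, b_enc = rng.normal(size=(m, d)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(d, m)), np.zeros(d)
z, x_hat = topk_sae_forward(rng.normal(size=d), W_enc, b_enc, W_dec, b_dec, k)
```

The resulting code `z` has at most `k` nonzero entries, which is the sparsity mechanism the supervised phase later exploits to pin each concept to a single neuron.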
The Top-K activation rule keeps only the $K$ largest elements of $z$ for each sample, enforcing sparsity at the neuron level. The supervised concept assignment employs the Concept Assignment (CA) loss:

$$\mathcal{L}_{\mathrm{CA}} = -\sum_{c \in \mathcal{C}(x)} \log \sigma\!\left(z_{i_c}\right),$$

where $\mathcal{C}(x)$ is the set of concepts present in the sample, $i_c$ indexes the latent neuron assigned to concept $c$, and $\sigma$ is the logistic sigmoid.
To avoid concept entanglement, an orthogonality constraint minimizes inter-concept activation correlation.
The overall loss is a weighted sum:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\,\mathcal{L}_{\mathrm{sup}},$$

where $\mathcal{L}_{\mathrm{rec}}$ is the reconstruction error and $\mathcal{L}_{\mathrm{sup}}$ includes the CA loss and the orthogonality loss. In some variants, a global cross-entropy over softmaxed latents is used:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c \in \mathcal{C}(x)} \log\,\mathrm{softmax}(z)_{i_c}.$$
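A minimal sketch of how such a weighted objective could be assembled, using the global cross-entropy variant over softmaxed latents for concept assignment and a simple co-activation penalty as the orthogonality term (the weights `lam_ca` and `lam_orth` and the exact penalty form are hypothetical):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sae_training_loss(x, x_hat, z, concept_idx, lam_ca=1.0, lam_orth=0.1):
    """Weighted sum: reconstruction + concept assignment + orthogonality.

    concept_idx: indices i_c of the latent neurons assigned to the
    concepts present in this sample.
    """
    l_rec = np.mean((x - x_hat) ** 2)                 # reconstruction error
    p = softmax(z)
    l_ca = -np.mean([np.log(p[i] + 1e-9) for i in concept_idx])  # CE variant
    # penalize co-activation of distinct concept neurons on the same sample
    l_orth = sum(z[i] * z[j]
                 for a, i in enumerate(concept_idx)
                 for j in concept_idx[a + 1:])
    return l_rec + lam_ca * l_ca + lam_orth * l_orth

# Toy usage: two concepts assigned to latents 0 and 3
x = np.array([1.0, 0.5, -0.2])
x_hat = np.array([0.9, 0.4, -0.1])
z = np.array([2.0, 0.0, 0.0, 1.5, 0.0, 0.0])  # sparse latent code
loss = sae_training_loss(x, x_hat, z, concept_idx=[0, 3])
```

In practice these terms would be computed batchwise on GPU tensors; the sketch only shows how the three components combine.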
3. Concept–Neuron Mapping and Interpretability
SAEmnesia enforces explicit one-to-one correspondences between concepts and latent activations. Given dataset splits $D_c$ (samples containing concept $c$) and $D_{\neg c}$ (samples excluding it), the association of concept $c$ with latent $j$ is quantified as

$$s(c, j) = \frac{\bar{z}_j(D_c)}{\bar{z}_j(D_{\neg c}) + \epsilon},$$

where $\bar{z}_j(\cdot)$ is the average activation of latent $j$ over the given split and $\epsilon$ prevents division by zero.
Feature centralization is achieved iff every concept is maximally associated with a unique latent:

$$i_c = \arg\max_j\, s(c, j), \qquad i_c \neq i_{c'} \ \text{for}\ c \neq c'.$$
This structure dramatically increases the mechanistic interpretability of the latent space: intervening on neuron $i_c$ reliably modulates only concept $c$ and not others.
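The association score and the uniqueness check can be sketched directly from the definitions above (array shapes and the toy numbers are illustrative assumptions):

```python
import numpy as np

def association_scores(Z_present, Z_absent, eps=1e-8):
    """s(c, j): mean activation of each latent j on samples containing
    concept c, divided by its mean activation on the remaining samples."""
    return Z_present.mean(axis=0) / (Z_absent.mean(axis=0) + eps)

def is_centralized(S):
    """S[c, j] = association of concept c with latent j.
    Centralization holds iff every concept's best latent is distinct."""
    winners = S.argmax(axis=1)
    return len(set(winners.tolist())) == S.shape[0]

# Toy usage: latent 0 fires on concept-c samples, latent 1 does not
Z_present = np.array([[1.0, 0.0], [3.0, 0.0]])   # activations on D_c
Z_absent  = np.array([[0.1, 1.0], [0.1, 1.0]])   # activations on D_{not c}
s = association_scores(Z_present, Z_absent)      # latent 0 scores far higher

# Two concepts mapped to distinct latents 0 and 2 -> centralized
S = np.array([[5.0, 0.1, 0.2, 0.1],
              [0.2, 0.1, 6.0, 0.3]])
```

With centralization verified, unlearning a concept reduces to intervening on its single winning latent rather than searching latent combinations.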
4. Computational Efficiency and Hyperparameter Search
Conventional sparse unlearning approaches (e.g., SAeUron, EDiff) require an exhaustive search over latent combinations and intervention strengths (e.g., 7 multipliers × 30 combinations = 210 evaluations per concept). SAEmnesia's one-to-one mapping reduces this to a search over a single dimension (e.g., 7 evaluations per concept), a 96.67% reduction in computational overhead.
The only additional expense incurred by SAEmnesia is the supervised cross-entropy (or CA) computation during training. At inference, direct manipulation of the mapped latent obviates the combinatorial search present in baseline techniques.
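The budget arithmetic is simple enough to verify directly; the multiplier and combination counts below are the ones quoted above:

```python
# Search budget per concept, using the counts quoted in the text:
# 7 intervention multipliers and 30 candidate latent combinations.
n_multipliers = 7
n_latent_combinations = 30

baseline_evals = n_multipliers * n_latent_combinations  # joint search: 210
saemnesia_evals = n_multipliers                         # mapped neuron known: 7

reduction = 1 - saemnesia_evals / baseline_evals
print(f"{baseline_evals} -> {saemnesia_evals} evaluations "
      f"({reduction:.2%} fewer)")  # 210 -> 7 evaluations (96.67% fewer)
```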
5. Experimental Validation and Unlearning Performance
SAEmnesia is evaluated on the UnlearnCanvas benchmark, where it achieves:
- 9.22% improvement in unlearning efficacy compared to the state-of-the-art unsupervised method (91.51% vs. 82.29% mean score).
- In sequential multi-concept tasks (e.g., 9-object removal), SAEmnesia yields 92.4% unlearning accuracy, surpassing previous baselines by 28.4% (baseline: 64%).
A table summarizing key efficiency improvements:
| Model | Hyperparameter Search (evaluations per concept) | UnlearnCanvas Accuracy (%) |
|---|---|---|
| SAeUron/Baseline | 210 | 82.29 |
| SAEmnesia | 7 | 91.51 |
This suggests that centralized mappings in SAEmnesia not only increase efficiency but also robustly scale as unlearning complexity increases.
6. Comparative Evaluation and Scalability
Relative to competing approaches (EDiff, ESD, SAeUron):
- Efficiency: SAEmnesia eliminates feature threshold searches and reduces fine-tuning requirements.
- Accuracy: Both single and multi-concept removal tasks show higher unlearning and retention metrics (evaluated as IRA and CRA).
- Computational Overhead: Training-time only; inference is direct and low-cost.
A plausible implication is that SAEmnesia’s scalable architecture is well-suited for large-scale, high-dimensional generative models where many concepts must be managed or erased serially.
7. Applications and Broader Implications
SAEmnesia’s interpretable, efficient unlearning paradigm has several concrete applications:
- Content Moderation: Automated suppression of undesirable (e.g., harmful, inappropriate, copyrighted) content in generated images.
- Privacy and Compliance: Targeted erasure of personal or sensitive visual attributes for regulatory adherence.
- Mechanistic Interpretability: Promotes transparent mapping between latent features and semantic concepts, providing a foundation for further research into neural interpretability.
Future research enabled by this approach includes extending interpretability-driven unlearning to text-to-video or multimodal architectures, and investigating adversarial robustness: centralized concept representations may resist adversarial manipulations that exploit distributed concept representations.
Conclusion
SAEmnesia is a supervised unlearning method for text-to-image diffusion models, implementing interpretable, one-to-one concept–neuron mappings via supervised sparse autoencoder training. This reduces neuron polysemanticity, minimizes computational effort (a 96.67% reduction in search), and improves unlearning accuracy (9.22% improvement on UnlearnCanvas, 28.4% in sequential tasks). Its interpretability and scalability mark it as an efficient and reliable protocol for targeted concept suppression in generative models, with implications for content safety, privacy compliance, and future mechanistic interpretability research (Cassano et al., 23 Sep 2025).