Bidirectional Image-Guided Concept Erasure
- Bidirectional Image-Guided Concept Erasure is a novel framework that employs dual (negative and positive) image guidance to fine-tune diffusion models, addressing over-suppression issues.
- It integrates mask-based filtering and parallel cross-attention channels within the UNet backbone to isolate target concepts and maintain image clarity during erasure.
- Experimental results highlight an 84% reduction in unsafe outputs and improved trade-offs between erasure accuracy and generation quality.
Bidirectional Image-Guided Concept Erasure (Bi-Erasing) is a framework for fine-tuning diffusion models to remove unwanted visual concepts while preserving image fidelity and usability. The approach addresses the limitations of unidirectional concept erasure, which typically destabilizes model behavior through over-suppression or semantic drift, by introducing complementary negative and positive image conditioning. Bi-Erasing jointly optimizes these opposing directions to achieve effective concept suppression and constructive guidance toward safe alternatives, stabilizing the denoising trajectory and balancing erasure efficacy with output quality (Chen et al., 15 Dec 2025).
1. Motivation and Theoretical Foundations
Unidirectional erasure approaches such as ESD, FMN, and TRCE apply only a suppressive (“push-only”) force to repel unwanted concepts, resulting in degraded outputs manifested as trajectory drift, semantic collapse, and content voids. These approaches lack a constructive signal to fill erased regions with plausible content. Bi-Erasing introduces a push–pull mechanism, comprising a negative branch to repel unsafe semantics and a positive branch to pull the generation toward coherent, benign content. Both branches operate over a joint representation space that combines text prompt and image guidance. This dual conditioning stabilizes diffusion model behavior and resolves the trade-off between erasure effectiveness and generation quality (Chen et al., 15 Dec 2025).
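The push–pull composition can be sketched numerically. Below is a minimal numpy illustration, assuming an ESD-style composed denoising target; the function name, the exact composition, and the weights are illustrative, not the paper's exact formulation.

```python
import numpy as np

def push_pull_target(eps_uncond, eps_neg, eps_pos, w_neg=1.0, w_pos=1.0):
    """Compose a denoising target that repels the unsafe concept (push)
    and attracts a safe alternative (pull).

    All inputs are noise predictions of the same shape from a frozen
    reference model under unconditional, negative-conditioned, and
    positive-conditioned inputs; w_neg/w_pos are branch weights.
    """
    push = eps_neg - eps_uncond   # direction toward the unsafe concept
    pull = eps_pos - eps_uncond   # direction toward the safe alternative
    return eps_uncond - w_neg * push + w_pos * pull

# Toy check: when push and pull directions coincide with equal weights,
# the two forces cancel and the target equals the unconditional prediction.
e0 = np.zeros(4)
t = push_pull_target(e0, np.ones(4), np.ones(4), w_neg=1.0, w_pos=1.0)
```

A push-only method corresponds to `w_pos = 0`, which leaves nothing to fill the erased region; the pull term supplies that constructive signal.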
2. Framework Architecture and Training Algorithm
Bi-Erasing operates by decoupling negative and positive image guidance into parallel cross-attention channels within the UNet backbone. The framework components and workflow are as follows:
- Input Encodings:
- Text prompts are encoded with a frozen CLIP text encoder and injected into the UNet via cross-attention.
- Negative and positive example images are embedded by a frozen CLIP vision encoder, then mapped into the text-attention space by a learnable projector.
- The resulting negative and positive image embeddings are incorporated separately into the corresponding cross-attention branches alongside the text conditioning.
- Mask-Based Preprocessing:
- Semantic masks are generated offline using a CLIP-based segmentation model, isolating concept-related regions in both negative and positive images.
- Masked images are defined as $\tilde{x} = M \odot x + (1 - M) \cdot \bar{c}$, where $M$ is the binary concept mask and $\bar{c}$ is the image's mean color, effectively filtering out distractors.
- Algorithmic Steps (per training iteration):
- 1. Encode the text prompt with the frozen CLIP text encoder.
- 2. Negative: apply the precomputed semantic mask to the negative example image and embed it with the frozen CLIP vision encoder.
- 3. Project the negative embedding into the text-attention space to form the negative conditioning.
- 4. Positive: repeat the masking, embedding, and projection steps for the positive example image to form the positive conditioning.
- 5. Sample a noised latent and timestep for the current example.
- 6. Predict with the trainable UNet using negative and positive conditioning.
- 7. Compute losses and perform backpropagation to update learned parameters (Chen et al., 15 Dec 2025).
The training objective aggregates mean squared error (MSE) losses for each branch: $\mathcal{L} = \lambda_{\text{neg}} \mathcal{L}_{\text{neg}} + \lambda_{\text{pos}} \mathcal{L}_{\text{pos}}$, with $\mathcal{L}_{\text{neg}}$ and $\mathcal{L}_{\text{pos}}$ the branch-wise MSE losses, and $\lambda_{\text{neg}}$, $\lambda_{\text{pos}}$ weighting hyperparameters.
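The aggregated objective above reduces to a weighted sum of two MSE terms; a minimal sketch, with `lam_neg`/`lam_pos` standing in for the weighting hyperparameters:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two noise predictions."""
    return float(np.mean((a - b) ** 2))

def bi_erasing_loss(pred_neg, target_neg, pred_pos, target_pos,
                    lam_neg=1.0, lam_pos=1.0):
    """Weighted sum of the branch-wise MSE losses described above.
    lam_neg / lam_pos play the role of the weighting hyperparameters."""
    return lam_neg * mse(pred_neg, target_neg) + lam_pos * mse(pred_pos, target_pos)
```

In the paper's setup, only the projector and UNet branch adapters receive gradients from this loss; the reference model stays frozen.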
3. Mask-Based Filtering and Concept Localization
Mask-based filtering targets only content-relevant regions during training, mitigating background distractions and forcing the vision encoder and cross-attention modules to focus on the intended concept. Semantic masks are generated using CLIP-guided segmentation, applied to both negative and positive images. Masked images, retaining only the essential pixels for the target concept, are encoded before projection, isolating the supervisory signal and increasing the specificity of the push and pull actions during fine-tuning (Chen et al., 15 Dec 2025).
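The masking rule described above (keep concept pixels, replace everything else with the image's mean color) can be sketched as follows; array shapes and the function name are illustrative:

```python
import numpy as np

def mask_filter(image, mask):
    """Keep only concept-relevant pixels; replace the rest with the
    image's per-channel mean color, as in the masking rule above.

    image: (H, W, 3) float array; mask: (H, W) binary array where 1
    marks concept-related pixels.
    """
    mean_color = image.reshape(-1, 3).mean(axis=0)   # per-channel mean
    m = mask[..., None].astype(image.dtype)          # broadcast mask to channels
    return m * image + (1.0 - m) * mean_color
```

The masked result, not the raw image, is what the frozen CLIP vision encoder sees, so the supervisory signal is confined to the target concept.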
4. Evaluation Protocols and Experimental Results
Bi-Erasing was validated on both safety-related and general-domain erasure tasks:
- Safety Erasure (Nudity)
- Dataset: I2P, 4,703 prompts.
- Metrics: NudeNet detection (for breasts/genitalia), CLIP score (prompt–image alignment), FID (against COCO-10K), and Attack Success Rate (ASR) under adversarial prompting.
- Results:
  | Model      | Nude Detections | FID   | CLIP  | Pre-ASR | Post-ASR |
  |------------|-----------------|-------|-------|---------|----------|
  | SDv1.5     | 507             | 14.75 | 0.315 | 82.4%   | 95.8%    |
  | Bi-Erasing | 80              | 18.46 | 0.304 | 15.3%   | 62.7%    |
  | SalUn      | –               | 33.60 | –     | –       | –        |
Bi-Erasing achieves an 84% reduction in nude detections relative to the baseline, outperforming all primary baselines in the FID ↔ erasure trade-off (Chen et al., 15 Dec 2025).
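The reported 84% figure follows directly from the detection counts in the table:

```python
# Relative reduction in NudeNet detections, from the table above.
baseline, erased = 507, 80
reduction = 1 - erased / baseline
print(f"{reduction:.1%}")   # prints "84.2%", reported as 84%
```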
- General-Domain (Celebrity & Artist-Style)
- Metrics: harmonic mean of erasure accuracy and specificity (celebrity); harmonic mean of CLIP-based erasure and specificity scores (artist-style).
- Results: combining Bi-Erasing with MACE improves the harmonic-mean score over MACE alone in the one-celebrity setting, maintains high specificity even at the maximum category count, and yields clear improvements in artist-style removal.
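The harmonic-mean form of these metrics can be computed as below; the argument names are illustrative stand-ins for the erasure and specificity scores:

```python
def harmonic_mean(erasure, specificity):
    """Harmonic mean of an erasure score and a specificity score, the
    form of metric used for the celebrity and artist-style evaluations.
    Both inputs are assumed to lie in [0, 1]."""
    if erasure + specificity == 0:
        return 0.0
    return 2 * erasure * specificity / (erasure + specificity)
```

The harmonic mean penalizes imbalance: a method that erases perfectly but destroys unrelated concepts (low specificity) scores poorly, which is exactly the trade-off these benchmarks probe.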
Qualitative examination under adversarial prompting demonstrates the stability of Bi-Erasing: one-sided methods frequently yield unsafe or stylistically divergent outputs, while Bi-Erasing generates consistent, safe, and coherent results (Chen et al., 15 Dec 2025).
5. Component Analysis and Ablation Studies
Ablation studies isolate key contributions:
- Component Impact (Table 5):
- Text-only erasure: ASR = 0.48, CLIP = 27.47, FID = 18.18.
- Text + bidirectional image (BI): ASR = 0.32, CLIP = 26.19, FID = 18.92.
- BI only: ASR = 0.90, CLIP = 29.24, FID = 17.58 (poor erasure when only image guidance is used).
- BI + Mask: ASR = 0.86, CLIP = 29.03, FID = 16.97.
- Full (Text + BI + Mask): ASR = 0.18, CLIP = 25.40, FID = 18.46.
- Data Regime and Weighting:
- Expanding training pairs from 10 to 200 reduces ASR from 0.436 to 0.180, with diminishing returns beyond 100 examples.
- Varying push/pull weights clarifies the safety–fidelity trade-off; dynamic scheduling that gradually increases the pull (positive) strength outperforms any fixed setting.
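One way to realize the dynamic scheduling described above is a simple ramp on the pull weight; the linear shape and the endpoint values are illustrative assumptions, not the paper's exact rule:

```python
def pull_weight(step, total_steps, w_start=0.1, w_end=1.0):
    """Ramp the positive (pull) branch weight over training, so early
    steps emphasize suppression and later steps emphasize constructive
    guidance. The linear shape is an assumption for illustration."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return w_start + (w_end - w_start) * frac
```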
6. Implementation and Computational Considerations
- Base Model: Stable Diffusion v1.5 UNet; 50 inference steps; roughly 4 seconds per image on an RTX 4090 GPU.
- Training Setup: two RTX 4090 GPUs; convergence within approximately 40 minutes.
- Frozen Components: CLIP vision encoder and the reference UNet.
- Learned Modules: the image-embedding projector and the UNet branch adapters.
- Mask Generation: performed offline with a CLIP-based segmentation model.
- Parameter Scheduling: adaptive adjustment of the branch weighting hyperparameters maintains loss stability throughout training (Chen et al., 15 Dec 2025).
7. Contributions, Limitations, and Future Directions
Bi-Erasing advances concept erasure by resolving issues of over-suppression, drift, and content void arising from unidirectional approaches. It introduces simultaneous negative and positive image-guided branches, mask-based filtering for spatial specificity, and dynamic weighting for adaptive supervision, demonstrating superior trade-offs across safety and general-domain tasks. Primary limitations include manual selection of positive guidance samples and the need for automated concept pairing. Extending the push–pull paradigm to multi-modal erasure (e.g., video, 3D) and refining adaptive weighting mechanisms remain important future challenges (Chen et al., 15 Dec 2025).