PromptSan-Modify: Inference-Time Prompt Sanitization
- PromptSan-Modify is an inference-time prompt sanitization technique that identifies and modifies harmful tokens for safe text-to-image generation.
- It employs gradient-based token importance scoring and iterative embedding updates to reduce NSFW risks without altering the core model.
- Evaluations demonstrate a significant drop in unsafe content while maintaining semantic and visual quality in generated images.
PromptSan-Modify is an inference-time prompt sanitization technique designed for safe text-to-image (T2I) generation, specifically targeting the mitigation of harmful content such as pornography, violence, and discriminatory imagery. The method leverages an integrated non-safe-for-work (NSFW) classifier to identify and selectively modify only the most harmful components of a user’s natural language input, thereby reducing the risks of misusing T2I models without altering model weights or fundamentally degrading generation performance (2506.18325).
1. Methodology and Core Workflow
PromptSan-Modify operates as an interposed, runtime procedure between the user’s input and the T2I model’s text encoder. Given an input prompt $P = (t_1, t_2, \dots, t_n)$ (where $t_i$ are tokens):
- Feature Extraction: The text encoder maps $P$ to a sequence of token embeddings $E = (e_1, e_2, \dots, e_n)$.
- Harmfulness Classification: A pre-trained binary NSFW classifier $f$ evaluates $E$, outputting the probability $p = f(E)$ that the prompt leads to harmful content.
- Threshold Check: If $p$ exceeds a predefined safety threshold $\tau$, sanitization begins.
- Token Importance Scoring: For each token embedding $e_i$, the magnitude of its gradient with respect to the NSFW classification loss,
$$s_i = \left\lVert \nabla_{e_i} \mathcal{L}\big(f(E)\big) \right\rVert,$$
where $\mathcal{L}$ denotes the classification loss, quantifies the contribution of $t_i$ to the harmfulness prediction.
- Harmful Token Selection: Tokens are ranked by $s_i$, and a top-$k$ selection (controlled by a nucleus sampling ratio) identifies the most problematic tokens.
- Iterative Embedding Modification: For the selected tokens, embeddings are updated via gradient descent:
$$e_i \leftarrow e_i - \eta \, m_i \, \nabla_{e_i} \mathcal{L}\big(f(E)\big),$$
where $\eta$ is the step size and $m_i \in \{0,1\}$ is a binary mask indicating tokens under modification. The update is repeated for up to $N$ iterations or until $f(E) \le \tau$.
This approach locally edits only the semantic representations of harmful tokens, thereby pushing the prompt from an NSFW-predicted region to a safe region of the classifier’s feature space.
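To make the loop concrete, here is a minimal PyTorch sketch under stated assumptions: `nsfw_clf` is a differentiable classifier mapping an `(n_tokens, dim)` embedding matrix to a scalar NSFW probability, and that probability itself serves as the surrogate for the classification loss $\mathcal{L}$. All function names and hyperparameter values are illustrative, not taken from the paper.

```python
import torch

def sanitize_embeddings(emb, nsfw_clf, tau=0.5, eta=0.1, top_ratio=0.2, max_iters=50):
    """Gradient-guided editing of the most harmful token embeddings (sketch).

    emb      -- (n_tokens, dim) prompt embeddings from the text encoder
    nsfw_clf -- assumed differentiable module: embeddings -> scalar P(NSFW)
    """
    emb = emb.clone().detach().requires_grad_(True)

    # Threshold check: leave prompts already judged safe untouched.
    p = nsfw_clf(emb)
    if p.item() <= tau:
        return emb.detach()

    # Token importance scoring: per-token gradient magnitude of the NSFW score.
    p.backward()
    scores = emb.grad.norm(dim=-1)
    k = max(1, int(top_ratio * emb.shape[0]))
    mask = torch.zeros(emb.shape[0], device=emb.device)
    mask[scores.topk(k).indices] = 1.0      # binary mask m_i over the top-k tokens

    # Iterative embedding modification, restricted to the masked tokens.
    for _ in range(max_iters):
        emb.grad = None
        p = nsfw_clf(emb)
        if p.item() <= tau:                 # classifier now predicts "safe"
            break
        p.backward()
        with torch.no_grad():
            emb -= eta * mask.unsqueeze(-1) * emb.grad
    return emb.detach()
```

Using the NSFW probability directly as the descent objective is one plausible reading of the classification loss; a cross-entropy loss toward the "safe" label would yield the same descent direction up to a positive scale factor.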
2. Integration and Implementation Characteristics
PromptSan-Modify is designed for minimal disruption to model architectures:
- No Model Retraining: It requires neither parameter updates to the T2I/diffusion model nor any modification of the generation pipeline beyond the insertion of a prompt-modifying filter.
- Modular Inference-Time Filtering: The sanitization is performed during prompt processing just prior to image generation, making it suitable for deployment as a plug-in filter in existing T2I services.
- Classifier-Agnostic: Any sufficiently accurate text-based NSFW classifier can serve as $f$, provided its output probability is differentiable with respect to the token embeddings.
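As an illustration of this plug-in placement, the hypothetical sketch below inserts the sanitizer between prompt encoding and generation in a `diffusers` Stable Diffusion pipeline, reusing `sanitize_embeddings` from the sketch above; the hook point and helper name are assumptions for illustration, not part of the published method.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

def generate_safely(prompt, nsfw_clf, **gen_kwargs):
    # Encode the prompt the same way the pipeline's text encoder would.
    ids = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids
    with torch.no_grad():
        emb = pipe.text_encoder(ids)[0].squeeze(0)   # (n_tokens, dim)

    # Sanitize, then pass the edited embeddings in place of the raw prompt.
    safe_emb = sanitize_embeddings(emb, nsfw_clf)
    return pipe(prompt_embeds=safe_emb.unsqueeze(0), **gen_kwargs).images[0]
```

Because the pipeline accepts precomputed `prompt_embeds`, no change to the diffusion model or scheduler is required.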
3. Effectiveness and Evaluation Metrics
The efficacy of PromptSan-Modify is demonstrated using several quantitative and qualitative metrics:
- NSFW Detection on I2P Dataset: The primary metric is the total number of exposed or unsafe body-part detections (via NudeNet) across images generated from I2P prompts. PromptSan-Modify achieves a total of 43 detections, compared to 659 for Stable Diffusion v1.4 under identical prompts, a substantial reduction in harmful content.
- Semantic and Visual Quality: On COCO-30k, FID (Fréchet Inception Distance) and CLIP similarity are measured to ensure that modification does not degrade generation quality. PromptSan-Modify preserves both semantic similarity and diversity of generated images, with negligible change in FID and CLIP scores.
- Safety/Usability Balance: The design preserves “safe” prompt content and does not simply block or censor input, reducing the risk of over-sanitization.
| Method | NSFW Detections | FID (COCO) | CLIP Score (COCO) |
|---|---|---|---|
| Baseline SD v1.4 | 659 | (baseline) | (baseline) |
| PromptSan-Modify | 43 | ~no change | ~no change |
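For reference, below is a sketch of how such detection counts can be tallied with the third-party NudeNet detector; the exact return schema of `detect` varies across NudeNet releases, so the dictionary keys here are an assumption based on recent versions.

```python
from nudenet import NudeDetector

detector = NudeDetector()

def total_unsafe_detections(image_paths, min_score=0.5):
    """Sum NudeNet detections over a set of generated images (sketch)."""
    total = 0
    for path in image_paths:
        # detect() is assumed to return a list of {"class", "score", "box"} dicts.
        detections = detector.detect(path)
        total += sum(1 for d in detections if d["score"] >= min_score)
    return total
```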
4. Comparison with Alternative Approaches
PromptSan-Modify is contrasted with several classes of interventions:
- Data Cleansing / Fine-Tuning: These require retraining on curated datasets and may reduce model flexibility, as well as increase deployment complexity.
- Blocking or Hard Filtering: Rejecting entire prompts disrupts user experience and can lead to high rates of false positives, reducing creative applicability.
- Suffix-Based Mitigation (PromptSan-Suffix): Appends a learned sequence to prompts for neutralization. While also effective, this approach is less fine-grained and may influence broader aspects of generation semantics.
PromptSan-Modify’s targeted, token-level editing maintains more of the original intent and semantic richness, ensuring both safety and high-quality output.
5. Ethical, Practical, and Deployment Considerations
PromptSan-Modify addresses a core trade-off in T2I systems, namely the need to prevent harmful or unethical outputs (e.g., violent, pornographic, discriminatory content) without undermining the model's generative capabilities:
- Ethical Alignment: By iteratively removing only what is identified as harmful and leaving the remainder untouched, user creativity and expression are preserved alongside robust safety measures.
- No Invasive Modifications: The methodology acts only at input representation, minimizing technical and operational integration costs.
- Runtime Safeguard: The classification/gradient-driven loop can be executed efficiently as part of the input preprocessing stage, supporting real-time or production-scale deployment scenarios.
6. Limitations and Extensions
- Classifier Dependency: The effectiveness and selectivity of PromptSan-Modify are limited by the accuracy and scope of the underlying text NSFW classifier. A weak or poorly calibrated classifier may miss edge cases or over-sanitize benign prompts.
- Potential for Adversarial Prompts: The method can be bypassed if the prompt is sufficiently obfuscated or adversarial, at which point improvements in the classifier or additional detection layers may be necessary.
- Non-Textual Risks: While the approach addresses harmful semantics presented in text, imagery resulting from ambiguous prompts or multimodal cues may still require downstream image-level filtering as a complement.
- Computational Overhead: Iterative gradient computation per prompt introduces marginal overhead, but with efficient implementations, this remains practical at inference time for most applications.
7. Significance and Prospects
PromptSan-Modify represents a practical, classifier-guided, inference-time prompt sanitization technique for safeguarding content in modern text-to-image generation pipelines. It advances the state of the art in balancing content moderation with creative flexibility, offering modular deployment for real-world open-ended T2I services. Its integration of classifier-driven, token-level, gradient-informed editing marks a substantive step towards ethical, usable generative AI (2506.18325).