PromptSan-Modify: Inference-Time Prompt Sanitization
- PromptSan-Modify is an inference-time prompt sanitization technique that identifies and modifies harmful tokens for safe text-to-image generation.
- It employs gradient-based token importance scoring and iterative embedding updates to reduce NSFW risks without altering the core model.
- Evaluations demonstrate a significant drop in unsafe content while maintaining semantic and visual quality in generated images.
PromptSan-Modify is an inference-time prompt sanitization technique designed for safe text-to-image (T2I) generation, specifically targeting the mitigation of harmful content such as pornography, violence, and discriminatory imagery. The method leverages an integrated non-safe-for-work (NSFW) classifier to identify and selectively modify only the most harmful components of a user’s natural language input, thereby reducing the risks of misusing T2I models without altering model weights or fundamentally degrading generation performance (2506.18325).
1. Methodology and Core Workflow
PromptSan-Modify operates as an interposed, runtime procedure between the user’s input and the T2I model’s text encoder. Given an input prompt $P = (t_1, t_2, \dots, t_n)$ (where $t_i$ are tokens):
- Feature Extraction: The text encoder maps $P$ to a sequence of token embeddings $E = (e_1, e_2, \dots, e_n)$.
- Harmfulness Classification: A pre-trained binary NSFW classifier $f$ evaluates $E$, outputting the probability $p = f(E)$ that the prompt leads to harmful content.
- Threshold Check: If $p$ exceeds a predefined safety threshold $\tau$, sanitization begins.
- Token Importance Scoring: For each token embedding $e_i$, the magnitude of its gradient with respect to the NSFW classification loss,
$$s_i = \left\lVert \nabla_{e_i} \mathcal{L}\big(f(E)\big) \right\rVert,$$
where $\mathcal{L}$ denotes the classification loss, quantifies the contribution of $t_i$ to the harmfulness prediction.
- Harmful Token Selection: Tokens are ranked by $s_i$, and a top-$k$ selection (controlled by a nucleus sampling ratio) identifies the most problematic tokens.
- Iterative Embedding Modification: For the selected tokens, embeddings are updated via gradient descent:
$$e_i \leftarrow e_i - \eta \, m_i \, \nabla_{e_i} \mathcal{L}\big(f(E)\big),$$
where $\eta$ is the step size and $m_i \in \{0,1\}$ is a binary mask indicating tokens under modification. The update is repeated for up to $N$ iterations or until $f(E) \le \tau$.
This approach locally edits only the semantic representations of harmful tokens, thereby pushing the prompt from an NSFW-predicted region to a safe region of the classifier’s feature space.
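To make the loop concrete, here is a minimal PyTorch sketch under stated assumptions: `nsfw_clf` is a differentiable classifier mapping an `(n_tokens, dim)` embedding matrix to a scalar NSFW probability, and that probability itself serves as the surrogate for the classification loss $\mathcal{L}$. All function names and hyperparameter values are illustrative, not taken from the paper.

```python
import torch

def sanitize_embeddings(emb, nsfw_clf, tau=0.5, eta=0.1, top_ratio=0.2, max_iters=50):
    """Gradient-guided editing of the most harmful token embeddings (sketch).

    emb      -- (n_tokens, dim) prompt embeddings from the text encoder
    nsfw_clf -- assumed differentiable module: embeddings -> scalar P(NSFW)
    """
    emb = emb.clone().detach().requires_grad_(True)

    # Threshold check: leave prompts already judged safe untouched.
    p = nsfw_clf(emb)
    if p.item() <= tau:
        return emb.detach()

    # Token importance scoring: per-token gradient magnitude of the NSFW score.
    p.backward()
    scores = emb.grad.norm(dim=-1)
    k = max(1, int(top_ratio * emb.shape[0]))
    mask = torch.zeros(emb.shape[0], device=emb.device)
    mask[scores.topk(k).indices] = 1.0      # binary mask m_i over the top-k tokens

    # Iterative embedding modification, restricted to the masked tokens.
    for _ in range(max_iters):
        emb.grad = None
        p = nsfw_clf(emb)
        if p.item() <= tau:                 # classifier now predicts "safe"
            break
        p.backward()
        with torch.no_grad():
            emb -= eta * mask.unsqueeze(-1) * emb.grad
    return emb.detach()
```

Using the NSFW probability directly as the descent objective is one plausible reading of the classification loss; a cross-entropy loss toward the "safe" label would yield the same descent direction up to a positive scale factor.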
2. Integration and Implementation Characteristics
PromptSan-Modify is designed for minimal disruption to model architectures:
- No Model Retraining: It requires neither parameter updates to the T2I/diffusion model nor any modification of the generation pipeline beyond the insertion of a prompt-modifying filter.
- Modular Inference-Time Filtering: The sanitization is performed during prompt processing just prior to image generation, making it suitable for deployment as a plug-in filter in existing T2I services.
- Classifier-Agnostic: Any sufficiently accurate text-based NSFW classifier can serve as $f$, provided its output probability is differentiable with respect to the token embeddings.
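As an illustration of this plug-in placement, the hypothetical sketch below inserts the sanitizer between prompt encoding and generation in a `diffusers` Stable Diffusion pipeline, reusing `sanitize_embeddings` from the sketch above; the hook point and helper name are assumptions for illustration, not part of the published method.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

def generate_safely(prompt, nsfw_clf, **gen_kwargs):
    # Encode the prompt the same way the pipeline's text encoder would.
    ids = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids
    with torch.no_grad():
        emb = pipe.text_encoder(ids)[0].squeeze(0)   # (n_tokens, dim)

    # Sanitize, then pass the edited embeddings in place of the raw prompt.
    safe_emb = sanitize_embeddings(emb, nsfw_clf)
    return pipe(prompt_embeds=safe_emb.unsqueeze(0), **gen_kwargs).images[0]
```

Because the pipeline accepts precomputed `prompt_embeds`, no change to the diffusion model or scheduler is required.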
3. Effectiveness and Evaluation Metrics
The efficacy of PromptSan-Modify is demonstrated using several quantitative and qualitative metrics:
- NSFW Detection on I2P Dataset: The primary metric is the total number of exposed or unsafe body-part detections (via NudeNet) across images generated from I2P prompts. PromptSan-Modify achieves a total of 43 detections, compared to 659 for Stable Diffusion v1.4 under identical prompts, a substantial reduction in harmful content.
- Semantic and Visual Quality: On COCO-30k, FID (Fréchet Inception Distance) and CLIP similarity are measured to ensure that modification does not degrade generation quality. PromptSan-Modify preserves both semantic similarity and diversity of generated images, with negligible change in FID and CLIP scores.
- Safety/Usability Balance: The design preserves “safe” prompt content and does not simply block or censor input, reducing the risk of over-sanitization.
| Method | NSFW Detections | FID (COCO) | CLIP Score (COCO) |
|---|---|---|---|
| Baseline SD v1.4 | 659 | (baseline) | (baseline) |
| PromptSan-Modify | 43 | ~no change | ~no change |
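For reference, below is a sketch of how such detection counts can be tallied with the third-party NudeNet detector; the exact return schema of `detect` varies across NudeNet releases, so the dictionary keys here are an assumption based on recent versions.

```python
from nudenet import NudeDetector

detector = NudeDetector()

def total_unsafe_detections(image_paths, min_score=0.5):
    """Sum NudeNet detections over a set of generated images (sketch)."""
    total = 0
    for path in image_paths:
        # detect() is assumed to return a list of {"class", "score", "box"} dicts.
        detections = detector.detect(path)
        total += sum(1 for d in detections if d["score"] >= min_score)
    return total
```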
4. Comparison with Alternative Approaches
PromptSan-Modify is contrasted with several classes of interventions:
- Data Cleansing / Fine-Tuning: These require retraining on curated datasets and may reduce model flexibility, as well as increase deployment complexity.
- Blocking or Hard Filtering: Rejecting entire prompts disrupts user experience and can lead to high rates of false positives, reducing creative applicability.
- Suffix-Based Mitigation (PromptSan-Suffix): Appends a learned sequence to prompts for neutralization. While also effective, this approach is less fine-grained and may influence broader aspects of generation semantics.
PromptSan-Modify’s targeted, token-level editing maintains more of the original intent and semantic richness, ensuring both safety and high-quality output.
5. Ethical, Practical, and Deployment Considerations
PromptSan-Modify addresses a core trade-off in T2I systems, namely the need to prevent harmful or unethical outputs (e.g., violent, pornographic, discriminatory content) without undermining the model's generative capabilities:
- Ethical Alignment: By iteratively removing only what is identified as harmful and leaving the remainder untouched, user creativity and expression are preserved alongside robust safety measures.
- No Invasive Modifications: The methodology acts only at input representation, minimizing technical and operational integration costs.
- Runtime Safeguard: The classification/gradient-driven loop can be executed efficiently as part of the input preprocessing stage, supporting real-time or production-scale deployment scenarios.
6. Limitations and Extensions
- Classifier Dependency: The effectiveness and selectivity of PromptSan-Modify are limited by the accuracy and scope of the underlying text NSFW classifier. A weak or poorly calibrated classifier may miss edge cases or over-sanitize benign prompts.
- Potential for Adversarial Prompts: The method can be bypassed if the prompt is sufficiently obfuscated or adversarial, at which point improvements in the classifier or additional detection layers may be necessary.
- Non-Textual Risks: While the approach addresses harmful semantics presented in text, imagery resulting from ambiguous prompts or multimodal cues may still require downstream image-level filtering as a complement.
- Computational Overhead: Iterative gradient computation per prompt introduces marginal overhead, but with efficient implementations, this remains practical at inference time for most applications.
7. Significance and Prospects
PromptSan-Modify represents a practical, classifier-guided, inference-time prompt sanitization technique for safeguarding content in modern text-to-image generation pipelines. It advances the state of the art in balancing content moderation with creative flexibility, offering modular deployment for real-world open-ended T2I services. Its integration of classifier-driven, token-level, gradient-informed editing marks a substantive step towards ethical, usable generative AI (2506.18325).