
PromptSan-Modify: Inference-Time Prompt Sanitization

Updated 8 July 2025
  • PromptSan-Modify is an inference-time prompt sanitization technique that identifies and modifies harmful tokens for safe text-to-image generation.
  • It employs gradient-based token importance scoring and iterative embedding updates to reduce NSFW risks without altering the core model.
  • Evaluations demonstrate a significant drop in unsafe content while maintaining semantic and visual quality in generated images.

PromptSan-Modify is an inference-time prompt sanitization technique designed for safe text-to-image (T2I) generation, specifically targeting the mitigation of harmful content such as pornography, violence, and discriminatory imagery. The method leverages an integrated non-safe-for-work (NSFW) classifier to identify and selectively modify only the most harmful components of a user’s natural language input, thereby reducing the risks of misusing T2I models without altering model weights or fundamentally degrading generation performance (2506.18325).

1. Methodology and Core Workflow

PromptSan-Modify operates as an interposed, runtime procedure between the user’s input and the T2I model’s text encoder. Given an input prompt $T = (w_1, w_2, \ldots, w_n)$, where the $w_i$ are tokens:

  1. Feature Extraction: The text encoder $E_{\mathrm{text}}$ maps $T$ to an embedding space.
  2. Harmfulness Classification: A pre-trained binary NSFW classifier $C_{\mathrm{text}}$ evaluates $E_{\mathrm{text}}(T)$, outputting the probability $p$ that the prompt leads to harmful content.
  3. Threshold Check: If $p$ exceeds a predefined safety threshold $\gamma$, sanitization begins.
  4. Token Importance Scoring: For each token embedding $e_i = E_{\mathrm{text}}(w_i)$, the magnitude of its gradient with respect to the NSFW classification loss,

$$g_i = \left\| \nabla_{e_i} L_{\mathrm{text}}(T) \right\|_{\infty},$$

where $L_{\mathrm{text}}(T) = -\log\!\left(1 - C_{\mathrm{text}}(E_{\mathrm{text}}(T))\right)$, quantifies the contribution of $w_i$ to the harmfulness prediction.

  5. Harmful Token Selection: Tokens are ranked by $g_i$, and a top-$p$ selection (controlled by a nucleus sampling ratio) identifies the most problematic tokens.
  6. Iterative Embedding Modification: For the selected tokens, embeddings are updated via gradient descent:

$$e_j^{(t+1)} = e_j^{(t)} - \eta \left( \nabla_{e_j} L_{\mathrm{text}}(T) \cdot M_j \right),$$

where $\eta$ is the step size and $M$ is a binary mask marking the tokens under modification. The update is repeated for up to $N$ iterations or until $C_{\mathrm{text}}(E_{\mathrm{text}}(T^{(t)})) < \gamma$.

This approach locally edits only the semantic representations of harmful tokens, thereby pushing the prompt from an NSFW-predicted region to a safe region of the classifier’s feature space.
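The full loop can be made concrete in a few dozen lines. The following is a minimal PyTorch sketch under stated assumptions: `nsfw_classifier` is assumed to be a differentiable module mapping a batch of per-token embeddings to a harmfulness probability, and the hyperparameter defaults ($\gamma$, top-$p$ ratio, $\eta$, $N$) are illustrative placeholders, not values from the paper.

```python
import torch

def promptsan_modify(token_embeddings: torch.Tensor,
                     nsfw_classifier: torch.nn.Module,
                     gamma: float = 0.5,    # safety threshold (illustrative)
                     top_p: float = 0.3,    # nucleus ratio (illustrative)
                     eta: float = 0.1,      # step size (illustrative)
                     max_iters: int = 20) -> torch.Tensor:
    """Sanitize one prompt's token embeddings E_text(T), shape (seq_len, dim)."""
    e = token_embeddings.clone().detach().requires_grad_(True)

    # Steps 2-3: harmfulness classification and threshold check.
    p = nsfw_classifier(e.unsqueeze(0)).squeeze()
    if p.item() < gamma:
        return e.detach()  # already predicted safe; leave the prompt untouched

    # Step 4: token importance g_i = || dL_text / de_i ||_inf,
    # with L_text(T) = -log(1 - C_text(E_text(T))).
    loss = -torch.log(1.0 - p)
    grads, = torch.autograd.grad(loss, e)
    g = grads.abs().amax(dim=-1)

    # Step 5: top-p (nucleus) selection of the most harmful tokens.
    scores, order = torch.sort(g / g.sum(), descending=True)
    cutoff = int((scores.cumsum(0) < top_p).sum().item()) + 1
    mask = torch.zeros_like(g)
    mask[order[:cutoff]] = 1.0  # binary mask M over selected tokens

    # Step 6: iterative gradient-descent updates of the selected embeddings,
    # until the classifier flips to "safe" or N iterations elapse.
    for _ in range(max_iters):
        p = nsfw_classifier(e.unsqueeze(0)).squeeze()
        if p.item() < gamma:
            break
        loss = -torch.log(1.0 - p)
        grads, = torch.autograd.grad(loss, e)
        e = (e - eta * grads * mask.unsqueeze(-1)).detach().requires_grad_(True)

    return e.detach()
```

Only the masked rows of the embedding matrix ever move, which is what keeps the edit local to the harmful tokens.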

2. Integration and Implementation Characteristics

PromptSan-Modify is designed for minimal disruption to model architectures:

  • No Model Retraining: It requires neither parameter updates to the T2I/diffusion model nor any modification of the generation pipeline beyond the insertion of a prompt-modifying filter.
  • Modular Inference-Time Filtering: The sanitization is performed during prompt processing just prior to image generation, making it suitable for deployment as a plug-in filter in existing T2I services.
  • Classifier-Agnostic: Any sufficiently accurate text-based NSFW classifier can serve as $C_{\mathrm{text}}$, provided its output probability is differentiable with respect to the token embeddings.
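As a hypothetical illustration of this plug-in deployment, the sketch below wires the `promptsan_modify` function from Section 1 in front of a Hugging Face `diffusers` Stable Diffusion pipeline; the NSFW-classifier checkpoint path is a placeholder.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
nsfw_classifier = torch.load("nsfw_text_classifier.pt")  # placeholder checkpoint

def generate_safely(prompt: str):
    # Encode the prompt exactly as the pipeline itself would.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
    embeds = pipe.text_encoder(tokens.input_ids)[0].squeeze(0)  # (seq, dim)

    # Sanitize harmful token embeddings, then hand the result straight to the
    # unmodified generation pipeline via its prompt_embeds entry point.
    safe_embeds = promptsan_modify(embeds, nsfw_classifier)
    return pipe(prompt_embeds=safe_embeds.unsqueeze(0)).images[0]
```

Because the pipeline accepts precomputed prompt embeddings, no diffusion-side code changes are needed; the filter is a pure pre-processing stage.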

3. Effectiveness and Evaluation Metrics

The efficacy of PromptSan-Modify is demonstrated using several quantitative and qualitative metrics:

  • NSFW Detection on I2P Dataset: The primary metric is the total number of unsafe body-part detections (via NudeNet) in generated images. PromptSan-Modify yields 43 detections in total, compared to 659 for Stable Diffusion v1.4 under identical prompts, demonstrating a substantial reduction in harmful content.
  • Semantic and Visual Quality: On COCO-30k, Fréchet Inception Distance (FID) and CLIP similarity are measured to verify that modification does not degrade generation quality. PromptSan-Modify preserves both the semantic similarity and the diversity of generated images, with negligible change in FID and CLIP scores.
  • Safety/Usability Balance: The design preserves “safe” prompt content and does not simply block or censor input, reducing the risk of over-sanitization.
Method              NSFW Detections (I2P)   FID (COCO-30k)   CLIP Score (COCO-30k)
Baseline SD v1.4    659                     (baseline)       (baseline)
PromptSan-Modify    43                      ~no change       ~no change
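For readers reproducing these metrics, a hedged sketch of the measurement side follows. It uses the NudeNet and torchmetrics packages; note that, unlike the paper's protocol, it counts every NudeNet detection rather than a curated subset of unsafe classes.

```python
from nudenet import NudeDetector
from torchmetrics.multimodal.clip_score import CLIPScore

detector = NudeDetector()
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def evaluate(image_paths, prompts, image_tensors):
    # Total body-part detections across generated images (NudeNet); the paper
    # restricts the count to unsafe classes, which this sketch does not.
    detections = sum(len(detector.detect(path)) for path in image_paths)

    # Mean CLIP similarity between images (uint8 NCHW tensor) and prompts.
    clip = clip_metric(image_tensors, prompts).item()
    return detections, clip
```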

4. Comparison with Alternative Approaches

PromptSan-Modify is contrasted with several classes of interventions:

  • Data Cleansing / Fine-Tuning: These require retraining on curated datasets, which can reduce model flexibility and increase deployment complexity.
  • Blocking or Hard Filtering: Rejecting entire prompts disrupts user experience and can lead to high rates of false positives, reducing creative applicability.
  • Suffix-Based Mitigation (PromptSan-Suffix): Appends a learned sequence to prompts for neutralization. While also effective, this approach is less fine-grained and may influence broader aspects of generation semantics.

PromptSan-Modify’s targeted, token-level editing maintains more of the original intent and semantic richness, ensuring both safety and high-quality output.

5. Ethical, Practical, and Deployment Considerations

PromptSan-Modify addresses a core trade-off in T2I systems: the need to prevent harmful or unethical outputs (e.g., violent, pornographic, discriminatory content) without undermining the generative capabilities of the model:

  • Ethical Alignment: By iteratively removing only what is identified as harmful and leaving the remainder untouched, user creativity and expression are preserved alongside robust safety measures.
  • No Invasive Modifications: The methodology acts only at input representation, minimizing technical and operational integration costs.
  • Runtime Safeguard: The classification/gradient-driven loop can be executed efficiently as part of the input preprocessing stage, supporting real-time or production-scale deployment scenarios.

6. Limitations and Extensions

  • Classifier Dependency: The effectiveness and selectivity of PromptSan-Modify are limited by the accuracy and scope of the underlying text NSFW classifier. A weak or poorly calibrated classifier may miss edge cases or over-sanitize benign prompts.
  • Potential for Adversarial Prompts: The method can be bypassed by sufficiently obfuscated or adversarial prompts, in which case classifier improvements or additional detection layers may be necessary.
  • Non-Textual Risks: While the approach addresses harmful semantics presented in text, imagery resulting from ambiguous prompts or multimodal cues may still require downstream image-level filtering as a complement.
  • Computational Overhead: Iterative gradient computation introduces extra per-prompt latency; with efficient implementations, however, the method remains practical at inference time for most applications.

7. Significance and Prospects

PromptSan-Modify represents a practical, classifier-guided, inference-time prompt sanitization technique for safeguarding content in modern text-to-image generation pipelines. It advances the state of the art in balancing content moderation with creative flexibility, offering modular deployment for real-world open-ended T2I services. Its integration of classifier-driven, token-level, gradient-informed editing marks a substantive step towards ethical, usable generative AI (2506.18325).

References

  1. arXiv:2506.18325