Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models (2311.16254v3)

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

An Analysis of Safe-CLIP: Mitigating NSFW Concepts in Vision-and-Language Models

The research paper "Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models" introduces a method for enhancing the safety of vision-and-language models by reducing their sensitivity to Not Safe for Work (NSFW) content. This advancement is particularly pertinent given the increasing deployment of these models in sensitive applications where inappropriate or biased behavior is unacceptable. CLIP (Contrastive Language–Image Pretraining) models, which are among the most widely used vision-and-language models, are typically trained on vast amounts of web-sourced data, inherently risking the incorporation of NSFW and biased content. This research addresses the issue through a targeted fine-tuning approach.

The paper presents a systematic methodology for sanitizing CLIP-like models so that they become invariant to inappropriate content without significantly altering their inherent expressive capabilities. The authors propose a novel dataset, ViSU, containing safe and unsafe image-text pairs, synthesized by fine-tuning a large language model to generate NSFW textual data. This dataset serves as the foundation for a multi-modal fine-tuning process with specifically designed loss functions that guide the model to ignore inappropriate content while preserving the structure of the original CLIP embedding space.

Methodological Framework

The approach is centered on using generated NSFW content to fine-tune CLIP's embedding space. The methodology involves:

  • Data Generation: The creation of ViSU, a large dataset of safe-unsafe pairs, facilitated by a fine-tuned LLM that produces NSFW content by transforming safe inputs into their inappropriate counterparts. This is achieved through a novel Direct Preference Optimization process that carefully aligns unsafe content with the source context while maximizing semantic similarity.
  • Embedding Space Fine-tuning: A combination of inappropriate-content redirection losses and structure-preservation losses is applied during the fine-tuning phase. This ensures that while the model's sensitivity to NSFW content is mitigated, its capacity to handle safe inputs remains intact (a minimal sketch of this loss combination follows the list).
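
The sketch below is only an illustration of how redirection and preservation terms can be combined, under the assumption of a CLIP-like model exposing encode_text and encode_image, a frozen copy of the original encoder, and ViSU-style batches of paired safe/unsafe captions and images. The function names, the cosine formulation, and the loss weights are illustrative assumptions, not the paper's exact objectives.

```python
# Minimal sketch of combining a redirection loss with a structure-preservation loss
# (illustrative only; not the paper's exact formulation).
import torch
import torch.nn.functional as F

def cosine_loss(a, b):
    """1 - cosine similarity, averaged over the batch."""
    return (1 - F.cosine_similarity(a, b, dim=-1)).mean()

def safety_finetune_loss(model, frozen_model,
                         safe_txt, unsafe_txt, safe_img, unsafe_img,
                         lambda_redirect=1.0, lambda_preserve=1.0):
    # Embeddings from the encoder being fine-tuned.
    z_safe_t = model.encode_text(safe_txt)
    z_unsafe_t = model.encode_text(unsafe_txt)
    z_safe_i = model.encode_image(safe_img)
    z_unsafe_i = model.encode_image(unsafe_img)

    # Reference embeddings from the frozen, original CLIP encoder.
    with torch.no_grad():
        ref_safe_t = frozen_model.encode_text(safe_txt)
        ref_safe_i = frozen_model.encode_image(safe_img)

    # Redirection: pull embeddings of unsafe inputs toward the safe region,
    # approximated here by the frozen embeddings of their safe counterparts.
    redirect = cosine_loss(z_unsafe_t, ref_safe_t) + cosine_loss(z_unsafe_i, ref_safe_i)

    # Structure preservation: keep safe inputs where the original model placed them.
    preserve = cosine_loss(z_safe_t, ref_safe_t) + cosine_loss(z_safe_i, ref_safe_i)

    return lambda_redirect * redirect + lambda_preserve * preserve
```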

Results and Evaluation

The evaluation results underscore the suitability of the Safe-CLIP approach across several application domains, demonstrating efficacy in reducing NSFW content in cross-modal retrieval, text-to-image generation, and image-to-text generation. Notably, Safe-CLIP significantly reduced the retrieval of NSFW material on real-world datasets, outperforming both the original CLIP model and alternative baselines such as a CLIP model trained on DataComp-1B. Similarly, when incorporated into text-to-image generation with Stable Diffusion v1.4, Safe-CLIP reduced the generation of inappropriate images by a notable margin compared to both the baseline and NSFW-specific alternative solutions.
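
Since the paper reports that Safe-CLIP can be plugged into pre-trained generative models, a hedged sketch of such a swap with the diffusers and transformers libraries is shown below. The Safe-CLIP checkpoint identifier is an assumption used as a placeholder; the released model names should be checked in the project repository (https://github.com/aimagelab/safe-clip).

```python
# Hedged sketch: replacing the text encoder of Stable Diffusion v1.4 with a
# Safe-CLIP text encoder. The Safe-CLIP checkpoint id below is an assumption.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

SAFE_CLIP_ID = "aimagelab/safeclip_vit-l_14"  # assumed Hugging Face id; verify in the repo

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)

# SD v1.4 conditions on a CLIP ViT-L/14 text encoder, so a Safe-CLIP text tower of
# the same size can be dropped in without touching the U-Net or VAE.
pipe.text_encoder = CLIPTextModel.from_pretrained(
    SAFE_CLIP_ID, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("a crowded beach at sunset").images[0]
image.save("safe_generation.png")
```

The swap above only changes the conditioning signal; any downstream safety checker or filtering stage in the pipeline would remain untouched.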

Practical Implications and Future Directions

The proposed Safe-CLIP model has profound implications for the deployment of multimodal systems in real-world applications requiring high safety and sensitivity thresholds. By advancing methodologies that guide models away from inappropriate content, the paper paves a path toward more ethical and responsible AI practices.

For future exploration, research could further investigate the scalability of such fine-tuning methodologies across larger datasets and model architectures, as well as explore additional use cases where content moderation is crucial. Moreover, the strategies introduced here could potentially be adapted to mitigate other forms of bias and toxicity, further widening their applicability and impact.

In conclusion, Safe-CLIP represents a significant contribution toward secure and ethically aligned AI systems, offering a practical solution to the growing concern of inappropriate content in large-scale vision-and-language models. It provides a foundational basis for future advances in this critical area of AI safety and ethical standards.

Authors (6)
  1. Samuele Poppi
  2. Tobia Poppi
  3. Federico Cocchi
  4. Marcella Cornia
  5. Lorenzo Baraldi
  6. Rita Cucchiara
Citations (7)