- The paper demonstrates that trivial prompt modifications bypass T2I safety filters, with attack success rates frequently exceeding 70-80%.
- The experiments use simple paraphrasing, misspellings, and Unicode obfuscation on diffusion models like Stable Diffusion.
- The study highlights the urgent need for multimodal, semantically robust safety architectures to enhance content moderation.
Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
Overview
This work addresses the vulnerability of text-to-image (T2I) generative models to prompt-based adversarial attacks that circumvent deployed safety filters. The paper investigates the effectiveness of low-effort, minimally engineered jailbreak strategies specifically targeting the safety mechanisms integrated into diffusion-based T2I models. The focus is on the ease with which end-users, with little technical background or domain expertise, can bypass textual safety filters intended to prevent unsafe or inappropriate image generations.
Background and Motivation
Recent advances in diffusion models and large-scale T2I architectures, including Stable Diffusion [rombach2022high], have enabled highly capable open-access image synthesis pipelines. In response to the frequent synthesis of NSFW, toxic, or otherwise inappropriate images, deployment pipelines typically incorporate safety filters operating on prompt and/or image outputs, often via regular expressions, keyword blacklists, or shallow classification. Despite widespread adoption, the actual robustness of these filtering approaches against motivated adversaries, including casual users, is inadequately studied.
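To make the filtering approaches under discussion concrete, the following is a minimal sketch of the kind of shallow, string-matching prompt filter the paper targets: a keyword blacklist checked with a case-insensitive regular expression. The term list and function names here are illustrative assumptions, not the actual filter shipped with any specific model.

```python
import re

# Illustrative placeholder blacklist; real deployments use much larger lists.
BLOCKED_TERMS = ["nudity", "gore", "violence"]

# Compile one case-insensitive pattern with word boundaries around each term.
_pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, BLOCKED_TERMS)) + r")\b",
    re.IGNORECASE,
)

def is_prompt_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any blacklisted keyword."""
    return _pattern.search(prompt) is not None
```

A filter of this shape is fast and easy to deploy, which explains its popularity, but it matches surface strings rather than meaning.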
Existing literature explores various sophisticated attack avenues, such as semantic obfuscation [yang2024sneakyprompt], persona manipulation, and iterative red-teaming with LLM assistants [jiang2025jailbreaking, dong2024jailbreaking], but this work distinctly emphasizes attacks that do not require technical sophistication, additional tooling, or substantial domain knowledge.
Methodology
The paper systematically evaluates prompt-based attacks constructed with minimal manual effort. Key attack classes assessed include:
- Simple paraphrasing and misspelling to evade keyword matching.
- Basic Unicode homoglyph substitution and string obfuscation, such as zero-width spaces.
- Utilization of benign synonyms or alternative phraseology for filtered concepts.
Empirical evaluation is conducted using popular open-source T2I models and their default prompt safety filters (notably Stable Diffusion's Safety Checker). The authors quantify both the attack success rates (ASR) of low-effort bypasses and the resulting image toxicity, appropriateness, and content quality using established metrics and human evaluation. The experimental protocol is designed to be reproducible with publicly available models and filter configurations.
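The ASR metric can be sketched as follows: the fraction of adversarial prompts that both evade the filter and still yield a policy-violating image. The function signatures and oracles here (`passes_filter`, `generate`, `is_unsafe`) are hypothetical stand-ins for the paper's actual pipeline and evaluation metrics.

```python
def attack_success_rate(prompts, passes_filter, generate, is_unsafe):
    """ASR = fraction of adversarial prompts that (a) evade the safety
    filter AND (b) produce a policy-violating image. Both conditions
    are required: a prompt that slips past the filter but renders a
    benign image does not count as a successful attack."""
    successes = sum(
        1 for p in prompts
        if passes_filter(p) and is_unsafe(generate(p))
    )
    return successes / len(prompts)
```

In practice the `is_unsafe` judgment would come from an image-level classifier or human annotators, as in the paper's evaluation.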
Key Results
The findings demonstrate that trivial prompt modifications are often sufficient to bypass standard text-based safety filters, resulting in the generation of content that breaches intended safety policies.
- Attack success rates achieved with simple modifications are substantial, in many scenarios exceeding 70-80% against filters designed to block harmful or NSFW content.
- These attacks do not require iterative optimization, complex prompt engineering, or the use of auxiliary LLMs.
- Images generated via successful bypasses are frequently indistinguishable in visual quality from those the original blocked prompts would produce, indicating that the filters constrain only superficial textual cues, not the model's underlying capacity to render the restricted content.
The study makes the strong claim that current widespread safety mechanisms for T2I services provide only illusory protection against even unsophisticated adversaries, as low-effort evasion strategies are effective and trivial to implement.
Implications
Practically, these results imply that existing deployments of text-based prompt filters do not meet the minimum security bar necessary for robust content moderation. Relying on surface-level string matching is insufficient; the demonstrated weaknesses are likely to be exploited in the wild even by end-users with only modest motivation to bypass restrictions.
Theoretically, the results highlight gaps between desired and actual model alignment, echoing similar findings in LLM safety [wei2023jailbroken, liu2023jailbreaking]. The lack of semantic robustness in T2I guardrails exposes failure modes that transfer to any content moderation scenario prioritizing speed over semantic depth.
This necessitates the development of multi-modal, context-aware, and semantically grounded safety filters. Defenses may require hybrid architectures leveraging both advanced textual understanding (e.g., robust parsing, paraphrase detection, and entailment modeling) and image-based content moderation, supported by continual adversarial evaluation and red-teaming.
Future Directions
Advancing the robustness of T2I safety mechanisms will require:
- Transitioning toward context- and semantics-driven filtering architectures, possibly integrating LLM-based semantic parsing into safety pipelines.
- Automated adversarial prompt generation frameworks for systematic red-teaming.
- Real-time, adaptive safety mechanisms capable of learning from new evasion techniques, as static filtering is demonstrably insufficient.
- Multimodal filters that inspect both generation intent and visual output, closing the loop between input and image.
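A multimodal defense of the kind proposed above could take a two-stage shape: a semantic check on the prompt before generation, followed by a check on the actual rendered image. The sketch below is a hypothetical illustration, assuming placeholder classifiers (`text_classifier`, `image_classifier`) that return an unsafe-probability score; it is not the paper's proposed implementation.

```python
def moderate(prompt, text_classifier, generate, image_classifier,
             threshold=0.5):
    """Hypothetical two-stage multimodal moderation pipeline:
    screen intent on the text side, then verify the visual output."""
    # Stage 1: semantic prompt screening (refuse before spending compute).
    if text_classifier(prompt) >= threshold:
        return None
    # Stage 2: post-hoc screening of the generated image itself,
    # which obfuscated prompts cannot evade.
    image = generate(prompt)
    if image_classifier(image) >= threshold:
        return None
    return image
```

The key design point is that stage 2 operates on pixels, not strings, so homoglyphs and zero-width characters in the prompt are irrelevant to it; the cost is an extra classifier pass per generation.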
Research may also focus on synthesizing insights from LLM context-jailbreak defenses and generalizing adversarial testing protocols to encompass multimodal generative systems.
Conclusion
This work rigorously establishes that low-effort, minimally engineered prompt attacks substantially undermine the effectiveness of current textual safety filters in T2I diffusion models. The results underscore the urgent need for semantically robust, adaptive guardrails and systematic adversarial testing in multimodal generative AI deployment. The systemic vulnerabilities in prompt-based filtering identified here should inform both the design of next-generation safety architectures and the continual assessment of content moderation efficacy across generative platforms.