Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Published 7 May 2025 in cs.CL, cs.AI, and cs.CY | (2505.04146v1)

Abstract: Existing LLMs are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.


Authors (2)

Summary

The paper introduces the UTCB benchmark to systematically measure vulnerabilities in image generation and LLMs using adversarial prompts.
It employs structured prompt engineering, multilingual obfuscation, and automated evaluation to curate over 6,700 prompts into Bronze, Silver, and Gold verification tiers.
Experimental results reveal significant safety risks in generating harmful content, underscoring the need for adaptive, comprehensive testing.


Introduction

The paper "Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety" introduces the Unmasking the Canvas Benchmark (UTCB), a dynamic and scalable benchmark for evaluating how vulnerable LLMs are in image generation contexts. Prompt-based jailbreaks pose a significant threat to content safety because they can lead to the generation of compromising images, such as forged documents, manipulated images of public figures, or explicit content. The research extends the notion of adversarial prompts to multimodal systems, drawing attention to the underexplored area of image generation vulnerabilities.

Figure 1: Example of a successful image-based jailbreak on the Grok model. The model responded with a policy-violating output. (Redacted)

Methodology

The UTCB framework is built on structured prompt engineering, multilingual obfuscation, and a pipeline for large-scale evaluation. Prompts are created by analyzing existing datasets such as JAILBREAKHUB and are then used to test models like ChatGPT, Grok, and MetaAI. Obfuscation relies on low-resource languages such as Zulu and Gaelic, as well as Base64 encoding. The pipeline also uses Groq-hosted LLaMA models to generate synthetic jailbreak prompts tailored to image outputs.

Figure 2: Input prompt structure used to generate prompts at scale.

Prompt generation is automated to keep evaluation cost-effective. The prompts are parsed, cleaned, and categorized by attack template and language, and then tested against LLaMA models that stand in for image generators. This step filters the prompts to highlight those that lead to clear denials or harmful jailbreaks.
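The parse, clean, and categorize step described above can be illustrated with a minimal sketch. This is not the authors' code: the `PromptRecord` type, the `clean` and `categorize` helpers, the cleaning rules, and the `role_play` template label are all assumptions made for illustration, and the prompt texts are harmless placeholders.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    text: str
    template: str   # hypothetical attack-template label, e.g. "split_image"
    language: str   # hypothetical language/encoding label, e.g. "zulu", "base64"


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip surrounding quote characters."""
    return " ".join(text.split()).strip("\"'")


def categorize(raw_rows):
    """Clean rows, drop empties and duplicates, and count by (template, language)."""
    seen = set()
    records = []
    for text, template, language in raw_rows:
        cleaned = clean(text)
        key = (cleaned.lower(), template, language)
        if not cleaned or key in seen:
            continue
        seen.add(key)
        records.append(PromptRecord(cleaned, template, language))
    counts = Counter((r.template, r.language) for r in records)
    return records, counts


# Placeholder rows only; none of these come from the benchmark.
rows = [
    ('  "placeholder prompt A" ', "split_image", "english"),
    ("placeholder   prompt A", "split_image", "english"),  # duplicate after cleaning
    ("placeholder prompt B", "role_play", "zulu"),
    ("   ", "role_play", "zulu"),                          # empty after cleaning
]
records, counts = categorize(rows)
```

Keying the counts on the (template, language) pair mirrors the kind of breakdown a per-type, per-language tally would need, though the actual schema used by UTCB is not specified in this summary.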

Experimental Results

The research curates over 6,700 image-generation prompts, which are categorized into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. The tiers are supported by metadata used for automated tagging and risk scoring, and by an access-controlled annotation interface that serves as an ethical safeguard.
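One way to picture the tiering is as a property derived from how far a record has been verified. The sketch below is an assumption-laden illustration, not the paper's schema: the `Generation` class and its `llm_label`, `human_label`, and `risk_score` fields are hypothetical names, and only the three tier meanings come from the paper's abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    BRONZE = "bronze"  # non-verified
    SILVER = "silver"  # LLM-aided verification
    GOLD = "gold"      # manually verified


@dataclass
class Generation:
    prompt_id: str
    llm_label: Optional[str] = None     # hypothetical: label from automated tagging
    human_label: Optional[str] = None   # hypothetical: label from manual review
    risk_score: Optional[float] = None  # hypothetical: the scoring scale is not given here

    @property
    def tier(self) -> Tier:
        """Manual review outranks LLM tagging; no label at all means Bronze."""
        if self.human_label is not None:
            return Tier.GOLD
        if self.llm_label is not None:
            return Tier.SILVER
        return Tier.BRONZE


unverified = Generation("p1")
auto_tagged = Generation("p2", llm_label="denied")
reviewed = Generation("p3", llm_label="denied", human_label="denied")
```

Deriving the tier from the labels, rather than storing it separately, keeps a record from claiming a Gold tier without a manual label attached.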

Figure 3: Number of prompts curated across types and languages.

Prompt generation by LLaMA models proved effective but revealed challenges. For instance, obfuscated prompts sometimes confused the models due to linguistic deficiencies, showcasing vulnerabilities in handling non-English prompts.

Figure 4: Distribution of manually labeled vs. auto-tagged (using LLaMA) prompts, and prompts that are yet to be tested.

Among attack types, the "Split Image" attack and Base64 obfuscation emerged as particularly potent methods for bypassing content filters, achieving higher success rates in generating harmful content than the other approaches tested.
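A per-attack-type success rate of the kind compared here is simply the fraction of tested prompts of that type that bypassed the filter. The sketch below shows that arithmetic under stated assumptions: `success_rates` is a hypothetical helper, and the `type_a` / `type_b` outcomes are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def success_rates(outcomes):
    """Return {attack_type: bypasses / attempts} for (attack_type, bypassed) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for attack_type, bypassed in outcomes:
        totals[attack_type] += 1
        hits[attack_type] += int(bypassed)
    return {name: hits[name] / totals[name] for name in totals}


# Placeholder outcomes for illustration only; not data from the benchmark.
toy_outcomes = [
    ("type_a", True),
    ("type_a", False),
    ("type_b", False),
    ("type_b", False),
]
rates = success_rates(toy_outcomes)
```

Rates computed this way are only comparable across attack types when each type has been tested on a reasonable number of prompts, so the attempt counts are worth reporting alongside them.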

Evaluation and Vulnerabilities

The results indicate significant risks, especially in highly malicious categories like NSFW content, where models often generated images that fulfilled the malicious intent of the input prompts. Analysis of the Gold-tier prompts further underlined a weakness: models that reliably deny straightforward malicious prompts often falter on creatively structured ones.

Figure 5: Number of prompts curated by the prompt generator model compared to manually curated prompts from authors.

The implementation of an adversarial toolkit demonstrates a sobering reality: harmful prompts can be generated and verified at scale using low-cost, publicly available models. This ease of use highlights the critical need for comprehensive and dynamic datasets like UTCB to train and test LLMs on continuous streams of diverse adversarial inputs.

Limitations and Ethical Considerations

The study was limited by API token restrictions and used mainly text-based LLaMA models, which, while cost-effective, may not fully represent actual image generator capabilities. Furthermore, the research highlights ethical considerations, with datasets being privately held and accessible strictly for research purposes.

Conclusions and Future Work

The UTCB paper illustrates a methodical approach to addressing vulnerabilities in multimodal frameworks by focusing on adversarial image generation. This work underlines the need for continuous adaptation of benchmarks and datasets to account for emerging threats. Future research may incorporate more comprehensive datasets, broader language support, and insights into GPT-based models' defensive and cooperative behaviors. Through these expansions, the research aims to develop models with enhanced resilience against sophisticated, evolving jailbreak tactics.
