Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models (2408.00523v3)

Published 1 Aug 2024 in cs.CR, cs.AI, and cs.LG

Abstract: Text-to-image (T2I) generative models have revolutionized content creation by transforming textual descriptions into high-quality images. However, these models are vulnerable to jailbreaking attacks, where carefully crafted prompts bypass safety mechanisms to produce unsafe content. While researchers have developed various jailbreak attacks to expose this risk, these methods face significant limitations, including impractical access requirements, easily detectable unnatural prompts, restricted search spaces, and high query demands on the target system. In this paper, we propose JailFuzzer, a novel fuzzing framework driven by LLM agents, designed to efficiently generate natural and semantically meaningful jailbreak prompts in a black-box setting. Specifically, JailFuzzer employs fuzz-testing principles with three components: a seed pool for initial and jailbreak prompts, a guided mutation engine for generating meaningful variations, and an oracle function to evaluate jailbreak success. Furthermore, we construct the guided mutation engine and oracle function by LLM-based agents, which further ensures efficiency and adaptability in black-box settings. Extensive experiments demonstrate that JailFuzzer has significant advantages in jailbreaking T2I models. It generates natural and semantically coherent prompts, reducing the likelihood of detection by traditional defenses. Additionally, it achieves a high success rate in jailbreak attacks with minimal query overhead, outperforming existing methods across all key metrics. This study underscores the need for stronger safety mechanisms in generative models and provides a foundation for future research on defending against sophisticated jailbreaking attacks. JailFuzzer is open-source and available at this repository: https://github.com/YingkaiD/JailFuzzer.

Summary

  • The paper presents Atlas, an LLM-based multi-agent framework that systematically bypasses safety filters in state-of-the-art text-to-image models.
  • It employs mutation, critic, and commander agents to iteratively refine prompt modifications using historical data and in-context learning.
  • Atlas achieves near 100% bypass rates with low query counts and high semantic consistency, outperforming prior jailbreak solutions.

Jailbreaking Text-to-Image Models with LLM-Based Agents

This paper addresses an important and under-explored area of generative AI safety, specifically the vulnerabilities in the safety mechanisms of text-to-image (T2I) models. The proposed solution, Atlas, is an LLM-based multi-agent framework designed to perform automated jailbreak attacks on state-of-the-art T2I models protected by various safety filters. Through iterative and collaborative processes involving LLMs and vision-language models (VLMs), Atlas circumvents these protections and generates images whose content the safety filters are designed to block.

Contributions and Methodology

Atlas Framework: The main contribution is a multi-agent framework in which a mutation agent, a critic agent, and a commander agent each play a specific role in the jailbreak process. The mutation agent utilizes VLMs to propose prompt modifications, the critic agent leverages LLMs to score these modifications, and the commander agent orchestrates the workflow.

  1. Mutation Agent: The mutation agent drives the iterative mutation process. It detects when a prompt triggers a safety filter and adjusts the prompt using historical data and in-context learning. This memory mechanism ensures that past successful jailbreaks inform future attempts, leading to continual improvement.
  2. Critic Agent: The critic agent scores the mutation agent's suggested prompts by evaluating how likely they are to bypass the safety filter and how semantically similar they are to the target content. This step uses two different LLM-based brains for measuring bypass potential and semantic consistency.
  3. Commander Agent: The commander agent acts as the control unit, guiding the other agents through the workflow stages. It manages the multi-turn reasoning process, optimizes prompt selection, and ensures the highest-scoring prompt is tested against the T2I model. A minimal code sketch of one such mutation round follows this list.
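To make the division of labor concrete, here is a minimal Python sketch of one mutation round. The callables `mutate`, `score_bypass`, `score_semantics`, and `query_t2i`, along with the default threshold and candidate count, are hypothetical stand-ins for the VLM/LLM agents and the target T2I API; none of these names or values come from the paper.

```python
# Minimal sketch of one Atlas-style mutation round. All callables passed in
# are hypothetical stand-ins, not the paper's implementation.

def run_round(prompt, memory, mutate, score_bypass, score_semantics,
              query_t2i, sim_threshold=0.7, n_candidates=4):
    """One iteration: mutate -> critique -> select -> test."""
    # Mutation agent: propose candidate rewrites, conditioned on past
    # successful jailbreaks kept in the in-context memory.
    candidates = mutate(prompt, memory, n_candidates)

    # Critic agent: two LLM "brains" score each candidate for (a) how likely
    # it is to slip past the safety filter and (b) how close it stays to the
    # semantics of the original target prompt.
    scored = []
    for cand in candidates:
        semantic = score_semantics(cand, prompt)
        if semantic < sim_threshold:   # discard rewrites that drift too far
            continue
        scored.append((score_bypass(cand), semantic, cand))

    if not scored:
        return None  # commander agent requests another mutation round

    # Commander agent: pick the highest-scoring candidate and test it
    # against the target text-to-image model.
    best = max(scored, key=lambda s: (s[0], s[1]))[2]
    image, blocked = query_t2i(best)
    if not blocked:
        memory.append((prompt, best))  # remember the successful rewrite
        return image
    return None
```

Injecting the agents as callables is only a convenience for the sketch; it keeps the commander's control flow separate from whichever LLM or VLM backend implements each agent.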

Results

Evaluation: Atlas is evaluated against state-of-the-art T2I models (SD1.4, SDXL, SD3, DALL·E 3) and various safety filters encompassing text, image, and multimodal checks. Atlas demonstrated high efficiency in bypassing these filters, achieving near-100% bypass rates in many cases with low query counts while maintaining strong semantic similarity to the target content (as measured by FID scores).
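For intuition about the two headline metrics, the snippet below computes a bypass rate and an FID score with the torchmetrics package (which relies on torch-fidelity for the Inception features). This is a generic metric sketch, not the paper's evaluation harness; the random placeholder tensors simply stand in for reference and generated images.

```python
# Illustrative metric computation (not the paper's evaluation code).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def bypass_rate(outcomes):
    """Fraction of attack attempts whose prompt slipped past the safety filter."""
    return sum(outcomes) / len(outcomes)

# FID compares the distribution of generated images with reference images.
# torchmetrics expects uint8 tensors of shape (N, 3, H, W); feature=64 keeps
# this toy example small (2048 is the standard setting).
fid = FrechetInceptionDistance(feature=64)
reference = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
generated = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fid.update(reference, real=True)
fid.update(generated, real=False)

print("FID:", fid.compute().item())
print("Bypass rate:", bypass_rate([True, True, False, True]))  # 0.75
```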

Effectiveness against Baselines: When compared to existing jailbreak methods such as SneakyPrompt, DACA, and Ring-A-Bell, Atlas exhibits superior performance in terms of bypass success rates, query efficiency, and FID scores. This highlights the framework's robustness and effectiveness in a broader set of scenarios, including against more sophisticated, commercial models like DALL·E 3.

Ablation Studies: Further experiments analyzed the impact of individual components (e.g., the length of the memory module) and hyperparameters (e.g., the semantic-similarity threshold) on the overall performance of Atlas. The studies confirmed the necessity of the multi-agent architecture and effective memory utilization, as well as the importance of well-tuned hyperparameters for optimal performance.
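As an illustration of what a semantic-similarity threshold can look like in practice, the sketch below scores prompt/image agreement with an off-the-shelf CLIP model from Hugging Face transformers and accepts a candidate only above a cutoff. The model name and the 0.25 threshold are illustrative choices, not values from the paper.

```python
# Illustrative CLIP-based semantic check (the threshold is arbitrary,
# not the one tuned in the paper's ablation study).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_score(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, img_emb).item()

def keep_candidate(prompt: str, image: Image.Image, threshold: float = 0.25) -> bool:
    # Accept the mutated prompt only if the generated image still matches
    # the intended semantics of the original target prompt.
    return semantic_score(prompt, image) >= threshold
```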

Implications and Future Work

Practical Implications: Atlas provides a robust automated framework for probing and identifying safety vulnerabilities in T2I models. This has significant implications for the deployment of these models in real-world applications where content safety is paramount. The insights gained from Atlas could guide the development of more resilient safety mechanisms in generative AI systems.

Theoretical Contributions: The multi-agent approach and the integration of fuzzing techniques with advanced LLM-driven reasoning demonstrate a novel methodology applicable to other domains of AI safety research. This work opens the door for further exploration into how autonomous agents can enhance security in AI models, beyond just the field of T2I.

Future Directions: Future research should focus on improving the generalization of Atlas across different model architectures and expanding its applicability to other generative AI modalities. Additionally, integrating more capable LLMs, such as GPT-4, once their own safety alignment no longer blocks the agents' reasoning, may yield even more effective results. Another important avenue is the development of certified robustness techniques to safeguard AI models against such sophisticated jailbreak attacks.

Conclusion

This paper presents Atlas, a pioneering framework that leverages autonomous LLM-based agents to systematically bypass safety mechanisms in T2I models. The framework's high success rate across multiple state-of-the-art models and safety filters underscores its potential to revolutionize the way we approach AI safety research, pushing the boundaries of autonomous agent capabilities in the field of security.