ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users (2405.19360v3)

Published 24 May 2024 in cs.CR and cs.AI

Abstract: Large-scale pre-trained generative models are taking the world by storm thanks to their ability to generate creative content. Meanwhile, safeguards for these models are being developed to protect users' rights and safety, though most are designed for LLMs. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate a model's safety under malicious prompts. Recent work has found that manually crafted safe prompts can unintentionally trigger unsafe generations. To evaluate the safety risks of text-to-image models more systematically, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both a vision LLM and an LLM to establish a connection between unsafe generations and their prompts, thereby identifying the model's vulnerabilities more efficiently. Through comprehensive experiments, we reveal the toxicity of popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models are available at https://github.com/GuanlinLee/ART.

Authors (5)

Guanlin Li
Kangjie Chen
Shudong Zhang
Jie Zhang
Tianwei Zhang
