"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (2308.03825v2)

Published 7 Aug 2023 in cs.CR and cs.LG

Abstract: The misuse of LLMs has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

PDF Abstract

Characterizing and Evaluating In-The-Wild Jailbreak Prompts on LLMs

The paper "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on LLMs" presents a comprehensive paper on the misuse of LLMs through adversarial prompts known as jailbreak prompts. This paper stands as the first to methodically collect, characterize, and evaluate jailbreak prompts from multiple online platforms and offers insights into their effect on LLMs’ safety mechanisms.

Jailbreak prompts are crafted to bypass the safeguards of LLMs, compelling them to output harmful content. The researchers collected 6,387 prompts over six months from platforms such as Reddit, Discord, and others, identifying 666 as jailbreak prompts. Utilizing NLP and graph-based community detection, the paper uncovers unique traits and strategies of these prompts, which evolve to evade detection. Jailbreak strategies include prompt injection, privilege escalation, and deception.

The paper's findings suggest a troubling trend: jailbreak prompts, originally shared on public platforms like Reddit, are moving to private platforms such as Discord, limiting the ability of LLM vendors to proactively detect these threats. The evolution of jailbreak prompts shows a reduction in length, accompanied by an increase in toxicity, indicating adversaries are optimizing for both stealth and efficacy.

For quantitative evaluation, the paper presents a dataset comprising 46,800 samples across 13 forbidden scenarios and measures the performance of five representative LLMs: ChatGPT (GPT-3.5), GPT-4, ChatGLM, Dolly, and Vicuna. The results are striking; authoritative jailbreaking prompts achieve near-perfect attack success rates, with some prompts remaining online for extended periods. Surprisingly, even sophisticated LLMs like GPT-4 show vulnerabilities when faced with these prompts, suggesting the defensive mechanisms currently employed are inadequate.

The implications are significant for LLM deployment and development. The paper contributes to understanding the threat landscape, aligning safer LLM development, and informing policy-making. It suggests that external safeguards like OpenAI moderation endpoint and NeMo-Guardrails offer minimal mitigation, indicating a critical need for improved defensive mechanisms and community-driven solutions to enhance model robustness.

Overall, this paper highlights the urgent need to address the security vulnerabilities posed by jailbreak prompts, particularly as LLMs become more integrated into critical applications. Effective countermeasures require collaborative efforts from researchers, developers, and policymakers to ensure that LLMs not only advance AI capabilities but also maintain safety and public trust.