
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (2312.02119v3)

Published 4 Dec 2023 in cs.LG, cs.AI, cs.CL, cs.CR, and stat.ML

Abstract: While LLMs display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an attacker LLM to iteratively refine candidate (attack) prompts until one of the refined prompts jailbreaks the target. In addition, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks, reducing the number of queries sent to the target LLM. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4-Turbo and GPT4o) for more than 80% of the prompts. This significantly improves upon the previous state-of-the-art black-box methods for generating jailbreaks while using a smaller number of queries than them. Furthermore, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.

Citations (138)

Summary

  • The paper introduces TAP, a novel method that automatically generates jailbreak prompts for LLMs through tree-of-thought reasoning.
  • It leverages a tripartite mechanism involving attacker, evaluator, and target LLMs to achieve over 80% success with fewer than 30 queries.
  • The study highlights critical vulnerabilities in LLM safety measures, prompting the need for more robust security frameworks against such automated black-box attacks.

The paper "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" introduces an innovative approach to identifying vulnerabilities in LLMs through an automated method named Tree of Attacks with Pruning (TAP). The primary goal of TAP is to generate "jailbreak" prompts that exploit and circumvent the safety measures embedded within LLMs, thereby forcing them to generate harmful, biased, or toxic content. Remarkably, this method operates under a black-box setting, meaning it does not require any internal access or modification of the target model.

Core Methodology

TAP employs a tripartite mechanism involving an attacker LLM, an evaluator LLM, and the target LLM:

  1. Initiation: The attack starts with an empty or baseline prompt.
  2. Tree-of-Thought Reasoning: The attacker LLM generates candidate refinements of the initial prompt. This step involves listing possible variations and improvements to the attack prompt.
  3. Pruning Process: The evaluator LLM assesses the generated prompts and filters out those unlikely to be effective, reducing the number of unproductive queries sent to the target LLM.
  4. Evaluation and Iteration: The filtered prompts are then tested against the target LLM. Successful jailbreaks are recorded, while non-successful but promising prompts are iteratively refined and re-evaluated.

The iterative process is organized as a tree: each branch represents a different pathway of potential refinements, and pruning focuses the query budget on the most promising branches.
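To make the branch-prune-query cycle concrete, here is a minimal Python sketch of one way it could be wired together. This is not the authors' implementation: the callables `attacker`, `evaluator_on_topic`, `evaluator_score`, and `target` are hypothetical wrappers around the attacker, evaluator, and target LLMs, and the branching factor, width, depth, and success threshold are placeholder values standing in for the paper's hyperparameters.

```python
# Minimal sketch of a TAP-style loop (illustrative only, not the paper's code).
# `attacker`, `evaluator_on_topic`, `evaluator_score`, and `target` are assumed
# helper functions wrapping the three LLMs.

def tap_attack(goal, branching=4, width=10, max_depth=10, threshold=10):
    """Return a prompt that jailbreaks the target for `goal`, or None on failure."""
    # Each leaf keeps the attacker's conversation history for further refinement.
    leaves = [{"history": [], "prompt": goal}]

    for _ in range(max_depth):
        children = []
        for leaf in leaves:
            # 1. Branch: the attacker proposes several refinements of the prompt.
            for prompt in attacker(goal, leaf["history"], n=branching):
                children.append({"history": leaf["history"] + [prompt],
                                 "prompt": prompt})

        # 2. Prune (phase 1): drop prompts the evaluator judges off-topic,
        #    so they are never sent to the target.
        children = [c for c in children if evaluator_on_topic(goal, c["prompt"])]

        # 3. Query the target and let the evaluator score each response.
        for c in children:
            c["response"] = target(c["prompt"])
            c["score"] = evaluator_score(goal, c["prompt"], c["response"])
            if c["score"] >= threshold:  # success: the target was jailbroken
                return c["prompt"]

        # 4. Prune (phase 2): keep only the top-scoring leaves for the next round.
        leaves = sorted(children, key=lambda c: c["score"], reverse=True)[:width]

    return None
```

In this sketch, phase-1 pruning happens before any target query, which is what keeps the average query count low; phase-2 pruning bounds the width of the tree between iterations.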

Empirical Evaluation

The effectiveness of TAP was benchmarked against multiple state-of-the-art LLMs, including GPT-4, GPT-4 Turbo, PaLM-2, and Vicuna-13B. The experimental results were striking:

  • TAP achieved a jailbreak success rate exceeding 80% across tested models.
  • Despite the high success rate, TAP required fewer than 30 queries on average to achieve successful jailbreaks, underscoring its efficiency.
  • The method outperformed existing black-box approaches by a significant margin, both in terms of success rate and the number of queries required.

Additionally, TAP was tested against models protected by advanced safeguard mechanisms like LlamaGuard, showcasing its ability to defeat sophisticated defense strategies.

Transferability and Broader Implications

One of the significant findings of this paper is the high degree of transferability of jailbreak prompts across different LLMs. For instance, a prompt crafted to breach one model often found success when applied to another. This universal vulnerability indicates that current safety architectures may share common failings that malicious actors could exploit.

The paper raises important questions about how easily such automated black-box attacks can be mounted, emphasizing the need for more robust security paradigms for LLMs. The results also suggest that not only can smaller, less-aligned models be used to break larger ones, but that more advanced and capable LLMs may paradoxically be easier to jailbreak.

Conclusions

In summary, TAP provides a powerful and efficient tool for probing and exploiting the vulnerabilities of LLMs without requiring direct access to their internal mechanisms. While the paper highlights a methodological leap in attack strategies, it also underscores the need for concerted efforts in enhancing the resilience and security of AI systems against such sophisticated automated threats.
