Methodology and Key Observations
While investigating how aligned LLMs respond to adversarial tactics, the authors made a key observation: the decoding distribution of a model produced by standard alignment techniques differs from that of its jailbroken counterpart only in the initial generations. Building on this, the paper introduces weak-to-strong jailbreaking, an attack in which adversaries use smaller, unsafe or unaligned LLMs to subvert significantly larger, well-aligned models. The attack only requires decoding two smaller LLMs alongside the target, adding minimal computation and latency compared with decoding the larger model alone.
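To make the observation concrete, the sketch below compares an aligned model's next-token distributions with those of a jailbroken variant, position by position; under the paper's finding, the divergence should be concentrated in the first few tokens. The model names and the prompt are placeholders, not the paper's actual checkpoints.

```python
# Sketch: locate where an aligned model's decoding distribution diverges from a
# jailbroken variant of the same model. Names and data are illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

aligned = AutoModelForCausalLM.from_pretrained("org/aligned-model")        # placeholder name
jailbroken = AutoModelForCausalLM.from_pretrained("org/jailbroken-model")  # placeholder name
tok = AutoTokenizer.from_pretrained("org/aligned-model")

text = "<a harmful prompt plus a shared response>"  # illustrative only
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logp_a = F.log_softmax(aligned(ids).logits, dim=-1)     # [1, seq_len, vocab]
    logp_j = F.log_softmax(jailbroken(ids).logits, dim=-1)  # [1, seq_len, vocab]

# Per-position KL(jailbroken || aligned); the paper's observation predicts that
# large values appear only at the first few generated positions.
kl = (logp_j.exp() * (logp_j - logp_a)).sum(dim=-1).squeeze(0)
for t, value in enumerate(kl.tolist()):
    print(f"position {t}: KL = {value:.3f}")
```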
Experimental Validation
The researchers ran empirical studies on LLMs from various organizations, applying weak-to-strong jailbreaking to probe each model's safety alignment. The attack sharply increased the misalignment rate: well-aligned LLMs were subverted by small, weakly aligned models, with misalignment rates exceeding 99% on the AdvBench and MaliciousInstruct datasets. These results underscore the need for stronger defenses that keep LLMs aligned and guard against adversarial misuse.
Weak Model, Strong Influence
A particularly troubling implication is that adversaries do not need substantial computational resources to make a large model generate harmful content. A small model under the adversary's control, obtained either through adversarial fine-tuning or by skipping alignment, can steer a much stronger model. Simple log-probability algebra is central to the attack: at each decoding step, the gap between the unsafe weak model's token log probabilities and those of its safe counterpart is used to shift the strong model's distribution, jailbreaking far more capable models.
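A minimal sketch of this log-probability algebra is shown below, assuming access to per-step logits from three models (the strong aligned target, a weak safe model, and a weak unsafe model). The function name and the amplification factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of weak-to-strong decoding: the strong model's next-token distribution
# is shifted by the disagreement between a weak unsafe model and its weak safe
# counterpart, then renormalized before sampling.
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong_logits, weak_safe_logits, weak_unsafe_logits, alpha=1.0):
    """Combine one decoding step of three models in log-probability space."""
    logp_strong = F.log_softmax(strong_logits, dim=-1)
    logp_safe = F.log_softmax(weak_safe_logits, dim=-1)
    logp_unsafe = F.log_softmax(weak_unsafe_logits, dim=-1)
    # log p~(x) is proportional to log p_strong(x) + alpha * (log p_unsafe(x) - log p_safe(x))
    combined = logp_strong + alpha * (logp_unsafe - logp_safe)
    return F.softmax(combined, dim=-1)  # renormalize into a proper distribution

if __name__ == "__main__":
    # Toy demo with random logits over a tiny vocabulary; in a real attack these
    # would come from the three models at the current decoding position.
    vocab = 8
    s, ws, wu = torch.randn(vocab), torch.randn(vocab), torch.randn(vocab)
    probs = weak_to_strong_step(s, ws, wu, alpha=1.0)
    next_token = torch.multinomial(probs, num_samples=1)
    print(probs, next_token)
```

At generation time, the next token is sampled from this combined distribution rather than the strong model's own distribution, which is why only the two weak models add decoding cost.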
The Road to Robust Defense
Although the paper exposes an efficient new way to jailbreak LLMs, it also proposes an initial defense: a gradient ascent scheme applied to harmful generations, sketched below, which reduces the attack success rate by 20%. The authors acknowledge that building more robust defenses remains a difficult but critical problem and call for broader community effort to improve LLM alignment and curb potential exploitation. The paper closes with a call to action for mechanisms that ensure powerful LLMs are integrated into society safely and beneficially.
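The sketch below illustrates the gradient-ascent idea in its simplest form: the language-modeling loss on known harmful generations is negated so that optimization pushes probability mass away from them. The model name, data, and hyperparameters are placeholders rather than the paper's exact recipe.

```python
# Sketch: gradient ascent on harmful generations to reduce the probability an
# aligned model assigns to them. Names, data, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("org/aligned-model")  # placeholder name
tok = AutoTokenizer.from_pretrained("org/aligned-model")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative hyperparameters

harmful_texts = ["<prompt paired with a known harmful response>"]  # placeholder data

model.train()
for text in harmful_texts:
    batch = tok(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Negate the language-modeling loss so the optimizer increases it,
    # lowering the probability assigned to harmful continuations.
    (-outputs.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```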