- The paper tests large language model guardrails using novel "moralized" multi-step jailbreak prompts, revealing vulnerabilities to verbal attacks across five leading models via black-box testing.
- The methodology involves a seven-stage process that incrementally builds context and normalizes harmful behavior under a guise of moral justification to bypass guardrail defenses.
- Results show all tested LLMs (GPT-4o, Grok-2 Beta, Llama 3.1, Gemini 1.5, Claude 3.5 Sonnet) can be bypassed, with Claude 3.5 demonstrating comparatively greater resilience and Grok-2 Beta showing higher attack success rates.
This paper, titled "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in LLMs for Verbal Attacks," investigates the vulnerability of LLM guardrails to ethically motivated, multi-step jailbreak prompts. The author, Libo Wang, evaluates the efficacy of guardrails in GPT-4o, Grok-2 Beta, Llama 3.1 (405B), Gemini 1.5, and Claude 3.5 Sonnet in preventing the generation of verbally aggressive content.
The core idea revolves around exploiting the limitations of current LLM guardrails, which primarily focus on immediate single-prompt censorship. The author argues that by constructing a context through a series of seemingly innocuous prompts, one can gradually guide the model towards generating harmful content by cleverly concealing the ultimate malicious intent. The paper emphasizes the vulnerability of guardrails when faced with multi-step prompts that introduce ethical ambiguity and gradually normalize verbal violence under the guise of moral justification.
The methodology employs black-box testing, a technique that assesses system behavior based solely on inputs and outputs, without requiring access to internal model parameters or architecture. This approach aligns with the paper's objective of evaluating guardrail effectiveness without violating model ethics and security policies. The experiment is structured around a simulated scenario: a corporate middle manager seeking a promotion through unethical means. This scenario provides a context for the multi-step prompts.
The experimental process consists of seven stages:
- Introduce Neutral Context: Presenting a hypothetical scenario seeking solutions, such as a middle manager strategizing to overcome promotion obstacles.
- Incremental Manipulation: Gradually introducing subtle false narratives where the manager criticizes the status quo to mask personal shortcomings.
- Introduce Moral Ambiguity: Reframing unethical actions as innovative rule-breaking.
- Escalate Targeted Criticism: Openly criticizing competitors through prompts that depict them as fictional characters engaging in unethical behavior.
- Seek Reasons for Immorality: Justifying competitors' supposed unethical actions based on personal factors like family background or genetic predispositions.
- Normalize Verbal Violence: Questioning the model on the acceptability of criticizing or even insulting competitors due to their immoral actions.
- Moral Kidnapping and Threats: Summarizing all previous steps and instructing the model to express it in the first person, using profanity to emphasize the "morality" of the unethical behavior.
The paper includes a "Control Prompts & Responses" section which demonstrates that when provided with a single, explicitly offensive prompt, the tested LLMs appropriately refuse to generate harmful content. This underscores the importance of the multi-step approach used in the study.
The dataset comprises responses from the LLMs to both control prompts and the staged multi-step jailbreak prompts, hosted on a Github repository for transparency.
The results are analyzed using metrics such as precision, recall, F1 score, attack success rate, toxicity rate, and adversarial robustness. These metrics are calculated based on whether each step successfully bypassed the guardrails, effectively a binary classification of "jailbreak" or "not jailbreak". The results indicate that all tested LLMs could be bypassed and generate verbal attacks using the "moralized" multi-step jailbreak prompts. However, Claude 3.5 Sonnet demonstrated a relatively higher resistance to the attacks compared to the other models. In particular, Claude 3.5 Sonnet had the highest precision (67%), recall (22.2%) and F1 score (33.3%), and the lowest attack success rate (77.8%), indicating a better balance between identifying harmful inputs and maintaining output quality. Grok-2 Beta had the highest attack success rate (90.9%), suggesting its guardrails are relatively vulnerable to this type of attack. Gemini 1.5 had the highest toxicity rate (35.7%), meaning that the model had a greater tendency to generate toxic content after a successful attack.
The limitations acknowledge the inherent architectural differences between the tested LLMs, as well as the potential biases and limitations stemming from the dataset's small size.
The conclusion highlights the vulnerability of existing LLM guardrails to multi-step jailbreak prompts in complex environments, underscoring the need for improved defense mechanisms. The author suggests the findings serve as a reminder and warning to LLM developers and a direction for future research. The paper emphasizes that while all guardrails were eventually breached, Claude 3.5 Sonnet demonstrated comparatively greater resilience.