The paper "GPTFUZZER: Red Teaming LLMs with Auto-Generated Jailbreak Prompts" (Yu et al., 2023) introduces an automated framework, GPTFuzzer, for generating adversarial jailbreak prompts to evaluate the safety alignment of LLMs. This approach adapts principles from traditional software fuzzing, specifically AFL (American Fuzzy Lop), to the domain of LLM red teaming, addressing the limitations of manual prompt crafting, which struggles with scalability, labor intensity, coverage, and adaptability against rapidly evolving LLMs.
Problem Formulation and Motivation
LLMs, despite undergoing safety training (e.g., via RLHF), remain susceptible to jailbreak attacks where carefully constructed prompts circumvent safety protocols, eliciting harmful or unethical content. The prevailing method for discovering such vulnerabilities relies on manual prompt engineering. This manual process is inherently inefficient for systematic red teaming due to the vastness of the prompt space, the diversity of LLMs, and their frequent updates. The paper posits that an automated, black-box approach is necessary for scalable and efficient identification of jailbreak vulnerabilities. GPTFuzzer is proposed as such a solution, automating the generation and testing of jailbreak prompts without requiring access to the target LLM's internal parameters or architecture.
The GPTFuzzer Framework
GPTFuzzer operates based on an evolutionary fuzzing loop, analogous to AFL, aiming to discover effective jailbreak prompts through iterative mutation and selection. The core workflow proceeds as follows:
- Initialization: The process starts with a corpus of human-written jailbreak templates sourced from public repositories; 77 such universal, single-turn templates (using placeholders like [INSERT PROMPT HERE]) were collected and curated to serve as the initial seeds for the fuzzing process.
- Seed Selection: An algorithm selects a seed template from the current pool for mutation. GPTFuzzer employs a strategy called MCTS-Explore, derived from Monte Carlo Tree Search, designed to balance exploitation (favoring seeds with high historical success rates) and exploration (sampling less-tested or novel seeds). This strategy aims to overcome limitations of simpler methods such as Random, Round Robin, and standard UCB, particularly in avoiding premature convergence on suboptimal templates.
- Mutation: The selected seed template is mutated using another LLM (ChatGPT with temperature=1.0 in the experiments) to generate a new candidate template. This ensures linguistic coherence and semantic relevance. Several mutation operators are defined to introduce diversity.
- Execution: The mutated template is combined with a specific harmful query (drawn from a predefined set) to form a complete jailbreak prompt. This prompt is then sent to the target LLM.
- Judgment: The LLM's response is evaluated by an automated Judgment Model to determine if the jailbreak attempt was successful. Success is categorized as "Full Compliance" (harmful content provided without reservation) or "Partial Compliance" (harmful content provided with warnings).
- Feedback: If the judgment model classifies the response as a successful jailbreak, the corresponding mutated template is added to the seed pool, potentially being selected for future mutation rounds. Unsuccessful templates are discarded.
- Iteration: Steps 2-6 are repeated until a predefined condition, such as a query budget limit, is met.
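The loop above can be sketched in a few lines of Python. The injected callables (`select_seed`, `mutate`, `query_target`, `judge`) are hypothetical stand-ins for the components the paper describes, not the authors' implementation:

```python
# Minimal sketch of GPTFuzzer's fuzzing loop as described above.
# select_seed, mutate, query_target, and judge are placeholders for the
# seed-selection strategy, LLM-based mutator, target-LLM API call, and
# judgment model, respectively.

def fuzz(initial_templates, questions, select_seed, mutate, query_target, judge,
         budget=100):
    """Return templates that jailbroke at least one question within budget."""
    seed_pool = list(initial_templates)      # human-written jailbreak templates
    successful = []
    queries = 0
    while queries < budget:
        seed = select_seed(seed_pool)        # e.g. MCTS-Explore
        candidate = mutate(seed)             # one of the five operators
        jailbroke = False
        for question in questions:
            if queries >= budget:
                break
            prompt = candidate.replace("[INSERT PROMPT HERE]", question)
            response = query_target(prompt)  # black-box call to the target LLM
            queries += 1
            if judge(response):              # judgment model flags success
                jailbroke = True
        if jailbroke:
            seed_pool.append(candidate)      # feedback: retain as a new seed
            successful.append(candidate)
        # unsuccessful candidates are simply discarded
    return successful
```

Because every component is passed in as a callable, the same loop applies unchanged to any API-accessible model, which is what makes the approach black-box.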
This black-box methodology makes GPTFuzzer applicable to a wide range of LLMs, including proprietary, closed-source models accessible only via APIs.
Key Components and Implementation Details
GPTFuzzer's efficacy relies on three core technical components:
- Seed Selection (MCTS-Explore): Traditional fuzzing often struggles with seed scheduling. MCTS-Explore adapts UCB within an MCTS framework. It maintains a tree structure representing the lineage of templates derived through mutations. Each node stores statistics such as its visit count $n(v)$ and success count $s(v)$. The selection score for a node $v$ with parent $p$ follows the standard UCT form:

$$\text{score}(v) = \frac{s(v)}{n(v)} + c\sqrt{\frac{2\ln n(p)}{n(v)}},$$

where $c$ is an exploration constant. MCTS-Explore introduces modifications to encourage broader exploration, such as prioritizing non-leaf nodes and preventing over-selection of specific branches, thereby improving the discovery of diverse and effective jailbreak strategies.
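As a minimal sketch, the exploitation/exploration trade-off in the selection score can be computed as follows. The `Node` layout and the handling of unvisited nodes are assumptions for illustration; MCTS-Explore's additional rules for non-leaf nodes and branch over-selection are omitted:

```python
import math

# Sketch of a UCT-style selection score over the template lineage tree.
# This is a simplification of MCTS-Explore, not the paper's full algorithm.

class Node:
    def __init__(self, template):
        self.template = template
        self.visits = 0        # n(v): times this template was selected
        self.successes = 0     # s(v): successful jailbreaks it produced
        self.children = []     # templates mutated from this one

def uct_score(node, parent, c=1.414):
    if node.visits == 0:
        return float("inf")    # always try unvisited templates first
    exploit = node.successes / node.visits                       # success rate
    explore = c * math.sqrt(2 * math.log(parent.visits) / node.visits)
    return exploit + explore

def select_child(parent):
    """Pick the child maximizing the UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent))
```

The first term rewards templates with a high historical success rate; the second grows for rarely visited templates, pulling the search back toward under-explored branches.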
- Mutation Operators: Five distinct mutation operators, executed by an auxiliary LLM, are employed to generate new templates from seeds:
  - Generate: Creates a new template inspired by the seed's style but with different content.
  - Crossover: Merges parts of two different seed templates to create a hybrid.
  - Expand: Appends new sentences to the beginning of a template.
  - Shorten: Condenses sentences within a template for conciseness.
  - Rephrase: Modifies sentence structure and phrasing while preserving the core meaning.

  The combination of these operators allows exploration of varied stylistic and structural modifications to the initial templates.
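One plausible way to realize these operators is as prompt templates sent to the auxiliary LLM. The prompt wording and function names below are illustrative assumptions, not the paper's exact prompts:

```python
# Hedged sketch of the five mutation operators as prompts for an auxiliary
# LLM (the paper uses ChatGPT at temperature 1.0). Wording is illustrative.

MUTATION_PROMPTS = {
    "generate": "Write a new jailbreak template in a similar style to the "
                "following, but with different content:\n{seed}",
    "crossover": "Merge the following two templates into one coherent "
                 "template:\nTemplate A:\n{seed}\nTemplate B:\n{other}",
    "expand": "Add three sentences to the beginning of this template:\n{seed}",
    "shorten": "Condense this template while keeping its meaning:\n{seed}",
    "rephrase": "Rephrase this template, preserving its intent while "
                "changing the wording and sentence structure:\n{seed}",
}

def mutate(seed, operator, llm, other=""):
    """Apply one operator via the auxiliary LLM.

    llm is any callable str -> str, e.g. a wrapper around a chat API.
    `other` supplies the second template for crossover.
    """
    prompt = MUTATION_PROMPTS[operator].format(seed=seed, other=other)
    return llm(prompt)
```

Delegating the rewriting to an LLM is what keeps mutated templates linguistically coherent, in contrast to byte-level mutations in classic fuzzers like AFL.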
- Judgment Model: Accurate automatic assessment of jailbreak success is critical. Rule-based methods and existing APIs (like OpenAI's Moderation API) were found inadequate. Using LLMs (ChatGPT, GPT-4) as judges proved slow and less accurate. The authors developed a specialized judgment model by fine-tuning RoBERTa-large on a dataset of LLM responses generated using initial seeds and manually labeled for compliance (Full/Partial Compliance vs. Full/Partial Refusal). This fine-tuned classifier achieved 96.16% accuracy, significantly outperforming other methods in both accuracy (higher TPR, lower FPR) and inference speed, making it suitable for the high-throughput demands of fuzzing.
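A minimal sketch of how such a judgment classifier plugs into the loop, assuming the fine-tuned model is exposed as a callable (e.g. a `transformers` text-classification pipeline) and that the label names map to the four compliance/refusal categories (both assumptions, not the paper's exact interface):

```python
# Sketch of the judgment step. The label names below are hypothetical;
# the paper fine-tunes RoBERTa-large on manually labeled responses spanning
# full/partial compliance and full/partial refusal.

SUCCESS_LABELS = {"full_compliance", "partial_compliance"}

def is_jailbreak(response, classifier):
    """Return True if the judge labels the response as a successful jailbreak.

    classifier: callable text -> [{"label": ..., "score": ...}], e.g.
    transformers.pipeline("text-classification", model=<judge checkpoint>).
    """
    label = classifier(response)[0]["label"]
    return label in SUCCESS_LABELS
```

A small fine-tuned classifier is cheap enough to call on every response, which matters because the fuzzing loop issues thousands of queries per campaign.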
Experimental Evaluation and Results
GPTFuzzer was rigorously evaluated against a diverse set of 12 LLMs, including commercial models (ChatGPT, GPT-4, Bard, Claude2, PaLM2) and open-source models (Llama-2 variants, Vicuna variants, Baichuan, ChatGLM2), using a dataset of 46 harmful questions.
- Baseline Human Templates (RQ1): Initial tests revealed that while human-written templates were effective against less robust models like Vicuna-7B (99% Top-1 Attack Success Rate - ASR), they performed poorly against better-aligned models like Llama-2-7B-Chat (20% Top-1 ASR). This highlighted the need for automated generation.
- GPTFuzzer Efficacy (RQ2):
- Against Llama-2-7B-Chat, focusing on the 46 questions failed by all human templates, GPTFuzzer (using Top-5 generated seeds) successfully jailbroke all 46 questions within an average of ~23 queries per question.
- Even when initialized with "invalid" seeds (templates that failed against Llama-2), GPTFuzzer successfully evolved effective jailbreak prompts.
- In multi-question attacks on Llama-2-7B, GPTFuzzer significantly outperformed the best human templates, achieving 60% Top-1 ASR and 87% Top-5 ASR, compared to 20% Top-1 and 47% Top-5 for human templates.
- Against ChatGPT, starting from invalid seeds, GPTFuzzer generated a template achieving 100% Top-1 ASR.
- Universality and Transferability (RQ3): Templates generated by running GPTFuzzer simultaneously against ChatGPT, Llama-2-7B, and Vicuna-7B showed strong transferability to unseen models and questions. The Top-5 generated templates significantly outperformed baseline methods (GCG, Human-Written, Masterkey) across the board. High Top-5 ASRs were observed on numerous models: 100% on Vicuna-7B/13B, Baichuan-13B, ChatGPT; >90% on ChatGLM2-6B, Claude2, PaLM2; ~80% on Llama-2-70B-Chat; >60% on Bard and GPT-4. This demonstrates that GPTFuzzer can discover prompts with broad applicability.
- Component Analysis (RQ4): Ablation studies validated the design choices. The MCTS-Explore strategy consistently outperformed Random, Round Robin, and UCB selection in achieving higher ASR within the query budget. Using the full suite of five mutation operators yielded the best results compared to single operators, with Crossover being the most impactful individual operator. While initial seed quality affected efficiency, GPTFuzzer demonstrated robustness by succeeding even with suboptimal initial seeds.
Discussion and Limitations
GPTFuzzer represents a significant advancement in automating LLM red teaming. Its black-box, fuzzing-based approach enables scalable and efficient discovery of jailbreak vulnerabilities across diverse LLMs. The high success rates and strong transferability of the generated prompts, particularly against robust models like Llama-2 and ChatGPT, underscore the framework's effectiveness.
However, the paper acknowledges several limitations:
- Dependency on Initial Seeds: The framework relies on human-written templates as initial seeds, potentially limiting the exploration space to variations of known jailbreak patterns. It does not generate entirely novel attack vectors from scratch.
- Template-Only Mutation: Mutations are applied only to the template structure, not the harmful query itself.
- Imperfect Judgment: The judgment model, while accurate (96.16%), is not perfect and can misclassify responses, potentially impacting the fuzzing feedback loop.
- Query Cost: Like most fuzzing approaches, GPTFuzzer can be query-intensive, potentially incurring significant costs when targeting commercial API-based LLMs.
The authors followed responsible disclosure practices by notifying LLM vendors prior to publication and controlling the release of generated prompts.
Conclusion
GPTFuzzer provides a systematic and automated method for discovering jailbreak prompts in LLMs, leveraging principles from software fuzzing. Its ability to generate highly effective and transferable prompts, surpassing manual efforts especially for well-aligned models, establishes it as a valuable tool for LLM robustness assessment and safety research. The framework highlights the persistent challenges in LLM alignment and motivates further development of both automated red-teaming techniques and more robust defense mechanisms.