GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (2309.10253v4)

Published 19 Sep 2023 in cs.AI

Abstract: LLMs have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

The paper "GPTFUZZER: Red Teaming LLMs with Auto-Generated Jailbreak Prompts" (Yu et al., 2023 ) introduces an automated framework, GPTFuzzer, for generating adversarial jailbreak prompts to evaluate the safety alignment of LLMs. This approach adapts principles from traditional software fuzzing, specifically AFL (American Fuzzy Lop), to the domain of LLM red teaming, addressing the limitations of manual prompt crafting which struggles with scalability, labor intensity, coverage, and adaptability against rapidly evolving LLMs.

Problem Formulation and Motivation

LLMs, despite undergoing safety training (e.g., via RLHF), remain susceptible to jailbreak attacks where carefully constructed prompts circumvent safety protocols, eliciting harmful or unethical content. The prevailing method for discovering such vulnerabilities relies on manual prompt engineering. This manual process is inherently inefficient for systematic red teaming due to the vastness of the prompt space, the diversity of LLMs, and their frequent updates. The paper posits that an automated, black-box approach is necessary for scalable and efficient identification of jailbreak vulnerabilities. GPTFuzzer is proposed as such a solution, automating the generation and testing of jailbreak prompts without requiring access to the target LLM's internal parameters or architecture.

The GPTFuzzer Framework

GPTFuzzer operates based on an evolutionary fuzzing loop, analogous to AFL, aiming to discover effective jailbreak prompts through iterative mutation and selection. The core workflow proceeds as follows:

  1. Initialization: The process starts with a corpus of human-written jailbreak templates sourced from public repositories; these serve as the initial seeds for the fuzzing process. In total, 77 universal, single-turn templates (using placeholders like [INSERT PROMPT HERE]) were collected and curated.
  2. Seed Selection: An algorithm selects a seed template from the current pool for mutation. GPTFuzzer employs a strategy called MCTS-Explore, derived from Monte Carlo Tree Search, designed to balance exploitation (favoring seeds with high historical success rates) and exploration (sampling less-tested or novel seeds). This strategy aims to overcome limitations of simpler methods like Random, Round Robin, or standard UCB, particularly in avoiding premature convergence on suboptimal solutions.
  3. Mutation: The selected seed template is mutated using another LLM (ChatGPT with temperature=1.0 in the experiments) to generate a new candidate template. This ensures linguistic coherence and semantic relevance. Several mutation operators are defined to introduce diversity.
  4. Execution: The mutated template is combined with a specific harmful query (drawn from a predefined set) to form a complete jailbreak prompt. This prompt is then sent to the target LLM.
  5. Judgment: The LLM's response is evaluated by an automated Judgment Model to determine if the jailbreak attempt was successful. Success is categorized as "Full Compliance" (harmful content provided without reservation) or "Partial Compliance" (harmful content provided with warnings).
  6. Feedback: If the judgment model classifies the response as a successful jailbreak, the corresponding mutated template is added to the seed pool, potentially being selected for future mutation rounds. Unsuccessful templates are discarded.
  7. Iteration: Steps 2-6 are repeated until a predefined condition, such as a query budget limit, is met.

This black-box methodology makes GPTFuzzer applicable to a wide range of LLMs, including proprietary, closed-source models accessible only via APIs.
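
To make the workflow concrete, here is a minimal Python sketch of the fuzzing loop described above. The helper functions (select_seed, mutate, query_target, judge) and the query budget are illustrative placeholders standing in for the paper's components, not the authors' released implementation.

```python
def fuzz(initial_templates, questions, select_seed, mutate, query_target, judge,
         max_queries=10_000):
    """Illustrative GPTFuzzer-style loop: select, mutate, execute, judge, feed back.

    select_seed(pool)    -> template chosen for mutation (e.g., MCTS-Explore)
    mutate(template)     -> new candidate template (produced by an auxiliary LLM)
    query_target(prompt) -> target LLM response (black-box API call)
    judge(response)      -> True if the response is judged a successful jailbreak
    """
    PLACEHOLDER = "[INSERT PROMPT HERE]"
    pool = list(initial_templates)        # 1. initialization with human-written seeds
    successes, queries = [], 0

    while queries < max_queries:          # 7. iterate until the query budget is exhausted
        seed = select_seed(pool)          # 2. seed selection
        candidate = mutate(seed)          # 3. mutation via the auxiliary LLM
        worked = False
        for question in questions:        # 4. execution against the target LLM
            prompt = candidate.replace(PLACEHOLDER, question)
            response = query_target(prompt)
            queries += 1
            if judge(response):           # 5. judgment by the classifier
                worked = True
                successes.append((candidate, question))
        if worked:                        # 6. feedback: keep successful templates
            pool.append(candidate)
    return successes
```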

Key Components and Implementation Details

GPTFuzzer's efficacy relies on three core technical components:

  • Seed Selection (MCTS-Explore): Traditional fuzzing often struggles with seed scheduling. MCTS-Explore adapts the UCB rule within an MCTS framework, maintaining a tree structure that records the lineage of templates derived through mutation. Each node stores statistics such as its visit count (N) and success count (V). The selection score for a node i with parent p is calculated as:

    Score_i = \frac{V_i}{N_i} + c \sqrt{\frac{\ln N_p}{N_i}}

    where c is an exploration constant. MCTS-Explore introduces modifications to encourage broader exploration, such as prioritizing non-leaf nodes and preventing over-selection of specific branches, thereby improving the discovery of diverse and effective jailbreak strategies (a minimal sketch of this scoring rule follows the list).

  • Mutation Operators: Five distinct mutation operators, executed by an auxiliary LLM, are employed to generate new templates from seeds:
    • Generate: Creates a new template inspired by the seed's style but with different content.
    • Crossover: Merges parts of two different seed templates to create a hybrid.
    • Expand: Adds new sentences to the beginning of a template.
    • Shorten: Condenses sentences within a template for conciseness.
    • Rephrase: Modifies sentence structure and phrasing while preserving the core meaning.
    Together, these operators allow exploration of various stylistic and structural modifications to the initial templates.
  • Judgment Model: Accurate automatic assessment of jailbreak success is critical. Rule-based methods and existing APIs (such as OpenAI's Moderation API) were found inadequate, and using LLMs (ChatGPT, GPT-4) as judges proved slow and less accurate. The authors instead developed a specialized judgment model by fine-tuning RoBERTa-large on a dataset of LLM responses generated from the initial seeds and manually labeled for compliance (Full/Partial Compliance vs. Full/Partial Refusal). This fine-tuned classifier achieved 96.16% accuracy, outperforming the other methods in both accuracy (higher TPR, lower FPR) and inference speed, making it suitable for the high-throughput demands of fuzzing (a usage sketch follows this list).
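
As a concrete illustration of the selection score above, the following sketch computes the UCB-style value for each child node in the mutation tree; the Node structure and the exploration constant are illustrative assumptions, and the sketch omits MCTS-Explore's additional modifications (non-leaf selection and branch dampening).

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One jailbreak template in the mutation tree (illustrative structure)."""
    template: str
    visits: int = 0       # N_i: how many times this seed has been selected
    successes: int = 0    # V_i: how many selections led to a successful jailbreak
    children: list = field(default_factory=list)

def ucb_score(child: Node, parent: Node, c: float = 1.4) -> float:
    """Score_i = V_i / N_i + c * sqrt(ln(N_p) / N_i); untried nodes are scored first."""
    if child.visits == 0:
        return float("inf")
    exploit = child.successes / child.visits
    explore = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return exploit + explore

def select_child(parent: Node) -> Node:
    """Plain UCB step: pick the child with the highest selection score."""
    return max(parent.children, key=lambda ch: ucb_score(ch, parent))
```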

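The judgment step can be reproduced with any RoBERTa-large sequence classifier fine-tuned on labeled responses, for example via the Hugging Face transformers pipeline. The checkpoint path and the label names below are placeholders, not the authors' released artifacts.

```python
from transformers import pipeline

# Placeholder path to a RoBERTa-large classifier fine-tuned to label LLM responses
# as full/partial compliance vs. full/partial refusal (not the paper's released model).
JUDGE_CHECKPOINT = "path/to/finetuned-roberta-large-judge"

judge_pipeline = pipeline("text-classification", model=JUDGE_CHECKPOINT)

def judge(response: str) -> bool:
    """Treat full or partial compliance as a successful jailbreak (assumed label names)."""
    label = judge_pipeline(response, truncation=True)[0]["label"]
    return label in {"full_compliance", "partial_compliance"}
```
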
Experimental Evaluation and Results

GPTFuzzer was rigorously evaluated against a diverse set of 12 LLMs, including commercial models (ChatGPT, GPT-4, Bard, Claude2, PaLM2) and open-source models (Llama-2 variants, Vicuna variants, Baichuan, ChatGLM2), using a dataset of 46 harmful questions.

  • Baseline Human Templates (RQ1): Initial tests revealed that while human-written templates were effective against less robust models such as Vicuna-7B (99% Top-1 Attack Success Rate, ASR), they performed poorly against better-aligned models such as Llama-2-7B-Chat (20% Top-1 ASR). This highlighted the need for automated generation. (The Top-k ASR metric is illustrated in the sketch after this list.)
  • GPTFuzzer Efficacy (RQ2):
    • Against Llama-2-7B-Chat, focusing on the 46 questions failed by all human templates, GPTFuzzer (using Top-5 generated seeds) successfully jailbroke all 46 questions within an average of ~23 queries per question.
    • Even when initialized with "invalid" seeds (templates that failed against Llama-2), GPTFuzzer successfully evolved effective jailbreak prompts.
    • In multi-question attacks on Llama-2-7B, GPTFuzzer significantly outperformed the best human templates, achieving 60% Top-1 ASR and 87% Top-5 ASR, compared to 20% Top-1 and 47% Top-5 for human templates.
    • Against ChatGPT, starting from invalid seeds, GPTFuzzer generated a template achieving 100% Top-1 ASR.
  • Universality and Transferability (RQ3): Templates generated by running GPTFuzzer simultaneously against ChatGPT, Llama-2-7B, and Vicuna-7B showed strong transferability to unseen models and questions. The Top-5 generated templates significantly outperformed baseline methods (GCG, Human-Written, Masterkey) across the board. High Top-5 ASRs were observed on numerous models: 100% on Vicuna-7B/13B, Baichuan-13B, ChatGPT; >90% on ChatGLM2-6B, Claude2, PaLM2; ~80% on Llama-2-70B-Chat; >60% on Bard and GPT-4. This demonstrates that GPTFuzzer can discover prompts with broad applicability.
  • Component Analysis (RQ4): Ablation studies validated the design choices. The MCTS-Explore strategy consistently outperformed Random, Round Robin, and UCB selection strategies in terms of achieving higher ASR within the query budget. Using the full suite of five mutation operators yielded the best results compared to using single operators, with Crossover being the most impactful individual operator. While initial seed quality affected efficiency, GPTFuzzer demonstrated robustness by succeeding even with suboptimal initial seeds.
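
To clarify the Top-k ASR metric reported above, here is a minimal sketch under the assumption that success is recorded per (template, question) pair: a question counts toward Top-k ASR if any of the k individually best templates jailbreaks it.

```python
def top_k_asr(results: dict[str, dict[str, bool]], k: int) -> float:
    """Top-k ASR: fraction of questions jailbroken by at least one of the k templates
    with the highest individual success rates (assumed definition for illustration).

    results maps template -> {question: jailbreak succeeded?}.
    """
    questions = next(iter(results.values())).keys()
    # Rank templates by their individual success counts and keep the best k.
    ranked = sorted(results, key=lambda t: sum(results[t].values()), reverse=True)[:k]
    hit = sum(any(results[t][q] for t in ranked) for q in questions)
    return hit / len(questions)

# Toy usage: two templates over three questions.
example = {
    "template_a": {"q1": True, "q2": False, "q3": False},
    "template_b": {"q1": False, "q2": True, "q3": False},
}
print(top_k_asr(example, k=1))  # 0.33... (best single template)
print(top_k_asr(example, k=2))  # 0.66... (union of the two templates)
```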

Discussion and Limitations

GPTFuzzer represents a significant advancement in automating LLM red teaming. Its black-box, fuzzing-based approach enables scalable and efficient discovery of jailbreak vulnerabilities across diverse LLMs. The high success rates and strong transferability of the generated prompts, particularly against robust models like Llama-2 and ChatGPT, underscore the framework's effectiveness.

However, the paper acknowledges several limitations:

  • Dependency on Initial Seeds: The framework relies on human-written templates as initial seeds, potentially limiting the exploration space to variations of known jailbreak patterns. It does not generate entirely novel attack vectors from scratch.
  • Template-Only Mutation: Mutations are applied only to the template structure, not the harmful query itself.
  • Imperfect Judgment: The judgment model, while accurate (96.16%), is not perfect and can misclassify responses, potentially impacting the fuzzing feedback loop.
  • Query Cost: Like most fuzzing approaches, GPTFuzzer can be query-intensive, potentially incurring significant costs when targeting commercial API-based LLMs.

The authors followed responsible disclosure practices by notifying LLM vendors prior to publication and controlling the release of generated prompts.

Conclusion

GPTFuzzer provides a systematic and automated method for discovering jailbreak prompts in LLMs, leveraging principles from software fuzzing. Its ability to generate highly effective and transferable prompts, surpassing manual efforts especially for well-aligned models, establishes it as a valuable tool for LLM robustness assessment and safety research. The framework highlights the persistent challenges in LLM alignment and motivates further development of both automated red-teaming techniques and more robust defense mechanisms.

Authors (4)
  1. Jiahao Yu
  2. Xingwei Lin
  3. Zheng Yu
  4. Xinyu Xing