PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs

Published 23 Sep 2024 in cs.CR and cs.AI | (2409.14866v5)

Abstract: LLMs have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework called PAPILLON, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, PAPILLON starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated PAPILLON on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 Turbo, GPT-4, and Gemini-Pro, PAPILLON achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, PAPILLON can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, PAPILLON can achieve over 78% attack success rate even with 100 tokens. Moreover, PAPILLON demonstrates transferability and is robust to state-of-the-art defenses. Code: https://github.com/aaFrostnova/Papillon

Summary

The paper presents a novel fuzz testing-powered framework that automates black-box jailbreak attacks on LLMs.
It employs three mutation techniques (role-play, contextualization, and expansion) to generate concise, semantically coherent jailbreak prompts.
Experimental results show attack success rates that exceed existing state-of-the-art baselines by more than 60% on proprietary LLM APIs.

Introduction

The paper "PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs" (2409.14866) introduces a novel automated framework designed to conduct black-box jailbreaking attacks on LLMs. As LLMs have become pervasive in various real-world applications, the security concerns associated with their potential misuse for generating harmful content have similarly risen. This research addresses the vulnerabilities of LLMs by employing a fuzz testing-inspired method for generating concise and semantically coherent jailbreak prompts, thereby overcoming limitations of existing approaches that rely on manually crafted or excessively lengthy prompts.

Background and Motivation

LLMs such as GPT-3.5 and GPT-4, while highly capable, are prone to jailbreaking attacks in which malicious prompts induce harmful or policy-violating outputs. Existing jailbreak methods either depend heavily on fixed, manually crafted templates or generate incoherent prompts that are easy to detect. Many of them also produce long prompts, which raises query costs because LLM APIs typically bill by the number of tokens processed.

Methodology

Framework Overview

The proposed framework proceeds in two stages: pre-jailbreak and final-jailbreak. The pre-jailbreak stage starts without any pre-existing templates; instead, it generates initial prompts from scratch to build a seed pool. A set of mutation strategies is then applied to evolve these seed prompts into effective jailbreak templates.

Figure 1: Overview attack procedure of PAPILLON, showcasing the fuzz testing-enabled jailbreaking attack with novel mutation strategies and a double-tiered judgment module.

Mutation Techniques

Three mutation techniques are used to keep the generated prompts short and coherent while still effective at bypassing model defenses:

Role-play: Constructs prompts by inducing the LLM to adopt a character role, thereby embedding the harmful query in a seemingly benign narrative.
Contextualization: Creates scenarios such as science-fiction or dystopian settings that envelop the harmful prompt, so that it blends plausibly into the surrounding narrative.
Expand: Utilizes concise and semantically meaningful expansions of existing templates, adapting them for new queries more efficiently.

These techniques collectively enable the generation of high-efficacy prompts while reducing the reliance on extensive manual prompt crafting.
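The seed-pool loop described above has the shape of a generic fuzzing harness, and a minimal, abstract sketch may make that shape easier to follow. Everything here is an illustrative assumption rather than the authors' implementation: the function name `fuzz_one_question`, the callable signatures, the uniform random selection of seeds and mutators, and the choice to store successful candidates back in the pool. The mutators, target, and judge are passed in as opaque callables, and the demo uses harmless string stand-ins.

```python
import random
from typing import Callable, List, Optional

# Hypothetical signatures, chosen for this sketch only.
Mutator = Callable[[str, str], str]   # (seed, question) -> candidate prompt
Target = Callable[[str], str]         # candidate prompt -> response
Judge = Callable[[str, str], bool]    # (question, response) -> success?


def fuzz_one_question(
    question: str,
    mutators: List[Mutator],
    target: Target,
    judge: Judge,
    seed_pool: List[str],
    budget: int,
    rng: random.Random,
) -> Optional[str]:
    """Try up to `budget` mutated candidates; return the first one the judge accepts."""
    for _ in range(budget):
        # An empty pool is allowed: the mutator then starts from an empty seed.
        seed = rng.choice(seed_pool) if seed_pool else ""
        mutate = rng.choice(mutators)
        candidate = mutate(seed, question)
        if judge(question, target(candidate)):
            # Successful candidates become seeds for later questions.
            seed_pool.append(candidate)
            return candidate
    return None


if __name__ == "__main__":
    # Toy stand-ins: the "target" accepts only inputs carrying the tag "[b]".
    toy_mutators = [lambda s, q, t=t: f"{s}{q}{t}" for t in ("[a]", "[b]", "[c]")]
    toy_target = lambda prompt: "accepted" if "[b]" in prompt else "rejected"
    toy_judge = lambda q, r: r == "accepted"
    pool: List[str] = []
    found = fuzz_one_question("q1", toy_mutators, toy_target, toy_judge,
                              pool, budget=50, rng=random.Random(0))
    print(found, pool)
```

Passing the pool in as a mutable list lets one call start from an empty pool and later calls reuse whatever earlier calls discovered, which mirrors the empty-seed-pool starting point described in the text.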

Two-Level Judgment Module

To validate the success of generated jailbreaks, PAPILLON integrates a two-level judge module:

A fine-tuned RoBERTa model detects malicious content, and
An LLM-based judge then checks whether the response actually addresses the query, so that only genuine jailbreaks are counted as successes.
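The way the two levels combine can be sketched as a short-circuiting conjunction. This is a sketch under stated assumptions, not the paper's code: `harm_score` stands in for the fine-tuned RoBERTa classifier, `answers_query` stands in for the LLM-based judge, and the `threshold` of 0.5 is an arbitrary placeholder.

```python
from typing import Callable


def two_level_judge(
    question: str,
    response: str,
    harm_score: Callable[[str], float],         # stand-in for the classifier (assumed to return a score in [0, 1])
    answers_query: Callable[[str, str], bool],  # stand-in for the LLM-based judge
    threshold: float = 0.5,                     # hypothetical cut-off, not from the paper
) -> bool:
    """Count a response as a success only if both levels agree."""
    # Level 1: cheap classifier. Anything it scores below the threshold is rejected
    # immediately, so the more expensive second level is never called for it.
    if harm_score(response) < threshold:
        return False
    # Level 2: check that the response actually addresses the original query.
    return answers_query(question, response)
```

Running the cheap classifier first means the LLM judge is only invoked on responses that already look like candidates, which keeps the number of judge queries down.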

Experimental Results

PAPILLON is evaluated on seven representative LLMs, including GPT-3.5 Turbo, GPT-4, and Gemini-Pro, and compared with five state-of-the-art jailbreaking attacks. On these three proprietary models it reports attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60% while keeping query usage efficient. Prompt length stays low as well: against GPT-4, the attack success rate remains above 78% with prompts of 100 tokens.
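For readers unfamiliar with the metric, attack success rate is the fraction of test questions for which an attack was judged successful. The sketch below computes it and also shows two common ways to express a gap between two rates; the text does not say which one the "more than 60%" figure uses. The outcome lists are made-up illustration data, not results from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of questions whose attack was judged successful."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def percentage_point_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (rate_a - rate_b) * 100


def relative_gain(rate_a: float, rate_b: float) -> float:
    """Gain of rate_a over rate_b as a fraction of rate_b."""
    return (rate_a - rate_b) / rate_b


# Made-up outcomes for ten questions each (illustration only).
method = [True] * 9 + [False]        # 9 of 10 judged successful
baseline = [True] * 3 + [False] * 7  # 3 of 10 judged successful
asr_method = attack_success_rate(method)
asr_baseline = attack_success_rate(baseline)
print(round(percentage_point_gap(asr_method, asr_baseline), 1))
print(round(relative_gain(asr_method, asr_baseline), 2))
```

The two measures can differ widely for the same pair of rates, so it helps to state which one is meant when comparing methods.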

Implications and Future Work

PAPILLON offers significant practical implications by highlighting the vulnerabilities in LLMs, urging improvements in their safety mechanisms. The strategic generation of shortened prompts emphasizes cost-effective ways to test LLM robustness. It also highlights areas for improving defenses, such as refining perplexity-based detection methods to withstand sophisticated semantic manipulations.
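As background on the defense mentioned above, a perplexity-based detector typically scores a prompt by the exponential of its average negative token log-likelihood under a reference language model and flags prompts whose score exceeds a threshold. The sketch below assumes the per-token natural-log probabilities are already available from some model; the function names and the threshold are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability (natural log) over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_flagged(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds a (hypothetical) threshold."""
    return perplexity(token_logprobs) > threshold
```

Because fluent, semantically coherent text tends to receive high token probabilities, it scores low perplexity, which is why a filter of this kind is better suited to catching garbled prompts than coherent ones.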

Promising directions for future research include extending the mutation strategies with more diverse operators and evaluating robustness across multilingual systems, which could broaden the framework's applicability to a wider range of LLMs.

Conclusion

The paper presents a compelling approach to efficiently jailbreaking LLMs using a fuzz testing-powered framework, which balances stealth, cost efficiency, and the ability to bypass state-of-the-art defenses. PAPILLON underscores the need for enhanced security measures and offers insights into developing robust defense mechanisms for the evolving landscape of LLMs.
