- The paper introduces AdaPPA, a novel jailbreak attack framework that uses adaptive position pre-fill strategies to bypass LLM safety measures.
- The method integrates TF-IDF-based query rewriting and low-rank training to enhance attack efficacy, achieving an attack success rate (ASR) of up to 98%.
- Experimental results show significant improvements over state-of-the-art (SOTA) techniques, exposing weaknesses in shallow alignment defenses across multiple LLMs.
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
The paper "AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs" introduces an innovative method for executing jailbreak attacks on LLMs by exploiting the models' capabilities to follow instructions. This research proposes a novel framework, AdaPPA, which significantly enhances the attack success rate compared to existing methods, revealing vulnerabilities in LLMs, particularly through the shallow alignment phenomenon.
Introduction to Jailbreak Attacks
LLMs have been pivotal to the advancement of AI, as exemplified by the GPT and Llama series. However, these models remain susceptible to generating unsafe content, which poses risks when they are deployed in real-world applications. Jailbreak attacks manipulate an LLM's input to shift its output distribution toward malicious content. Traditional methods focus on semantic-level attacks, which the models' alignment mechanisms often detect because those attacks ignore how unevenly the model is protected across different positions of its output (Figure 1).
Figure 1: Attack prompt structure.
Pre-Fill Prompt Structure
The AdaPPA framework capitalizes on shallow alignment, the phenomenon whereby a model's safety behavior is concentrated near the start of its response. Pre-filling the response with a safe answer creates a false sense of completion, lowering the model's guard and making it easier to elicit harmful content. The attack prompt structure illustrates this mechanism: malicious content follows a pre-filled safe narrative, exploiting the model's propensity for narrative shifts (Figure 2).
Figure 2: Impact of different pre-fill strategies on the Attack Success Rate (ASR) for ChatGLM3-6b.
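To make the pre-fill structure concrete, here is a minimal sketch of how such a prompt might be assembled. The function name and template strings are hypothetical illustrations, not the paper's exact prompts.

```python
# Minimal sketch of a position pre-fill attack prompt (illustrative only;
# the template below is a hypothetical stand-in for the paper's prompts).
def build_prefill_prompt(rewritten_question: str,
                         safe_prefill: str,
                         harmful_prefix: str) -> str:
    """Assemble a prompt that pre-fills part of the model's own answer.

    The safe pre-fill creates a false sense of completion; the harmful
    prefix then cues the narrative shift the model tends to continue.
    """
    return (
        f"{rewritten_question}\n\n"
        f"Assistant: {safe_prefill}\n"   # pre-filled safe narrative
        f"{harmful_prefix}"              # cue for the harmful continuation
    )
```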
The AdaPPA Framework
At the core of AdaPPA is the pre-fill strategy, which constructs attack prompts from various pre-fill combinations that disrupt the model's typical protective response sequence. The framework follows a three-step process: low-rank training of a content-generation model, integration of adversarial signals, and evaluation against black-box models (Figure 3).
Figure 3: Overview of AdaPPA framework.
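As a rough illustration of the low-rank training step, the sketch below attaches LoRA adapters to a causal LM with Hugging Face's peft library. The base model, rank, and target modules are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of low-rank (LoRA) fine-tuning for the content-generation
# model. Model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "lmsys/vicuna-7b-v1.5"  # assumed base model
base = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
# ...standard causal-LM fine-tuning on (question, pre-fill response) pairs...
```

Training only the low-rank adapter matrices keeps the cost of specializing the generation model small relative to full fine-tuning.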
Problem Rewrite
The problem-rewrite module modifies the original queries so that target models do not recognize their harmful intent. It applies TF-IDF-based similarity transformations that preserve a query's semantics while altering its surface form (Figure 4).
Figure 4: Problem rewriting structure.
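A minimal sketch of a TF-IDF similarity check for this module, assuming a scikit-learn implementation, is shown below. The selection band is an assumption: the idea is to accept rewrites that stay topically related to the original while changing its wording.

```python
# Sketch of a TF-IDF similarity filter for candidate rewrites (assumed
# scikit-learn implementation; thresholds are illustrative, not the paper's).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(original: str, rewrite: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two texts."""
    tfidf = TfidfVectorizer().fit_transform([original, rewrite])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def select_rewrite(original, candidates, lo=0.2, hi=0.8):
    """Keep a rewrite that stays topically related (sim >= lo) while
    altering the surface form (sim <= hi). The band is an assumption."""
    for cand in candidates:
        if lo <= tfidf_similarity(original, cand) <= hi:
            return cand
    return None
```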
Pre-fill Generation and Prompt Combination
Pre-fill content is generated adaptively, combining safe and harmful responses to target the positions where the model's defenses are weakest. The study validates that combinations containing both response types significantly improve attack efficacy (Table \ref{tab:2}).
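As an illustration of the combination step, the sketch below enumerates a few safe/harmful pre-fill layouts to probe against the target; the concrete combinations the paper evaluates are those in Table \ref{tab:2}, so treat these orderings as hypothetical examples.

```python
# Illustrative sketch of pre-fill combination: enumerate candidate layouts
# of safe and harmful pre-fill segments to probe against the target model.
def candidate_prompts(question: str, safe: str, harmful: str) -> list[str]:
    """Build attack prompts with different safe/harmful pre-fill layouts."""
    layouts = [
        [safe],            # safe pre-fill only
        [harmful],         # harmful pre-fill only
        [safe, harmful],   # safe narrative first, then harmful continuation
        [harmful, safe],   # reversed ordering
    ]
    return [question + "\n" + "\n".join(parts) for parts in layouts]
```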
Experimental Evaluation
Evaluating AdaPPA against ten black-box models illustrates its robustness: the method substantially improves attack success rates over state-of-the-art (SOTA) techniques (Table \ref{tab:1}). For example, the ASR reaches up to 98% against models such as ChatGLM3 and Vicuna, a notable 47% improvement over competing techniques.
Table \ref{tab:1} compares AdaPPA with TAP, PAP, and FuzzLLM, demonstrating its effectiveness in uncovering vulnerabilities at scale.
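For reference, the ASR metric itself is simply the fraction of attack attempts judged successful. The sketch below assumes a hypothetical `judge` callable standing in for whatever harmfulness classifier the evaluation pipeline uses.

```python
# Minimal sketch of the Attack Success Rate (ASR) metric. `judge` is a
# hypothetical stand-in for the evaluation's harmfulness classifier.
def attack_success_rate(responses, judge) -> float:
    """Fraction of responses the judge labels as a successful attack."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if judge(r)) / len(responses)

# Example with a crude keyword-based judge (illustration only):
# asr = attack_success_rate(outputs, judge=lambda r: "I cannot" not in r)
```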
Conclusion
AdaPPA advances jailbreak methodology by taking an adaptive approach to position pre-filling, markedly improving the discovery of LLM vulnerabilities. Its gains in attack success rate underscore the importance of addressing shallow-alignment weaknesses. The research points the way toward more robust LLM defenses and highlights the need for continuous model evaluation to mitigate potential security risks.
Future work may integrate AdaPPA with other adversarial techniques to further improve its efficacy against diverse LLM architectures. The findings could also inform improvements to alignment strategies and model-training protocols.