- The paper introduces AdaPPA, a novel jailbreak attack framework that uses adaptive position pre-fill strategies to bypass LLM safety measures.
- The method integrates TF-IDF-based query rewriting and low-rank training to enhance attack efficacy, achieving an attack success rate (ASR) of up to 98%.
- Experimental results show significant improvements over state-of-the-art (SOTA) techniques, exposing weaknesses in shallow alignment defenses across multiple LLMs.
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
The paper "AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs" introduces an innovative method for executing jailbreak attacks on LLMs by exploiting the models' capabilities to follow instructions. This research proposes a novel framework, AdaPPA, which significantly enhances the attack success rate compared to existing methods, revealing vulnerabilities in LLMs, particularly through the shallow alignment phenomenon.
Introduction to Jailbreak Attacks
LLMs have been pivotal to the advancement of AI, as exemplified by the GPT and Llama series. However, these models remain susceptible to generating unsafe content, which poses risks when they are deployed in real-world applications. Jailbreak attacks manipulate an LLM's input to shift its output distribution toward malicious content. Traditional methods focus on semantic-level attacks, which the models' alignment mechanisms often detect because those attacks ignore how unevenly the model is protected across different positions of its output (Figure 1).
Figure 1: Attack prompt structure.
Pre-Fill Prompt Structure
The AdaPPA framework capitalizes on shallow alignment, the phenomenon whereby a model's safety behavior is concentrated near the start of its response. Pre-filling the response with a safe answer creates a false sense of completion, lowering the model's guard and making it easier to elicit harmful content. The attack prompt structure illustrates this mechanism: malicious content follows a pre-filled safe narrative, exploiting the model's propensity for narrative shifts (Figure 2).
Figure 2: Impact of different pre-fill strategies on the Attack Success Rate (ASR) for ChatGLM3-6b.
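To make the pre-fill structure concrete, here is a minimal sketch of how such a prompt might be assembled. The function name and template strings are hypothetical illustrations, not the paper's exact prompts.

```python
# Minimal sketch of a position pre-fill attack prompt (illustrative only;
# the template below is a hypothetical stand-in for the paper's prompts).
def build_prefill_prompt(rewritten_question: str,
                         safe_prefill: str,
                         harmful_prefix: str) -> str:
    """Assemble a prompt that pre-fills part of the model's own answer.

    The safe pre-fill creates a false sense of completion; the harmful
    prefix then cues the narrative shift the model tends to continue.
    """
    return (
        f"{rewritten_question}\n\n"
        f"Assistant: {safe_prefill}\n"   # pre-filled safe narrative
        f"{harmful_prefix}"              # cue for the harmful continuation
    )
```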
The AdaPPA Framework
At the core of AdaPPA is the pre-fill strategy, which constructs attack prompts from various pre-fill combinations that disrupt the model's typical protective response sequence. The framework follows a three-step process: low-rank training of a content-generation model, integration of adversarial signals, and evaluation against black-box models (Figure 3).
Figure 3: Overview of AdaPPA framework.
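As a rough illustration of the low-rank training step, the sketch below attaches LoRA adapters to a causal LM with Hugging Face's peft library. The base model, rank, and target modules are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of low-rank (LoRA) fine-tuning for the content-generation
# model. Model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "lmsys/vicuna-7b-v1.5"  # assumed base model
base = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
# ...standard causal-LM fine-tuning on (question, pre-fill response) pairs...
```

Training only the low-rank adapter matrices keeps the cost of specializing the generation model small relative to full fine-tuning.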
Problem Rewrite
The problem-rewrite module modifies the original queries so that target models do not recognize their harmful intent. It applies TF-IDF-based similarity transformations that preserve a query's semantics while altering its surface form (Figure 4).
Figure 4: Problem rewriting structure.
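A minimal sketch of a TF-IDF similarity check for this module, assuming a scikit-learn implementation, is shown below. The selection band is an assumption: the idea is to accept rewrites that stay topically related to the original while changing its wording.

```python
# Sketch of a TF-IDF similarity filter for candidate rewrites (assumed
# scikit-learn implementation; thresholds are illustrative, not the paper's).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(original: str, rewrite: str) -> float:
    """Cosine similarity between TF-IDF vectors of the two texts."""
    tfidf = TfidfVectorizer().fit_transform([original, rewrite])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def select_rewrite(original, candidates, lo=0.2, hi=0.8):
    """Keep a rewrite that stays topically related (sim >= lo) while
    altering the surface form (sim <= hi). The band is an assumption."""
    for cand in candidates:
        if lo <= tfidf_similarity(original, cand) <= hi:
            return cand
    return None
```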
Pre-fill Generation and Prompt Combination
Pre-fill content is generated adaptively, combining safe and harmful responses to target the positions where the model's defenses are weakest. The study validates that combinations containing both response types significantly improve attack efficacy (Table \ref{tab:2}).
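As an illustration of the combination step, the sketch below enumerates a few safe/harmful pre-fill layouts to probe against the target; the concrete combinations the paper evaluates are those in Table \ref{tab:2}, so treat these orderings as hypothetical examples.

```python
# Illustrative sketch of pre-fill combination: enumerate candidate layouts
# of safe and harmful pre-fill segments to probe against the target model.
def candidate_prompts(question: str, safe: str, harmful: str) -> list[str]:
    """Build attack prompts with different safe/harmful pre-fill layouts."""
    layouts = [
        [safe],            # safe pre-fill only
        [harmful],         # harmful pre-fill only
        [safe, harmful],   # safe narrative first, then harmful continuation
        [harmful, safe],   # reversed ordering
    ]
    return [question + "\n" + "\n".join(parts) for parts in layouts]
```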
Experimental Evaluation
Evaluating AdaPPA against ten black-box models illustrates its robustness: the method substantially improves attack success rates over state-of-the-art (SOTA) techniques (Table \ref{tab:1}). For example, the ASR reaches up to 98% against models such as ChatGLM3 and Vicuna, a notable 47% improvement over competing techniques.
Table \ref{tab:1} compares AdaPPA with TAP, PAP, and FuzzLLM, demonstrating its effectiveness in uncovering vulnerabilities at scale.
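For reference, the ASR metric itself is simply the fraction of attack attempts judged successful. The sketch below assumes a hypothetical `judge` callable standing in for whatever harmfulness classifier the evaluation pipeline uses.

```python
# Minimal sketch of the Attack Success Rate (ASR) metric. `judge` is a
# hypothetical stand-in for the evaluation's harmfulness classifier.
def attack_success_rate(responses, judge) -> float:
    """Fraction of responses the judge labels as a successful attack."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if judge(r)) / len(responses)

# Example with a crude keyword-based judge (illustration only):
# asr = attack_success_rate(outputs, judge=lambda r: "I cannot" not in r)
```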
Conclusion
AdaPPA advances jailbreak methodology by taking an adaptive approach to position pre-filling, markedly improving the discovery of LLM vulnerabilities. Its gains in attack success rate underscore the importance of addressing shallow-alignment weaknesses. The research points the way toward more robust LLM defenses and highlights the need for continuous model evaluation to mitigate potential security risks.
Future work may integrate AdaPPA with other adversarial techniques to further improve its efficacy against diverse LLM architectures. The findings could also inform improvements to alignment strategies and model-training protocols.