AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (2404.16873v1)

Published 21 Apr 2024 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: While recently LLMs have achieved remarkable successes, they are vulnerable to certain jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or do not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, $\sim800\times$ faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show state-of-the-art results on the AdvBench dataset, that also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

PDF Abstract

Automated Red-Teaming of LLMs through AdvPrompter: A Novel Technique for Generating Adversarial Prompts

Introduction and Background

LLMs are pivotal in advancing various AI applications due to their ability to generate text that mimic human-like understanding. While these models bring immense benefits, they also present vulnerabilities in the form of "jailbreaking attacks," where bad actors manipulate models to produce harmful, toxic, or undesirable outputs. Current approaches to generating adversarial prompts to test these vulnerabilities are either too slow, reliant on gradients from the model, or produce non-human-readable text.

Advancements in Automated Red-Teaming

This work introduces AdvPrompter, an LLM dedicated to generating human-readable, adversarial prompts aimed at breaching the security mechanisms of another LLM, referred to here as the TargetLLM.

Key Innovations:

AdvPrompter is an LLM trained specifically to automate the creation of adversarial prompts.
It utilizes a training strategy named AdvPrompterTrain, which alternates between generating high-quality target adversarial prompts and fine-tuning the AdvPrompter using these targets.
A novel method, AdvPrompterOpt, efficiently generates adversarial targets bypassing the need for computationally expensive discrete token optimization.
The method achieves fast generation of prompts that are not only effective in bypassing safety mechanisms but also remain human-readable and coherent.

Methodology

Training the AdvPrompter

The training involves a novel alternating optimization method:

AdvPrompterOpt phase: Generates target adversarial prompts that effectively trick the TargetLLM while maintaining coherence and readability.
Supervised Fine-Tuning phase: Uses the targets generated in the previous step to fine-tune AdvPrompter, improving its ability to autonomously generate adversarial prompts.

This approach enables efficient re-training cycles, enhancing the AdvPrompter's performance through iterative refinement of adversarial prompts targeted at the TargetLLM's vulnerabilities.

Numerical Results and Performance Analysis

The performance of AdvPrompter is notable:

AdvPrompter outperforms previous methods in generating human-readable adversarial prompts that effectively bypass LLM safety mechanisms.
It also demonstrates faster prompt generation capabilities compared to existing approaches, enabling multi-shot attacks which further increase success rates.
Extensive experiments across various LLMs confirm AdvPrompter’s effectiveness in both whitebox and blackbox settings, showcasing strong generalization capabilities even when tested against LLMs not used during training.

Implications and Future Work

The introduction of AdvPrompter presents several practical and theoretical implications:

Efficiency in Automated Red-Teaming: Provides a faster, automated approach to generating adversarial prompts that can adapt to different inputs and target models.
Enhancing Model Robustness: Generates data for adversarial training, potentially improving LLMs' robustness against similar attacks.
Future Research Directions: Prompts exploration into fully automated safety fine-tuning of LLMs and adapting the approach for broader applications in prompt optimization.

In conclusion, this paper’s methodologically sound approach to automating the generation of adversarial prompts presents a significant step towards understanding and mitigating vulnerabilities in LLMs. The development of AdvPrompter and its training techniques not only provides efficient tools for red-teaming LLMs but also opens new avenues for safeguarding AI models against emerging threats.