This paper, "Attacking LLMs with Projected Gradient Descent" (Geisler et al., 14 Feb 2024), introduces a computationally efficient method for crafting adversarial prompts for LLMs by adapting Projected Gradient Descent (PGD). The core idea is to optimize a continuous relaxation of the input token sequence, which allows the use of gradient-based methods, and then project the result back into a space conducive to finding effective discrete token sequences. This approach aims to overcome the high computational cost associated with state-of-the-art discrete optimization methods like GCG [zou_universal_2023].
The problem addressed is the high cost of finding adversarial examples that can "jailbreak" aligned LLMs. Existing effective attacks often require tens or hundreds of thousands of LLM queries, making them expensive for applications like large-scale evaluation or adversarial training. The authors propose using PGD on a continuous representation of the input prompt, which is a common technique in attacking models in continuous domains like image classification.
The practical implementation of this PGD approach involves several key steps:
- Continuous Relaxation: The discrete one-hot encoding $x \in \{0,1\}^{L \times |V|}$ of the input token sequence (where $V$ is the vocabulary and $L$ is the sequence length) is relaxed to a continuous representation $\tilde{x} \in [0,1]^{L \times |V|}$. Each row of $\tilde{x}$ satisfies the probabilistic simplex constraint $\sum_{v \in V} \tilde{x}_{i,v} = 1$ with $\tilde{x}_{i,v} \geq 0$ for each token position $i$. This allows gradients to be computed with respect to $\tilde{x}$.
- Gradient Calculation: The gradient of the attack objective $\ell$ with respect to the continuous input $\tilde{x}$ is computed. The objective is typically a cross-entropy loss measuring the likelihood of a harmful target response, potentially with additional terms like a low-perplexity reward for the adversarial suffix.
- Gradient Update: The continuous representation is updated with a gradient step $\tilde{x}_t = \tilde{x}_{t-1} - \alpha \nabla_{\tilde{x}} \ell(f_\theta(\tilde{x}_{t-1}))$, where $\alpha$ is the learning rate. The authors use the Adam optimizer in their implementation.
- Simplex Projection: After the gradient update, the continuous representation is projected back onto the probabilistic simplex for each token position, ensuring that the entries are non-negative and sum to 1. The projection algorithm involves sorting and is closely related to projection onto the $\ell_1$ ball, with a time complexity of $\mathcal{O}(|V| \log |V|)$ per token position.
- Entropy Projection: A novel step introduced here is the entropy projection, which uses the Gini index (Tsallis entropy with entropic index $q = 2$) to control the entropy of each token's distribution. This projection encourages sparsity, pushing the continuous distribution towards a one-hot encoding and thus bridging the gap between the continuous optimization and the discrete token space. It maps each row onto a hypersphere whose radius is defined by the Gini-index target, followed by a simplex projection; the procedure runs in $\mathcal{O}(|V| \log |V|)$ per token position and helps find discrete solutions more effectively than previous gradient-based methods (a sketch of both projections follows this list).
- Flexible Sequence Length (Optional): To allow for the addition or removal of tokens in the adversarial suffix, an additional relaxation $m \in [0,1]^{L}$ is introduced. This continuous mask is logarithmically transformed and added to the causal attention mask of the LLM. Optimizing $m$ allows smoothly masking tokens out (as $m_i \to 0$) or adding them back in (as $m_i \to 1$) from the perspective of the attention mechanism. This provides additional flexibility for the attack.
- Discretization and Evaluation: Periodically, the continuous representation is discretized by taking the $\arg\max$ over the vocabulary dimension for each token position. The resulting discrete token sequence is then evaluated with the original LLM and objective $\ell$. The best discrete sequence found so far is tracked as $x_{\text{best}}$.
- Scheduling and Reinitialization: Learning rate and entropy projection parameters are scheduled, e.g., using linear ramp-up and cosine annealing with warm restarts. The attack can be reinitialized to the best intermediate solution if no improvement is seen for a configurable number of iterations, helping to escape local optima.
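Since the two projection steps are the core of the method, a minimal PyTorch sketch is given below. The simplex projection follows the standard sort-based algorithm; the entropy projection is an illustrative reconstruction of the description above (rescale around the uniform distribution onto the hypersphere matching the Gini-index target, then simplex-project), not the authors' exact implementation, and the function names are ours.

```python
import torch

def simplex_projection(x: torch.Tensor) -> torch.Tensor:
    """Project each row of x onto the probability simplex (sort-based, O(d log d) per row)."""
    d = x.shape[-1]
    u, _ = torch.sort(x, dim=-1, descending=True)
    cssv = u.cumsum(dim=-1) - 1.0                       # cumulative sum minus the simplex "budget" of 1
    k = torch.arange(1, d + 1, device=x.device, dtype=x.dtype)
    cond = u - cssv / k > 0                             # entries that remain positive after the shift
    rho = cond.to(x.dtype).cumsum(dim=-1).argmax(dim=-1, keepdim=True)  # last index satisfying cond
    theta = cssv.gather(-1, rho) / (rho + 1).to(x.dtype)
    return torch.clamp(x - theta, min=0.0)

def entropy_projection(x: torch.Tensor, gini_target: float) -> torch.Tensor:
    """Hedged sketch: push rows of x (already on the simplex) towards a target Gini index
    1 - sum_i x_i^2 by rescaling their deviation from the uniform distribution onto the
    corresponding hypersphere, then re-projecting onto the simplex."""
    d = x.shape[-1]
    center = torch.full_like(x, 1.0 / d)
    # Radius (around the uniform distribution) at which the Gini index equals the target.
    radius = max(1.0 - gini_target - 1.0 / d, 0.0) ** 0.5
    dev = x - center
    norm = dev.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    # Only move rows that are too flat (Gini above target); sparse-enough rows stay put.
    scale = torch.where(norm < radius, radius / norm, torch.ones_like(norm))
    return simplex_projection(center + scale * dev)
```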
The overall PGD algorithm follows an iterative process as outlined in Algorithm 1 of the paper.
```
Algorithm: Projected Gradient Descent for LLM Attack
Input:  LLM f_theta, initial prompt x_0 (discrete), loss l, learning rate alpha, epochs E
Output: Best adversarial prompt x_best (discrete)

1. Initialize relaxed one-hot encoding x_tilde_0 from x_0.
2. Initialize x_best = x_0.
3. For t = 1 to E:
   a. Compute gradient: grad = nabla_{x_tilde_{t-1}} l(f_theta(x_tilde_{t-1}))
   b. Gradient update: x_tilde_t = x_tilde_{t-1} - alpha * grad
   c. Project onto simplex: x_tilde_t = SimplexProjection(x_tilde_t)
   d. Project onto controlled entropy (Gini): x_tilde_t = EntropyProjection(x_tilde_t, target_entropy)
   e. Discretize: x_current_discrete = argmax(x_tilde_t, axis=-1)
   f. Evaluate discrete prompt: l_current = l(f_theta(x_current_discrete))
   g. If l_current is better than l(f_theta(x_best)): x_best = x_current_discrete
   h. Apply scheduling to alpha and target_entropy.
   i. Optional: Reinitialize x_tilde_t to the one-hot encoding of x_best if a plateau is detected.
4. Return x_best
```
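Below is a hedged, PyTorch-style sketch of how this loop could look for a Hugging Face-style causal LM, using the `simplex_projection` and `entropy_projection` helpers sketched earlier. For simplicity it optimizes the entire relaxed prompt (the paper optimizes only an adversarial suffix), omits scheduling and reinitialization, and all names and hyperparameter values are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, embed_matrix, prompt_ids, target_ids, vocab_size,
               steps=500, lr=0.1, gini_target=0.3, check_every=10):
    """Hedged sketch of PGD over a relaxed one-hot prompt (single sequence, no chat template)."""
    embed_matrix = embed_matrix.detach()                 # only the relaxed prompt is optimized
    # 1. Relax the discrete prompt to a continuous one-hot matrix we can differentiate through.
    x_tilde = F.one_hot(prompt_ids, vocab_size).float().requires_grad_(True)
    opt = torch.optim.Adam([x_tilde], lr=lr)
    target_embeds = embed_matrix[target_ids]             # embeddings of the desired target response
    best_loss, best_ids = float("inf"), prompt_ids.clone()

    for step in range(steps):
        # 2. Soft embedding lookup: convex combination of embedding rows.
        prompt_embeds = x_tilde @ embed_matrix
        inputs_embeds = torch.cat([prompt_embeds, target_embeds], dim=0).unsqueeze(0)
        logits = model(inputs_embeds=inputs_embeds).logits[0]
        # 3. Cross-entropy of the target tokens, predicted from the positions just before them.
        pred = logits[prompt_ids.shape[0] - 1 : -1]
        loss = F.cross_entropy(pred, target_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # 4. Project back: simplex, then entropy control (helpers from the earlier sketch).
        with torch.no_grad():
            x_tilde.data = entropy_projection(simplex_projection(x_tilde.data), gini_target)
        # 5. Periodically discretize and evaluate the hard prompt, tracking the best one.
        if step % check_every == 0:
            with torch.no_grad():
                hard_ids = x_tilde.argmax(dim=-1)
                hard_embeds = torch.cat([embed_matrix[hard_ids], target_embeds], 0).unsqueeze(0)
                hard_logits = model(inputs_embeds=hard_embeds).logits[0]
                hard_loss = F.cross_entropy(hard_logits[prompt_ids.shape[0] - 1 : -1], target_ids)
                if hard_loss.item() < best_loss:
                    best_loss, best_ids = hard_loss.item(), hard_ids.clone()
    return best_ids, best_loss
```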
The authors benchmarked their PGD implementation against GBDA (another gradient-based attack) and GCG (a discrete optimization attack) on the "behavior" jailbreaking task using Vicuna 1.3 7B, Falcon 7B, and Falcon 7B Instruct models.
The experimental results demonstrate the practical benefits:
- Effectiveness: PGD achieves similar attack success rates (ASR) and target probabilities as GCG, which was previously considered state-of-the-art for attacking robust LLMs. For example, on Vicuna 1.3 7B, PGD reached 87% ASR at 60 seconds compared to GCG's 83%.
- Efficiency: PGD is significantly faster than GCG. The paper reports up to an order of magnitude lower computational cost to achieve the same attack effectiveness. On Vicuna 1.3 7B, PGD achieved 28.2 iterations/sec compared to GCG's 0.3 iterations/sec, largely due to the ability to parallelize gradient computation more efficiently than batching discrete search steps.
- Improvement over previous gradient methods: PGD is shown to be much more effective than GBDA, which had negligible attack success rates on these models. This highlights the importance of carefully controlling the continuous-relaxation error via the simplex and entropy projections.
Implementation considerations include:
- Hardware: Experiments were conducted on a single A100 GPU (40 GB), demonstrating that the attack is feasible on standard high-end AI hardware.
- Precision: Using half-precision (FP16 or BF16) for forward/backward passes is crucial for memory efficiency with large models, while keeping optimizer parameters in 32 bits is standard practice.
- Parallelization: The gradient-based nature of PGD allows multiple prompts to be processed in parallel on a single GPU, contributing to its speed advantage (the authors run 25 distinct prompts in parallel in their setup), whereas discrete methods like GCG spend the batch dimension on candidate substitutions for a single prompt; a sketch of this batching follows the list.
- Hyperparameter Tuning: Like other optimization methods, PGD requires tuning of learning rate, entropy projection targets, and scheduling parameters.
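As an illustration of the precision and parallelization points, the following hedged sketch batches several prompts' relaxed one-hots into one forward pass under bf16 autocast, while the relaxed parameters (and hence the Adam state) stay in fp32; all names and shapes are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def batched_relaxed_loss(model, embed_matrix, x_tilde_batch, target_embeds, target_ids):
    """Cross-entropy of the target tokens for a batch of relaxed prompts.

    x_tilde_batch: (B, L, |V|) relaxed one-hots (fp32 leaf tensors optimized by Adam).
    target_embeds: (B, T, D) embeddings of the per-prompt target strings.
    target_ids:    (B, T) target token ids. Shapes and names are illustrative.
    """
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        prompt_embeds = x_tilde_batch @ embed_matrix                       # (B, L, D) soft embeddings
        inputs_embeds = torch.cat([prompt_embeds, target_embeds], dim=1)   # append target embeddings
        logits = model(inputs_embeds=inputs_embeds).logits                 # (B, L+T, |V|)
        pred = logits[:, x_tilde_batch.shape[1] - 1 : -1]                  # positions predicting the target
    # Compute the loss in fp32 for numerical stability.
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]).float(), target_ids.reshape(-1))
```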
Ablation studies confirm that both the choice of relaxation space (enforced by simplex projection) and the entropy projection contribute significantly to PGD's effectiveness compared to GBDA. The flexible sequence length also offers benefits.
A key limitation is the white-box assumption: PGD requires access to the model's parameters and architecture to compute gradients. This makes it directly applicable for testing and red teaming models where such access is available (e.g., internal models, open-source models) but not for attacking black-box models deployed as APIs (like ChatGPT, Claude, Gemini) without gradient approximation techniques.
The research contributes a practical and efficient method for generating adversarial text examples, which can be valuable for AI developers and researchers in understanding the vulnerabilities of LLMs, performing large-scale robustness evaluations, and potentially enabling more efficient adversarial training techniques to improve model alignment and safety.