Attacking Large Language Models with Projected Gradient Descent (2402.09154v1)

Published 14 Feb 2024 in cs.LG

Abstract: Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. To remedy this, we revisit Projected Gradient Descent (PGD) on the continuously relaxed input prompt. Although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.

This paper "Attacking LLMs with Projected Gradient Descent" (Geisler et al., 14 Feb 2024 ) introduces a computationally efficient method for crafting adversarial prompts for LLMs by adapting Projected Gradient Descent (PGD). The core idea is to optimize a continuous relaxation of the input token sequence, which allows the use of gradient-based methods, and then project the result back into a space conducive to finding effective discrete token sequences. This approach aims to overcome the high computational cost associated with state-of-the-art discrete optimization methods like GCG [zou_universal_2023].

The problem addressed is the high cost of finding adversarial examples that can "jailbreak" aligned LLMs. Existing effective attacks often require tens or hundreds of thousands of LLM queries, making them expensive for applications like large-scale evaluation or adversarial training. The authors propose using PGD on a continuous representation of the input prompt, which is a common technique in attacking models in continuous domains like image classification.

The practical implementation of this PGD approach involves several key steps:

  1. Continuous Relaxation: The discrete one-hot encoding of the input token sequence $\mathbf{x} \in \mathcal{V}^L$ (where $\mathcal{V}$ is the vocabulary and $L$ is the sequence length) is relaxed to a continuous representation $\tilde{\mathbf{x}} \in [0,1]^{L \times |\mathcal{V}|}$. This representation satisfies the probabilistic simplex constraint $\sum_{i=1}^{|\mathcal{V}|} \tilde{x}_{li} = 1$ for each token position $l = 1, \dots, L$, which allows gradients to be computed with respect to $\tilde{\mathbf{x}}$.
  2. Gradient Calculation: The gradient of the attack objective $\ell(f_\theta(\tilde{\mathbf{x}}))$ with respect to the continuous input $\tilde{\mathbf{x}}$ is computed. The objective $\ell$ is typically the cross-entropy measuring the likelihood of a harmful target response, potentially with additional terms such as a low-perplexity reward for the adversarial suffix.
  3. Gradient Update: The continuous representation is updated with a gradient step, $\tilde{\mathbf{x}}_t \leftarrow \tilde{\mathbf{x}}_{t-1} - \alpha \nabla_{\tilde{\mathbf{x}}_{t-1}} \ell(f_\theta(\tilde{\mathbf{x}}_{t-1}))$, where $\alpha$ is the learning rate. The authors use the Adam optimizer in their implementation.
  4. Simplex Projection: After the gradient update, $\tilde{\mathbf{x}}_t$ is projected back onto the probabilistic simplex for each token position, ensuring that the entries are non-negative and sum to 1. The projection involves sorting and is related to projection onto the $\ell_1$ ball, with a time complexity of $\mathcal{O}(|\mathcal{V}| \log |\mathcal{V}|)$ per token position (see the sketch after this list).
  5. Entropy Projection: A novel step is the entropy projection, which uses the Gini index ($q = 2$ Tsallis entropy) to control the entropy of each token's distribution. This projection encourages sparsity, pushing the continuous distribution toward a one-hot encoding and thus bridging the gap between the continuous optimization and the discrete token space. It projects onto a hypersphere defined by the Gini-index target, followed by another simplex projection; the procedure again has $\mathcal{O}(|\mathcal{V}| \log |\mathcal{V}|)$ complexity per token position and helps find discrete solutions more effectively than previous gradient-based methods.
  6. Flexible Sequence Length (Optional): To allow tokens to be added to or removed from the adversarial suffix, an additional relaxation $\mathbf{m} \in [0,1]^L$ is introduced. This continuous mask is log-transformed and added to the LLM's causal attention mask. Optimizing $\mathbf{m}$ smoothly masks out tokens (when $m_l = 0$) or activates them (when $m_l > 0$) from the perspective of the attention mechanism, giving the attack additional flexibility.
  7. Discretization and Evaluation: Periodically, the continuous representation $\tilde{\mathbf{x}}_t$ is discretized by taking the $\arg\max$ over the vocabulary dimension at each token position. The resulting discrete token sequence is evaluated with the original LLM and objective $\ell$, and the best sequence found so far, $\tilde{\mathbf{x}}_{\text{best}}$, is tracked.
  8. Scheduling and Reinitialization: Learning rate and entropy projection parameters are scheduled, e.g., using linear ramp-up and cosine annealing with warm restarts. The attack can be reinitialized to the best intermediate solution if no improvement is seen for a configurable number of iterations, helping to escape local optima.
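
The two projections in steps 4 and 5 can be sketched compactly. Below is a minimal PyTorch sketch under stated assumptions: the function names are illustrative (not the authors' code), `simplex_projection` is the standard sort-based Euclidean projection onto the probability simplex, and `entropy_projection` rescales each row away from the uniform distribution until its Gini index reaches a target before re-projecting onto the simplex; scheduling of the target is omitted.

```python
import torch

def simplex_projection(x: torch.Tensor) -> torch.Tensor:
    # Euclidean projection of the last dimension of x onto the probability simplex
    # (sort-based algorithm, O(|V| log |V|) per row, matching the paper's description).
    n = x.shape[-1]
    u, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = u.cumsum(dim=-1) - 1.0
    k = torch.arange(1, n + 1, device=x.device, dtype=x.dtype)
    rho = ((u - cumsum / k) > 0).sum(dim=-1, keepdim=True)   # size of the active support
    theta = cumsum.gather(-1, rho - 1) / rho.to(x.dtype)     # shift that renormalizes the support
    return torch.clamp(x - theta, min=0.0)

def entropy_projection(x: torch.Tensor, gini_target: float) -> torch.Tensor:
    # If a row's Gini index 1 - sum_i x_i^2 exceeds the target (i.e. the row is too
    # spread out), move it radially away from the uniform distribution onto the sphere
    # where the Gini index equals the target, then re-project onto the simplex.
    n = x.shape[-1]
    center = torch.full_like(x, 1.0 / n)
    radius = max(1.0 - gini_target - 1.0 / n, 0.0) ** 0.5    # ||x - uniform|| at the target
    diff = x - center
    norm = diff.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    gini = 1.0 - (x ** 2).sum(dim=-1, keepdim=True)
    scale = torch.where(gini > gini_target, radius / norm, torch.ones_like(norm))
    return simplex_projection(center + scale * diff)
```

Applied after every gradient step, these two projections are what separate this PGD formulation from a naive continuous relaxation such as GBDA.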

The overall PGD algorithm follows an iterative process as outlined in Algorithm 1 of the paper.

Algorithm: Projected Gradient Descent for LLM Attack
Input: LLM f_theta, initial prompt x_0 (discrete), loss l, learning rate alpha, epochs E
Output: Best adversarial prompt x_best (discrete)

1. Initialize relaxed one-hot encoding x_tilde_0 from x_0.
2. Initialize x_best = x_0.
3. For t = 1 to E:
    a. Compute gradient: grad = nabla_{x_tilde_{t-1}} l(f_theta(x_tilde_{t-1}))
    b. Gradient update: x_tilde_t = x_tilde_{t-1} - alpha * grad
    c. Project onto Simplex: x_tilde_t = SimplexProjection(x_tilde_t)
    d. Project onto controlled Entropy (Gini): x_tilde_t = EntropyProjection(x_tilde_t, target_entropy)
    e. Discretize: x_current_discrete = argmax(x_tilde_t, axis=-1)
    f. Evaluate discrete prompt: l_current = l(f_theta(x_current_discrete))
    g. If l_current is better than l(f_theta(x_best)):
        x_best = x_current_discrete
    h. Apply scheduling to alpha and target_entropy.
    i. Optional: Reinitialize x_tilde_t to one-hot of x_best if plateau detected.
4. Return x_best
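
For concreteness, here is a hedged PyTorch sketch of one iteration of this loop. It assumes a HuggingFace-style causal LM that accepts `inputs_embeds`, reuses `simplex_projection` and `entropy_projection` from the earlier sketch, and relaxes the whole prompt for brevity (in practice only the adversarial suffix is optimized); names and slicing details are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pgd_iteration(model, embedding_matrix, x_tilde, target_ids, optimizer, gini_target):
    # x_tilde: (L, |V|) leaf tensor on the simplex with requires_grad=True (optimized by Adam).
    # target_ids: (T,) token ids of the harmful target response.
    # The relaxed one-hot rows enter the model as soft embeddings via a matrix product,
    # which keeps the whole forward pass differentiable w.r.t. x_tilde.
    prompt_embeds = x_tilde.to(embedding_matrix.dtype) @ embedding_matrix   # (L, d)
    target_embeds = embedding_matrix[target_ids]                            # (T, d)
    inputs_embeds = torch.cat([prompt_embeds, target_embeds], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0].float()           # (L + T, |V|)

    # Attack objective: cross-entropy of the target continuation.
    T = target_ids.shape[0]
    loss = F.cross_entropy(logits[-T - 1:-1], target_ids)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Control the relaxation error: back onto the simplex, then limit the entropy.
    with torch.no_grad():
        x_tilde.data = entropy_projection(simplex_projection(x_tilde.data.float()), gini_target)

    # Discretize for evaluation; the caller tracks the best discrete prompt found so far.
    return loss.item(), x_tilde.argmax(dim=-1)
```

Periodic re-evaluation of the discretized prompt, scheduling of the learning rate and `gini_target`, and reinitialization to the best intermediate solution (steps e–i above) would wrap around this function.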

The authors benchmarked their PGD implementation against GBDA (another gradient-based attack) and GCG (a discrete optimization attack) on the "behavior" jailbreaking task using Vicuna 1.3 7B, Falcon 7B, and Falcon 7B Instruct models.

The experimental results demonstrate the practical benefits:

  • Effectiveness: PGD achieves similar attack success rates (ASR) and target probabilities as GCG, which was previously considered state-of-the-art for attacking robust LLMs. For example, on Vicuna 1.3 7B, PGD reached 87% ASR at 60 seconds compared to GCG's 83%.
  • Efficiency: PGD is significantly faster than GCG. The paper reports up to an order of magnitude lower computational cost to achieve the same attack effectiveness. On Vicuna 1.3 7B, PGD achieved 28.2 iterations/sec compared to GCG's 0.3 iterations/sec, largely due to the ability to parallelize gradient computation more efficiently than batching discrete search steps.
  • Improvement over previous gradient methods: PGD is far more effective than GBDA, which achieved negligible attack success rates on these models. This highlights the importance of carefully controlling the error introduced by the continuous relaxation via the projections.

Implementation considerations include:

  • Hardware: Experiments were conducted on a single A100 GPU (40 GB), showing the attack is feasible on standard high-end AI hardware.
  • Precision: Using half precision (FP16 or BF16) for the forward/backward passes is crucial for memory efficiency with large models, while keeping optimizer parameters in 32-bit precision is standard practice (a minimal setup sketch follows this list).
  • Parallelization: The gradient-based nature of PGD allows multiple prompts to be attacked in parallel on a single GPU (25 distinct prompts in the authors' setup), contributing to its speed advantage; discrete methods like GCG instead use their batch capacity for candidate substitutions of a single prompt.
  • Hyperparameter Tuning: Like other optimization methods, PGD requires tuning of learning rate, entropy projection targets, and scheduling parameters.
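
As a rough illustration of the hardware and precision points above, one possible setup is sketched below; the model identifier, suffix length, and learning rate are illustrative choices rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the target LLM in bfloat16 so forward/backward passes fit on a single 40 GB A100,
# while the optimized relaxation x_tilde and Adam's moments stay in 32-bit precision.
model_id = "lmsys/vicuna-7b-v1.3"   # illustrative; any open-weight model with gradient access works
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)
embedding_matrix = model.get_input_embeddings().weight        # (|V|, d)

vocab_size, suffix_len = embedding_matrix.shape[0], 20        # suffix length is illustrative
x_tilde = torch.full((suffix_len, vocab_size), 1.0 / vocab_size,
                     device="cuda", dtype=torch.float32, requires_grad=True)
optimizer = torch.optim.Adam([x_tilde], lr=0.1)               # learning rate is illustrative
```

Because each prompt contributes only one relaxed tensor, several such relaxations can be stacked into a batch and optimized in parallel on the same GPU, which is where much of the reported speed advantage over GCG comes from.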

Ablation studies confirm that both the choice of relaxation space (enforced by simplex projection) and the entropy projection contribute significantly to PGD's effectiveness compared to GBDA. The flexible sequence length also offers benefits.

A key limitation is the white-box assumption: PGD requires access to the model's parameters and architecture to compute gradients. This makes it directly applicable for testing and red teaming models where such access is available (e.g., internal models, open-source models) but not for attacking black-box models deployed as APIs (like ChatGPT, Claude, Gemini) without gradient approximation techniques.

The research contributes a practical and efficient method for generating adversarial text examples, which can be valuable for AI developers and researchers in understanding the vulnerabilities of LLMs, performing large-scale robustness evaluations, and potentially enabling more efficient adversarial training techniques to improve model alignment and safety.

References (31)
  1. The Falcon Series of Open Language Models, 2023. URL http://arxiv.org/abs/2311.16867.
  2. Jailbreaking Black Box Large Language Models in Twenty Queries, 2023. URL http://arxiv.org/abs/2310.08419.
  3. Adversarial Robustness for Machine Learning. Academic Press, 2022. ISBN 978-0-12-824257-5.
  4. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 272–279, Helsinki, Finland, 2008. ACM Press. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390191. URL http://portal.acm.org/citation.cfm?doid=1390156.1390191.
  5. Robustness of Graph Neural Networks at Scale. Neural Information Processing Systems, NeurIPS, 2021.
  6. Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness. In International Conference on Learning Representations, ICLR, 2022. URL http://arxiv.org/abs/2110.10942.
  7. Adversarial Training for Graph Neural Networks: Pitfalls, Solutions, and New Directions. In Neural Information Processing Systems, NeurIPS, 2023.
  8. Gradient-based Adversarial Attacks against Text Transformers. In Conference on Empirical Methods in Natural Language Processing, pp.  5747–5757, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.464. URL https://aclanthology.org/2021.emnlp-main.464.
  9. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, ICLR, 2016. URL https://openreview.net/forum?id=rkE3y85ee.
  10. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, ICLR, 2015. URL http://arxiv.org/abs/1412.6980.
  11. Gradient-based Constrained Sampling from Language Models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  2251–2277, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.144. URL https://aclanthology.org/2022.emnlp-main.144.
  12. Open Sesame! Universal Black Box Jailbreaking of Large Language Models, 2023. URL http://arxiv.org/abs/2309.01446.
  13. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models, 2023. URL http://arxiv.org/abs/2310.04451.
  14. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, ICLR, pp.  1–16, 2017.
  15. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, ICLR, pp.  1–28, 2018.
  16. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, 2024. URL http://arxiv.org/abs/2402.04249.
  17. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically, 2023. URL http://arxiv.org/abs/2312.02119.
  18. Red Teaming Language Models with Language Models, 2022. URL http://arxiv.org/abs/2202.03286.
  19. Adversarial Attacks and Defenses in Large Language Models: Old and New Threats, 2023. URL http://arxiv.org/abs/2310.19737.
  20. Adversarial Examples on Object Recognition: A Comprehensive Survey, 2020. URL http://arxiv.org/abs/2008.04094.
  21. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts, 2020. URL http://arxiv.org/abs/2010.15980.
  22. Intriguing properties of neural networks. International Conference on Learning Representations, ICLR, 2014.
  23. On Adaptive Attacks to Adversarial Example Defenses. Neural Information Processing Systems, NeurIPS, 33:1633–1645, 2020. URL https://proceedings.neurips.cc//paper_files/paper/2020/hash/11f38f8ecd71867b42433548d1078e38-Abstract.html.
  24. Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1):479–487, 1988. ISSN 1572-9613. doi: 10.1007/BF01016429. URL https://doi.org/10.1007/BF01016429.
  25. Universal Adversarial Triggers for Attacking and Analyzing NLP, 2021. URL http://arxiv.org/abs/1908.07125.
  26. Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery, 2023. URL https://arxiv.org/abs/2302.03668v2.
  27. Gradient-Based Language Model Red Teaming, 2024. URL http://arxiv.org/abs/2401.16656.
  28. Topology attack and defense for graph neural networks: An optimization perspective. In IJCAI International Joint Conference on Artificial Intelligence, pp. 3961–3967, 2019. doi: 10.24963/ijcai.2019/550.
  29. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023. URL http://arxiv.org/abs/2306.05685.
  30. AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models, 2023. URL http://arxiv.org/abs/2310.15140.
  31. Universal and Transferable Adversarial Attacks on Aligned Language Models, 2023. URL http://arxiv.org/abs/2307.15043.
Authors (5)
  1. Simon Geisler (24 papers)
  2. Tom Wollschläger (15 papers)
  3. M. H. I. Abdalla (2 papers)
  4. Johannes Gasteiger (18 papers)
  5. Stephan Günnemann (169 papers)