Attacking Large Language Models with Projected Gradient Descent (2402.09154v1)
Abstract: Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically require more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. To remedy this, we revisit Projected Gradient Descent (PGD) on the continuously relaxed input prompt. Although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization while achieving the same devastating attack results.
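The core idea sketched in the abstract, PGD over a continuous relaxation of the prompt tokens, can be illustrated with a minimal code sketch. The snippet below assumes a HuggingFace-style causal LM that accepts `inputs_embeds`; all names (`pgd_attack_sketch`, `project_simplex`, `emb_matrix`) and hyperparameters are illustrative placeholders rather than the paper's actual algorithm, and the paper's additional mechanisms for controlling the relaxation error are not modeled here. It only shows the generic pattern: optimize relaxed one-hot token vectors with gradient steps and project them back onto the probability simplex (in the spirit of Duchi et al., 2008) before discretizing.

```python
import torch
import torch.nn.functional as F


def project_simplex(x):
    """Euclidean projection of each row of x onto the probability simplex
    (sort-and-threshold scheme in the spirit of Duchi et al., 2008)."""
    sorted_x, _ = torch.sort(x, descending=True, dim=-1)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.shape[-1] + 1, device=x.device)
    cond = sorted_x - (cumsum - 1.0) / k > 0
    rho = cond.sum(dim=-1, keepdim=True)            # support size per row
    theta = (cumsum.gather(-1, rho - 1) - 1.0) / rho
    return torch.clamp(x - theta, min=0.0)


def pgd_attack_sketch(model, emb_matrix, prompt_ids, target_ids,
                      adv_len=20, steps=500, lr=0.1):
    """Illustrative PGD loop over a relaxed (soft) adversarial suffix.

    `model` is assumed to accept `inputs_embeds` (HuggingFace-style);
    `emb_matrix` is its token-embedding matrix (vocab_size x d_model).
    Hyperparameters are placeholders, not the paper's settings.
    """
    vocab_size = emb_matrix.shape[0]
    device = emb_matrix.device

    # Relaxed one-hot distributions over the vocabulary, one row per
    # adversarial position, kept on the simplex by projection.
    x = torch.full((adv_len, vocab_size), 1.0 / vocab_size,
                   device=device, requires_grad=True)

    prompt_emb = emb_matrix[prompt_ids]    # fixed user prompt
    target_emb = emb_matrix[target_ids]    # desired target continuation

    for _ in range(steps):
        adv_emb = x @ emb_matrix           # soft embeddings of the suffix
        inputs = torch.cat([prompt_emb, adv_emb, target_emb], dim=0)
        logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]

        # Cross-entropy of the target tokens given prompt + soft suffix.
        tgt_start = prompt_emb.shape[0] + adv_len
        loss = F.cross_entropy(logits[tgt_start - 1:-1], target_ids)

        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad               # gradient step on the relaxation
            x.copy_(project_simplex(x))    # project back onto the simplex
        x.grad = None

    # Discretize by taking the most likely token at each position.
    return x.argmax(dim=-1)
```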