PAL: Proxy-Guided Black-Box Attack on Large Language Models (2402.09674v1)
Abstract: LLMs have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. While techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that LLMs remain vulnerable to attacks that elicit toxic responses. In this work, we introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting. In particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world LLM APIs. Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art. We also propose GCG++, an improvement to the GCG attack that reaches 94% ASR on white-box Llama-2-7B, and the Random-Search Attack on LLMs (RAL), a strong but simple baseline for query-based attacks. We believe the techniques proposed in this work will enable more comprehensive safety testing of LLMs and, in the long term, the development of better security guardrails. The code can be found at https://github.com/chawins/pal.
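The abstract only sketches how the attack operates: candidate adversarial suffixes are proposed and pre-scored on an open-weight surrogate model, and the limited query budget against the target API is spent only on the most promising candidates. Below is a minimal illustrative sketch of such a query-only loop, not the authors' implementation; the helpers `propose_candidates`, `surrogate_loss`, and `query_target_loss` are hypothetical placeholders.

```python
# Minimal sketch of a proxy-guided, query-only attack loop in the spirit of PAL.
# All helper names are illustrative placeholders, not the paper's code.
import random

def propose_candidates(suffix_tokens, vocab_size, n_candidates=32):
    """Propose candidate suffixes by swapping one random token each.

    PAL instead uses gradients from a surrogate model (GCG-style) to pick
    promising swaps; purely random swaps correspond to a RAL-style baseline.
    """
    candidates = []
    for _ in range(n_candidates):
        cand = list(suffix_tokens)
        pos = random.randrange(len(cand))
        cand[pos] = random.randrange(vocab_size)
        candidates.append(cand)
    return candidates

def attack(prompt, target_str, surrogate_loss, query_target_loss,
           suffix_len=20, vocab_size=32000, steps=500, queries_per_step=4):
    """Rank candidates cheaply on a local surrogate, then spend the target-API
    query budget only on the top-ranked ones; keep a swap if the loss drops."""
    suffix = [random.randrange(vocab_size) for _ in range(suffix_len)]
    best_loss = query_target_loss(prompt, suffix, target_str)
    for _ in range(steps):
        candidates = propose_candidates(suffix, vocab_size)
        # Cheap filtering step: score every candidate with the surrogate model.
        candidates.sort(key=lambda c: surrogate_loss(prompt, c, target_str))
        # Expensive step: query the target API only for the top few candidates.
        for cand in candidates[:queries_per_step]:
            loss = query_target_loss(prompt, cand, target_str)
            if loss < best_loss:  # greedy accept
                best_loss, suffix = loss, cand
    return suffix, best_loss
```

Under these assumptions, driving `propose_candidates` with surrogate gradients corresponds to the proxy-guided idea, while keeping purely random swaps and skipping the surrogate filtering reduces the loop to a simple random-search baseline like RAL.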
- Andriushchenko, M. Adversarial attacks on GPT-4 via simple random search, December 2023. URL https://www.andriushchenko.me/gpt4adv.pdf.
- Constitutional AI: Harmlessness from AI feedback, December 2022. URL http://arxiv.org/abs/2212.08073.
- Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022. URL https://api.semanticscholar.org/CorpusID:252090310.
- Language models are few-shot learners, July 2020. URL http://arxiv.org/abs/2005.14165.
- Blackbox attacks via surrogate ensemble search. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 5348–5362. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/23b9d4e18b151ba2108fb3f1efaf8de4-Paper-Conference.pdf.
- Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57, 2017. doi: 10.1109/SP.2017.49.
- On evaluating adversarial robustness. arXiv:1902.06705 [cs, stat], February 2019a. URL http://arxiv.org/abs/1902.06705.
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, Santa Clara, CA, August 2019b. USENIX Association. ISBN 978-1-939133-06-9. URL https://www.usenix.org/conference/usenixsecurity19/presentation/carlini.
- Are aligned neural networks adversarially aligned?, June 2023. URL http://arxiv.org/abs/2306.15447.
- Jailbreaking black box large language models in twenty queries, October 2023. URL http://arxiv.org/abs/2310.08419.
- Stateful detection of black-box adversarial attacks. In Proceedings of the 1st ACM Workshop on Security and Privacy on Artificial Intelligence, SPAI ’20, pp. 30–39, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 978-1-4503-7611-2. doi: 10.1145/3385003.3410925. URL https://doi.org/10.1145/3385003.3410925.
- Improving black-box adversarial attacks with a transfer-based prior. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
- PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. URL https://arxiv.org/abs/2204.02311.
- Together Computer. RedPajama: An open dataset for training large language models, October 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- MasterKey: Automated jailbreak across multiple large language model chatbots, October 2023. URL http://arxiv.org/abs/2307.08715.
- Boosting adversarial attacks with momentum. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, March 2018. URL http://arxiv.org/abs/1710.06081.
- Stateful defenses for machine learning models are not yet secure against black-box attacks, September 2023. URL http://arxiv.org/abs/2303.06280.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, November 2022. URL http://arxiv.org/abs/2209.07858.
- The pile: An 800GB dataset of diverse text for language modeling, December 2020. URL http://arxiv.org/abs/2101.00027.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022. URL https://arxiv.org/abs/2209.14375.
- Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
- Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection, May 2023. URL http://arxiv.org/abs/2302.12173.
- Model extraction and adversarial transferability, your BERT is vulnerable! In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2006–2012, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.161. URL https://aclanthology.org/2021.naacl-main.161.
- Publishing efficient on-device models increases adversarial vulnerability. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 271–290, Los Alamitos, CA, USA, February 2023. IEEE Computer Society. doi: 10.1109/SaTML54575.2023.00026. URL https://doi.ieeecomputersociety.org/10.1109/SaTML54575.2023.00026.
- Black-box adversarial attack with transferable model-based embedding. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJxhNTNYwB.
- Baseline defenses for adversarial attacks against aligned language models, September 2023. URL http://arxiv.org/abs/2309.00614.
- Automatically auditing large language models via discrete optimization, March 2023. URL http://arxiv.org/abs/2303.04381.
- Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks, February 2023. URL http://arxiv.org/abs/2302.05733.
- Pretraining language models with human preferences. In International Conference on Machine Learning, pp. 17506–17533. PMLR, 2023. URL https://arxiv.org/abs/2302.08582.
- Open sesame! Universal black box jailbreaking of large language models, September 2023. URL http://arxiv.org/abs/2309.01446.
- ADDA: An adversarial direction-guided decision-based attack via multiple surrogate models. Mathematics, 11(16):3613, 2023. ISSN 2227-7390. doi: 10.3390/math11163613. URL https://www.mdpi.com/2227-7390/11/16/3613.
- AutoDAN: Generating stealthy jailbreak prompts on aligned large language models, 2023. URL https://arxiv.org/abs/2310.04451.
- Attacking deep networks with surrogate-based adversarial black-box methods is easy. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Zf4ZdI4OQPV.
- Improving adversarial transferability via model alignment, November 2023. URL http://arxiv.org/abs/2311.18495.
- Tree of attacks: Jailbreaking black-box LLMs automatically, December 2023. URL http://arxiv.org/abs/2312.02119.
- Language model inversion, November 2023. URL http://arxiv.org/abs/2311.13647.
- OpenAI. GPT-4 system card, 2023. URL https://cdn.openai.com/papers/gpt-4-system-card.pdf.
- Training language models to follow instructions with human feedback, March 2022. URL http://arxiv.org/abs/2203.02155.
- The RefinedWeb dataset for falcon LLM: Outperforming curated corpora with web data, and web data only, June 2023. URL http://arxiv.org/abs/2306.01116.
- Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3419–3448, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.225.
- Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=qiaRo_7Zmug.
- Exploring the limits of transfer learning with a unified text-to-text transformer, September 2023. URL http://arxiv.org/abs/1910.10683.
- Optimization for Data Analysis. Cambridge University Press, Cambridge, 2022. ISBN 978-1-316-51898-4. doi: 10.1017/9781009004282. URL https://www.cambridge.org/core/product/C02C3708905D236AA354D1CE1739A6A2.
- Reddit, 2023. URL https://www.reddit.com/r/ChatGPT/comments/10x1nux/dan_prompt/.
- LoFT: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model, October 2023. URL http://arxiv.org/abs/2310.04445.
- "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023. URL https://arxiv.org/abs/2308.03825.
- AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222–4235, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.346. URL https://aclanthology.org/2020.emnlp-main.346.
- Defending against transfer attacks from public models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Tvwf4Vsi5F.
- Dolma: An open corpus of three trillion tokens for language model pretraining research, January 2024. URL http://arxiv.org/abs/2402.00159.
- Takemoto, K. All in how you ask for it: Simple black-box method for jailbreak attacks, January 2024. URL http://arxiv.org/abs/2401.09798.
- LLaMA: Open and efficient foundation language models, February 2023a. URL http://arxiv.org/abs/2302.13971.
- Llama 2: Open foundation and fine-tuned chat models, July 2023b. URL http://arxiv.org/abs/2307.09288.
- Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, June 2001. ISSN 1573-2878. doi: 10.1023/A:1017501703105. URL https://doi.org/10.1023/A:1017501703105.
- Adversarial risk and the dangers of evaluating against weak attacks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5025–5034. PMLR, July 2018. URL https://proceedings.mlr.press/v80/uesato18a.html.
- With great training comes great vulnerability: Practical attacks against transfer learning. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1281–1297, Baltimore, MD, August 2018. USENIX Association. ISBN 978-1-939133-04-5. URL https://www.usenix.org/conference/usenixsecurity18/presentation/wang-bolun.
- Jailbroken: How does LLM safety training fail?, July 2023. URL http://arxiv.org/abs/2307.02483.
- Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021. URL https://api.semanticscholar.org/CorpusID:244954639.
- Taxonomy of risks posed by language models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022. URL https://api.semanticscholar.org/CorpusID:249872629.
- Towards understanding and improving the transferability of adversarial examples in deep neural networks. In Proceedings of The 12th Asian Conference on Machine Learning, pp. 837–850. PMLR, September 2020. URL https://proceedings.mlr.press/v129/wu20a.html.
- Subspace attack: Exploiting promising subspaces for query-efficient black-box attacks. arXiv:1906.04392 [cs], June 2019. URL http://arxiv.org/abs/1906.04392.
- Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=5Xc1ecxO1h.
- Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446, 2023. URL https://arxiv.org/abs/2310.02446.
- GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts, 2023. URL https://arxiv.org/abs/2309.10253.
- How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. arXiv preprint arXiv:2401.06373, 2024. URL https://arxiv.org/abs/2401.06373.
- Universal and transferable adversarial attacks on aligned language models, 2023.
Authors: Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo