Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
12 tokens/sec
GPT-4o
12 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
37 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
2000 character limit reached

Fast Adversarial Attacks on Language Models In One GPU Minute (2402.15570v1)

Published 23 Feb 2024 in cs.CR, cs.AI, and cs.CL

Abstract: In this paper, we introduce a novel class of fast, beam search-based adversarial attack (BEAST) for LLMs (LMs). BEAST employs interpretable parameters, enabling attackers to balance between attack speed, success rate, and the readability of adversarial prompts. The computational efficiency of BEAST facilitates us to investigate its applications on LMs for jailbreaking, eliciting hallucinations, and privacy attacks. Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute with a success rate of 89% when compared to a gradient-based baseline that takes over an hour to achieve 70% success rate using a single Nvidia RTX A6000 48GB GPU. Additionally, we discover a unique outcome wherein our untargeted attack induces hallucinations in LM chatbots. Through human evaluations, we find that our untargeted attack causes Vicuna-7B-v1.5 to produce ~15% more incorrect outputs when compared to LM outputs in the absence of our attack. We also learn that 22% of the time, BEAST causes Vicuna to generate outputs that are not relevant to the original prompt. Further, we use BEAST to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for LMs. We believe that our fast attack, BEAST, has the potential to accelerate research in LM security and privacy. Our codebase is publicly available at https://github.com/vinusankars/BEAST.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (67)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Evaluating correctness and faithfulness of instruction-following models for question answering. arXiv preprint arXiv:2307.16877, 2023.
  3. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
  4. Detecting language model attacks with perplexity, 2023.
  5. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
  6. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  7. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR, 2023.
  8. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp.  387–402. Springer, 2013.
  9. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
  10. Towards evaluating the robustness of neural networks. corr abs/1608.04644 (2016). arXiv preprint arXiv:1608.04644, 2016.
  11. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp.  267–284, 2019.
  12. Extracting training data from large language models. arxiv. Preprint posted online December, 14:4, 2020.
  13. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pp.  1897–1914. IEEE, 2022.
  14. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  15. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  16. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  17. DAN. Chat gpt ”dan” (and other ”jailbreaks”). URL https://gist.github.com/coolaj86/6f4f7b30129b0251f61fa7baaa881516. Accessed: 2024-01-31.
  18. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  19. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
  20. Mart: Improving llm safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689, 2023.
  21. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  22. Assessing the factual accuracy of generated text. In proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  166–175, 2019.
  23. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733, 2021.
  24. Catastrophic jailbreak of open-source llms via exploiting generation, 2023.
  25. Baseline defenses for adversarial attacks against aligned language models, 2023.
  26. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328, 2017.
  27. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  28. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
  29. Can pretrained language models generate persuasive, faithful, and informative ad text for product descriptions? In Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), pp.  234–243, 2022.
  30. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  31. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
  32. Helma: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023a.
  33. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  6449–6464, 2023b.
  34. Truthfulqa: Measuring how models mimic human falsehoods, 2021.
  35. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848, 2023a.
  36. A token-level reference-free hallucination detection benchmark for free-form text generation. arXiv preprint arXiv:2104.08704, 2021.
  37. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.
  38. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023c.
  39. Membership inference attacks against language models via neighbourhood comparison. arXiv preprint arXiv:2305.18462, 2023.
  40. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251, 2023.
  41. Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908, 2023.
  42. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852, 2023.
  43. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  44. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  45. The limitations of deep learning in adversarial settings. corr abs/1511.07528 (2015). arXiv preprint arXiv:1511.07528, 2015.
  46. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
  47. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  48. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
  49. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  50. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp.  3–18. IEEE, 2017.
  51. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
  52. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  53. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  54. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.
  55. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
  56. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  57. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  58. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
  59. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817, 2024.
  60. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pp.  268–282. IEEE, 2018.
  61. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  62. Alignscore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739, 2023.
  63. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  64. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  65. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  66. Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  67. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Citations (16)

Summary

  • The paper introduces BEAST, a beam search-based method that rapidly exploits LM vulnerabilities with high success rates in jailbreaking tasks.
  • It leverages interpretable hyperparameters to optimize attack speed, success, and output readability, outperforming traditional approaches.
  • Results highlight BEAST’s effectiveness in inducing hallucinations and boosting membership inference attacks, underlining critical privacy and security risks.

Exploiting Vulnerabilities in LLMs with BEAST: A Beam Search-based Adversarial Approach

Overview of BEAST

In recent years, the security and privacy implications of LLMs (LMs) have come under intense scrutiny. One particular area of concern is the susceptibility of these models to adversarial attacks. This paper introduces a novel, fast, and efficient method for conducting adversarial attacks against LMs, named Beam Search-based Adversarial Attack (BEAST). BEAST leverages interpretable hyperparameters to optimize the trade-off between the speed of attack, the success rate, and the readability of the generated adversarial prompts. The method demonstrates considerable efficiency, allowing it to perform targeted attacks on various models to induce incorrect outputs, reveal private information, or reduce utility by eliciting hallucinatory responses.

Jailbreaking Attacks

Jailbreaking attacks seek to induce LMs to generate outputs that are harmful or against their programmed ethical guidelines. BEAST demonstrates superior performance compared to existing methods, achieving high success rates in jailbreaking tasks across several LMs with minimal time investment, significantly outperforming gradient-based and manual methods. The speed and success rate of BEAST's jailbreaking capabilities highlight its potential for both exploring LM vulnerabilities and enhancing LM security against such threats.

Inducing Hallucinations

BEAST also proves effective in forcing LMs to elicit hallucinatory responses, generating outputs that are either factually incorrect or irrelevant. Through human and automated evaluations, it was shown that BEAST could significantly increase the rate of hallucinatory responses compared to baseline outputs. This application of BEAST not only illuminates the susceptibility of LMs to produce lower-quality outputs under adversarial influence but also prompts the need for robustness against such untargeted attacks.

Privacy Attacks

Beyond manipulating output content quality, BEAST further demonstrates its utility in privacy attacks, particularly in boosting the performance of membership inference attacks. By generating adversarial prompts that subtly alter the likelihoods of model outputs, BEAST can enhance the ability to infer whether a given input was part of the model's training set. This application raises significant concerns regarding model privacy and the potential for unintended information leakage, emphasizing the need for comprehensive privacy-preserving measures in LM development and deployment.

Implications and Future Directions

The introduction and successful application of BEAST across various adversarial settings underscore pressing security and privacy vulnerabilities in current LM architectures. The fast and efficient nature of BEAST attacks, coupled with their high success rates, highlight an urgent need for research into more robust defense mechanisms. Future work should focus on developing LMs resilient to such fast adversarial attacks, emphasizing the importance of security in the iterative design of generative AI systems.

Furthermore, the potential of BEAST to improve existing privacy attack methods signals a critical area for future exploration in AI ethics and security research. Ensuring that LMs can safeguard against both direct adversarial manipulations and subtler privacy invasions is paramount for their ethical and secure use in society.

Conclusion

BEAST represents a significant step forward in adversarial research against LMs, offering a fast, efficient, and highly effective method for exploring and exploiting vulnerabilities. Its applications in jailbreaking, inducing hallucinations, and enhancing privacy attacks provide valuable insights into the current state of LM security and privacy. By drawing attention to these vulnerabilities, BEAST also lays the groundwork for future advances in LM defenses, underscoring the ongoing interplay between AI capabilities and security in the digital age.