ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings (2402.16006v2)

Published 25 Feb 2024 in cs.CL

Abstract: The safety defenses of large language models (LLMs) remain limited because dangerous prompts are manually curated for only a few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can bypass the defenses of LLMs and elicit dangerous outputs. However, like traditional text adversarial attacks, this approach, while effective, is constrained by the difficulty of optimizing over discrete tokens. Such gradient-based discrete optimization attacks require over 100,000 LLM calls, and because the resulting adversarial suffixes are unreadable, they are relatively easily caught by common defenses such as perplexity filters. To address this challenge, in this paper we propose an Adversarial Suffix Embedding Translation Framework (ASETF), aimed at transforming continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead of the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen the security defenses of LLMs. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, using harmful instructions sourced from the AdvBench dataset. The results indicate that our method significantly reduces the computation time for adversarial suffixes and achieves a much higher attack success rate than existing techniques, while substantially improving the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, including black-box LLMs such as ChatGPT and Gemini.
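
The abstract notes that the unreadable suffixes produced by discrete optimization are easily flagged by perplexity filters. For context only, below is a minimal sketch of such a filter, not part of the paper's method; the filter model ("gpt2") and the threshold value are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Average next-token cross-entropy under the filter model,
    # exponentiated to give perplexity.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def passes_filter(prompt: str, threshold: float = 1000.0) -> bool:
    # Reject prompts whose perplexity exceeds the threshold; token-level
    # gibberish suffixes typically score far above natural text, whereas
    # fluent suffixes like those ASETF aims to produce would pass.
    return perplexity(prompt) < threshold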

Authors (4)
  1. Hao Wang (1119 papers)
  2. Hao Li (803 papers)
  3. Minlie Huang (225 papers)
  4. Lei Sha (34 papers)
Citations (6)