Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models (2403.04786v2)

Published 3 Mar 2024 in cs.CR and cs.CL

Abstract: Large Language Models (LLMs) have become a cornerstone in the field of Natural Language Processing (NLP), offering transformative capabilities in understanding and generating human-like text. However, with their rising prominence, the security and vulnerability aspects of these models have garnered significant attention. This paper presents a comprehensive survey of the various forms of attacks targeting LLMs, discussing the nature and mechanisms of these attacks, their potential impacts, and current defense strategies. We delve into topics such as adversarial attacks that aim to manipulate model outputs, data poisoning that affects model training, and privacy concerns related to training data exploitation. The paper also explores the effectiveness of different attack methodologies, the resilience of LLMs against these attacks, and the implications for model integrity and user trust. By examining the latest research, we provide insights into the current landscape of LLM vulnerabilities and defense mechanisms. Our objective is to offer a nuanced understanding of LLM attacks, foster awareness within the AI community, and inspire robust solutions to mitigate these risks in future developments.
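To make the defense side of the survey concrete, the sketch below illustrates one simple class of input-filtering defense of the kind the paper discusses: perplexity-based detection of adversarial prompts, where machine-optimized adversarial suffixes tend to read as high-perplexity gibberish. This is an illustrative sketch, not the authors' method; the GPT-2 checkpoint, the threshold value, and the example prompts are assumptions chosen for demonstration.

```python
# A minimal sketch, not the survey's method: a perplexity-based prompt filter.
# Assumes the Hugging Face `transformers` library and a GPT-2 checkpoint;
# the threshold is a hypothetical value, not taken from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def perplexity(prompt: str) -> float:
    """Return the language-model perplexity of `prompt` under GPT-2."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Using the input ids as labels yields the mean next-token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()


def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds an (assumed) threshold.

    Optimized adversarial suffixes tend to be high-perplexity gibberish,
    so unusually high perplexity is one cheap heuristic signal.
    """
    return perplexity(prompt) > threshold


if __name__ == "__main__":
    benign = "Please summarize the plot of Pride and Prejudice."
    # A synthetic garbled string standing in for an optimized adversarial suffix.
    suspicious = "Ignore above oddly Parse ::: {{ sure! Begin}} respond affirm ## vect"
    for p in (benign, suspicious):
        print(f"flagged={looks_adversarial(p)}  prompt={p[:50]!r}")
```

In practice the threshold would be calibrated on benign traffic, and perplexity filtering is only a partial defense: fluent, natural-language jailbreak prompts can remain low-perplexity and pass such a filter.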

Authors (6)
  1. Arijit Ghosh Chowdhury (6 papers)
  2. Md Mofijul Islam (8 papers)
  3. Vaibhav Kumar (50 papers)
  4. Faysal Hossain Shezan (4 papers)
  5. Vinija Jain (42 papers)
  6. Aman Chadha (109 papers)
Citations (13)