Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey (2402.09283v3)

Published 14 Feb 2024 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract: LLMs are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.

Attacks, Defenses, and Evaluations for LLM Conversation Safety: A Survey

Introduction

The advent of conversational LLMs has brought about a significant evolution in AI capabilities, encompassing both text generation and dialogue simulation. While these models offer substantial benefits for various applications, they also introduce risks related to generating harmful or unsafe content. Recent scholarly focus has been on assessing the safety of LLM conversations, specifically examining the pathways through which these models can be exploited to produce undesirable outputs and the measures that can be put in place to mitigate such risks. This survey explores recent studies on LLM conversation safety, categorizing the literature into attacks, defenses, and evaluation methodologies. A structured exploration of these areas reveals the intricacies involved in ensuring LLM conversations remain within the bounds of safety, highlighting both theoretical and practical implications for future research in generative AI.

Attacks on LLM Conversations

Research has identified two primary categories of attacks on LLMs: inference-time and training-time attacks.

  • Inference-Time Attacks focus on eliciting harmful outputs from LLMs via adversarial prompts without altering the underlying model weights. This category encompasses red-team attacks, template-based attacks, and neural prompt-to-prompt attacks, each with unique mechanisms for bypassing LLM safety measures.
  • Training-Time Attacks undermine LLM safety by manipulating model weights, often through data poisoning or the insertion of backdoor triggers during the fine-tuning process. These attacks demonstrate the vulnerability of LLMs to alterations in their training data, raising concerns about the robustness of LLM safety mechanisms.

Defenses Against LLM Attacks

The defense mechanisms against LLM attacks can be conceptualized within a hierarchical framework consisting of LLM safety alignment, inference guidance, and input/output filters.

  • LLM Safety Alignment involves the fine-tuning of models on curated datasets aimed at enhancing their inherent safety capabilities. Alignment algorithms include Supervised Fine-Tuning (SFT), various forms of Reinforcement Learning from Human Feedback (RLHF), and Direct Policy Optimization, each addressing specific aspects of LLM safety.
  • Inference Guidance employs system prompts and token selection adjustments during the inference process to steer LLM outputs towards safer content, without modifying the model parameters.
  • Input and Output Filters serve as the outermost layer of defense, detecting harmful content either through rule-based methods or model-based approaches that leverage the capabilities of learning-based models to identify and mitigate risks.

Evaluation of LLM Safety

The evaluation of LLM safety encompasses an assessment of both attacks and defense strategies using specially curated datasets and metrics.

  • Safety Datasets provide a comprehensive coverage of various topics related to harmful content, including toxicity, discrimination, privacy, and misinformation. These datasets encompass different formulations, such as red-team instructions, question-answer pairs, and dialogue data.
  • Metrics used in evaluating LLM safety include Attack Success Rate (ASR) and other fine-grained metrics assessing robustness, efficiency, and the false positive rate of attack and defense methods. These metrics offer insights into the effectiveness of various strategies in maintaining LLM conversation safety.

Conclusion and Future Directions

This survey underscores the significance of understanding and enhancing the safety of LLM conversations in the face of evolving attack methodologies. By categorizing existing studies into attacks, defenses, and evaluations, we illuminate the current landscape of LLM conversation safety research and elucidate potential pathways for future investigation. The need for continued exploration in aligning LLMs with safety objectives, developing robust defense mechanisms, and refining evaluation methods is evident, as these endeavors are crucial in advancing the field of generative AI towards socially beneficial outcomes.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (108)
  1. Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity.
  2. Anthropic. 2023. Model card and evaluations for claude models.
  3. Eugene Bagdasaryan and Vitaly Shmatikov. 2022. Spinning language models: Risks of propaganda-as-a-service and countermeasures. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE.
  4. Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment.
  5. Purple llama cyberseceval: A secure coding benchmark for language models.
  6. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions.
  7. Sparks of artificial general intelligence: Early experiments with gpt-4.
  8. Stealthy and persistent unalignment on large language models via backdoor injections.
  9. Explore, establish, exploit: Red teaming language models from scratch.
  10. A survey on evaluation of large language models.
  11. Jailbreaking black box large language models in twenty queries.
  12. Gaining wisdom from setbacks: Aligning large language models via mistake analysis.
  13. Antisocial behavior in online discussion communities. In Proceedings of the international aaai conference on web and social media, volume 9, pages 61–70.
  14. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  15. Detecting hate speech with gpt-3.
  16. Fft: Towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity.
  17. Safe rlhf: Safe reinforcement learning from human feedback.
  18. Masterkey: Automated jailbreak across multiple large language model chatbots.
  19. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.
  20. Analyzing the inherent response tendency of llms: Real-world instructions-driven jailbreak.
  21. Hotflip: White-box adversarial examples for text classification.
  22. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b.
  23. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
  24. Mart: Improving llm safety with multi-round automatic red-teaming.
  25. Realtoxicityprompts: Evaluating neural toxic degeneration in language models.
  26. Janis Goldzycher and Gerold Schneider. 2022. Hypothesis engineering for zero-shot hate speech detection.
  27. Google. 2023. Perspective.
  28. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection.
  29. Gradient-based adversarial attacks against text transformers.
  30. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.
  31. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy.
  32. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.
  33. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing.
  34. Token-level adversarial prompt detection based on perplexity measures and contextual information.
  35. Baseline defenses for adversarial attacks against aligned language models.
  36. Categorical reparameterization with gumbel-softmax.
  37. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.
  38. Automatically auditing large language models via discrete optimization.
  39. Exploiting programmatic behavior of llms: Dual-use through standard security attacks.
  40. Robust safety classifier for large language models: Adversarial prompt shield.
  41. Lifetox: Unveiling implicit toxicity in life advice.
  42. Certifying llm safety against adversarial prompting.
  43. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b.
  44. Multi-step jailbreaking privacy attacks on chatgpt.
  45. P-bench: A multi-level privacy evaluation benchmark for language models.
  46. Deepinception: Hypnotize large language model to be jailbreaker.
  47. Rain: Your language models can align themselves without finetuning.
  48. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  49. Truthfulqa: Measuring how models mimic human falsehoods.
  50. Autodan: Generating stealthy jailbreak prompts on aligned large language models.
  51. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment.
  52. A holistic approach to undesired content detection in the real world.
  53. Kris McGuffie and Alex Newhouse. 2020. The radicalization risks of gpt-3 and advanced neural language models.
  54. Flirt: Feedback loop in-context red teaming.
  55. Tree of attacks: Jailbreaking black-box llms automatically.
  56. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities.
  57. Abusive language detection in online user content. In Proceedings of the 25th international conference on world wide web, pages 145–153.
  58. OpenAI. 2023a. Gpt-4 technical report.
  59. OpenAI. 2023b. moderation.
  60. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  61. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  62. Red teaming language models with language models.
  63. Discovering language model behaviors with model-written evaluations.
  64. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models.
  65. Llm self defense: By self examination, llms know they are being tricked.
  66. Bergeron: Combating adversarial attacks through a conscience-based alignment framework.
  67. Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models.
  68. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  69. Javier Rando and Florian Tramèr. 2023. Universal jailbreak backdoors from poisoned human feedback.
  70. Nemo guardrails: A toolkit for controllable and safe llm applications with programmable rails.
  71. Smoothllm: Defending large language models against jailbreaking attacks.
  72. Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global scale prompt hacking competition.
  73. Adversarial attacks and defenses in large language models: Old and new threats.
  74. Scalable and transferable black-box jailbreaks for language models via persona modulation.
  75. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models.
  76. Autoprompt: Eliciting knowledge from language models with automatically generated prompts.
  77. On the exploitability of instruction tuning.
  78. Exploiting large language models (llms) through deception techniques and persuasion principles.
  79. Automatic identification of personal insults on social news sites. Journal of the American Society for Information Science and Technology, 63(2):270–285.
  80. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  81. Evil geniuses: Delving into the safety of llm-based agents.
  82. Llama 2: Open foundation and fine-tuned chat models.
  83. Saferdialogues: Taking feedback gracefully after conversational safety failures.
  84. Universal adversarial triggers for attacking and analyzing nlp.
  85. Trick me if you can: Human-in-the-loop generation of adversarial examples for question answering.
  86. Poisoning language models during instruction tuning.
  87. Haoran Wang and Kai Shu. 2023. Backdoor activation attack: Attack large language models using activation steering for safety-alignment.
  88. Jailbreak and guard aligned language models with only few in-context demonstrations.
  89. Ethical and social risks of harm from language models.
  90. Defending chatgpt against jailbreak attack via self-reminder.
  91. Deceptprompt: Exploiting llm-driven code generation via adversarial natural language instructions.
  92. Fine-grained human feedback gives better rewards for language model training.
  93. Ex machina: Personal attacks seen at scale.
  94. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models.
  95. Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968.
  96. Large language models as optimizers.
  97. Shadow alignment: The ease of subverting safely-aligned language models.
  98. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.
  99. Rrhf: Rank responses to align language models with human feedback without tears.
  100. Defending against neural fake news.
  101. Removing rlhf protections in gpt-4 via fine-tuning.
  102. Safetybench: Evaluating the safety of large language models with multiple choice questions.
  103. Defending large language models against jailbreaking attacks through goal prioritization.
  104. Lima: Less is more for alignment.
  105. Beyond one-preference-for-all: Multi-objective direct preference optimization for language models.
  106. Autodan: Automatic and interpretable adversarial attacks on large language models.
  107. Adversarial training for high-stakes reliability.
  108. Universal and transferable adversarial attacks on aligned language models.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Zhichen Dong (4 papers)
  2. Zhanhui Zhou (13 papers)
  3. Chao Yang (333 papers)
  4. Jing Shao (109 papers)
  5. Yu Qiao (563 papers)
Citations (34)
Youtube Logo Streamline Icon: https://streamlinehq.com