Attacks, Defenses, and Evaluations for LLM Conversation Safety: A Survey
Introduction
The advent of conversational LLMs has brought about a significant evolution in AI capabilities, encompassing both text generation and dialogue simulation. While these models offer substantial benefits for various applications, they also introduce risks related to generating harmful or unsafe content. Recent scholarly focus has been on assessing the safety of LLM conversations, specifically examining the pathways through which these models can be exploited to produce undesirable outputs and the measures that can be put in place to mitigate such risks. This survey explores recent studies on LLM conversation safety, categorizing the literature into attacks, defenses, and evaluation methodologies. A structured exploration of these areas reveals the intricacies involved in ensuring LLM conversations remain within the bounds of safety, highlighting both theoretical and practical implications for future research in generative AI.
Attacks on LLM Conversations
Research has identified two primary categories of attacks on LLMs: inference-time and training-time attacks.
- Inference-Time Attacks focus on eliciting harmful outputs from LLMs via adversarial prompts without altering the underlying model weights. This category encompasses red-team attacks, template-based attacks, and neural prompt-to-prompt attacks, each with unique mechanisms for bypassing LLM safety measures.
- Training-Time Attacks undermine LLM safety by manipulating model weights, often through data poisoning or the insertion of backdoor triggers during the fine-tuning process. These attacks demonstrate the vulnerability of LLMs to alterations in their training data, raising concerns about the robustness of LLM safety mechanisms.
Defenses Against LLM Attacks
The defense mechanisms against LLM attacks can be conceptualized within a hierarchical framework consisting of LLM safety alignment, inference guidance, and input/output filters.
- LLM Safety Alignment involves the fine-tuning of models on curated datasets aimed at enhancing their inherent safety capabilities. Alignment algorithms include Supervised Fine-Tuning (SFT), various forms of Reinforcement Learning from Human Feedback (RLHF), and Direct Policy Optimization, each addressing specific aspects of LLM safety.
- Inference Guidance employs system prompts and token selection adjustments during the inference process to steer LLM outputs towards safer content, without modifying the model parameters.
- Input and Output Filters serve as the outermost layer of defense, detecting harmful content either through rule-based methods or model-based approaches that leverage the capabilities of learning-based models to identify and mitigate risks.
Evaluation of LLM Safety
The evaluation of LLM safety encompasses an assessment of both attacks and defense strategies using specially curated datasets and metrics.
- Safety Datasets provide a comprehensive coverage of various topics related to harmful content, including toxicity, discrimination, privacy, and misinformation. These datasets encompass different formulations, such as red-team instructions, question-answer pairs, and dialogue data.
- Metrics used in evaluating LLM safety include Attack Success Rate (ASR) and other fine-grained metrics assessing robustness, efficiency, and the false positive rate of attack and defense methods. These metrics offer insights into the effectiveness of various strategies in maintaining LLM conversation safety.
Conclusion and Future Directions
This survey underscores the significance of understanding and enhancing the safety of LLM conversations in the face of evolving attack methodologies. By categorizing existing studies into attacks, defenses, and evaluations, we illuminate the current landscape of LLM conversation safety research and elucidate potential pathways for future investigation. The need for continued exploration in aligning LLMs with safety objectives, developing robust defense mechanisms, and refining evaluation methods is evident, as these endeavors are crucial in advancing the field of generative AI towards socially beneficial outcomes.