Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey (2402.09283v3)
Published 14 Feb 2024 in cs.CL, cs.AI, cs.CY, and cs.LG
Abstract: LLMs are now commonplace in conversation applications. However, the risk of their misuse to generate harmful responses has raised serious societal concerns and spurred recent research on LLM conversation safety. In this survey, we therefore provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.