Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey (2402.09283v3)
Published 14 Feb 2024 in cs.CL, cs.AI, cs.CY, and cs.LG
Abstract: LLMs are now commonplace in conversation applications. However, the risk of their misuse to generate harmful responses has raised serious societal concerns and spurred recent research on LLM conversation safety. In this survey, we therefore provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/LLM-conversation-safety.