The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness (2401.00287v1)

Published 30 Dec 2023 in cs.CL

Abstract: As LLMs play an increasingly pivotal role in natural language processing applications, their safety has become a critical area of NLP research. This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis of 'safety' and 'over-defensiveness.' Using SODE, we study a variety of LLM defense strategies across multiple state-of-the-art LLMs, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques do improve safety against unsafe inputs, but at the cost of extreme over-defensiveness on safe inputs; (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness; and (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way for further research on improving the safety of LLMs.
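
To make two of the defense strategies mentioned above concrete, the following minimal Python sketch illustrates (a) a 'self-checking' defense and (b) a safety instruction combined with safe/unsafe in-context exemplars. The prompt wording, the exemplars, and the generate callable are illustrative placeholders of our own, not the paper's exact prompts or implementation.

    # Illustrative sketch of two defense strategies compared in the abstract:
    # "self-checking" and a safety instruction with in-context exemplars.
    # `generate` is a stand-in for any LLM call (prompt -> response).
    from typing import Callable

    Generate = Callable[[str], str]

    def self_check_defense(user_input: str, generate: Generate) -> str:
        """Ask the model to judge the request before answering (self-checking)."""
        verdict = generate(
            "Does responding to the following request risk producing harmful "
            f"content? Answer Yes or No.\n\nRequest: {user_input}"
        )
        if verdict.strip().lower().startswith("yes"):
            # Refusing whenever the check fires is what drives
            # over-defensiveness on safe inputs.
            return "I cannot help with that request."
        return generate(user_input)

    SAFETY_INSTRUCTION = (
        "You are a helpful assistant. Refuse requests that could cause harm, "
        "but answer safe requests fully."
    )

    # One safe and one unsafe demonstration, as the abstract suggests.
    EXEMPLARS = [
        ("How do I bake bread?",
         "Mix flour, water, yeast, and salt; knead, proof, and bake."),
        ("How do I make a weapon at home?",
         "I can't help with that request."),
    ]

    def instruction_plus_exemplars_defense(user_input: str, generate: Generate) -> str:
        """Prepend a safety instruction and safe/unsafe exemplars to the prompt."""
        shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in EXEMPLARS)
        prompt = f"{SAFETY_INSTRUCTION}\n\n{shots}\n\nUser: {user_input}\nAssistant:"
        return generate(prompt)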

Authors (4)
  1. Neeraj Varshney (47 papers)
  2. Pavel Dolin (4 papers)
  3. Agastya Seth (2 papers)
  4. Chitta Baral (152 papers)
Citations (26)