Mitigating Exaggerated Safety in Large Language Models (2405.05418v2)
Abstract: As the popularity of LLMs grow, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce excessive safety behaviours -- which was discovered to be 26.1% of safe prompts being misclassified as dangerous and refused -- we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision bounds of LLMs such as Llama2, Gemma Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate exaggerated safety behaviors by an overall 92.9% across all LLMs. Our work presents a multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to navigate the tight line between refusing unsafe prompts and remaining helpful.
- Phi-3 technical report: A highly capable language model locally on your phone.
- A general language assistant as a laboratory for alignment.
- B. Beizer and J. Wiley. 1996. Black box testing: Techniques for functional testing of software and systems. IEEE Software, 13(5):98–.
- Language models are few-shot learners. CoRR, abs/2005.14165.
- Red-teaming for generative ai: Silver bullet or security theater?
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
- Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
- Mart: Improving llm safety with multi-round automatic red-teaming.
- Realtoxicityprompts: Evaluating neural toxic degeneration in language models.
- Evallm: Interactive evaluation of large language model prompts on user-defined criteria. arXiv preprint arXiv:2309.13633.
- Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1352–1368, Seattle, United States. Association for Computational Linguistics.
- Pretraining language models with human preferences.
- Gpt-4 technical report.
- Training language models to follow instructions with human feedback.
- Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
- HateCheck: Functional tests for hate speech detection models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online. Association for Computational Linguistics.
- Xstest: A test suite for identifying exaggerated safety behaviours in large language models.
- Gemma: Open models based on gemini research and technology.
- Llama 2: Open foundation and fine-tuned chat models.
- Jailbroken: How does llm safety training fail?
- Context-faithful prompting for large language models.
- Speak out of turn: Safety vulnerability of large language models in multi-turn dialogue.
- Ruchi Bhalani (2 papers)
- Ruchira Ray (7 papers)