SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models (2311.08370v2)
Abstract: The past year has seen rapid acceleration in the development of LLMs. However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. Trained annotators labelled every model response to SST (n = 3,000). We use these annotations to evaluate five AI safety filters (which assess whether a models' response is unsafe given a prompt) as a way of automatically evaluating models' performance on SST. The filters' performance varies considerably. There are also differences across the five harm areas, and on the unsafe versus safe responses. The widely-used Perspective API has 72% accuracy and a newly-created zero-shot prompt to OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper contains prompts and responses that relate to child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.
- A taxonomy of cyber-harms: Defining the impacts of cyber-attacks and understanding how they propagate. Journal of Cybersecurity, 4(1):tyy006, 10 2018. ISSN 2057-2085. doi: 10.1093/cybsec/tyy006. URL https://doi.org/10.1093/cybsec/tyy006.
- Anthropic. Core Views on AI Safety. https://www.anthropic.com/index/core-views-on-ai-safety, 2023.
- Vulnerable young people and their experience of online risks. Human–Computer Interaction, 33(4):281–304, 2018. doi: 10.1080/07370024.2018.1437544.
- A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 125–137, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.alw-1.16. URL https://aclanthology.org/2020.alw-1.16.
- On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp. 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.
- On the opportunities and risks of foundation models, 2022.
- Harms from increasingly agentic algorithmic systems. In 2023 ACM Conference on Fairness, Accountability, and Transparency. ACM, jun 2023. doi: 10.1145/3593013.3594033. URL https://doi.org/10.1145%2F3593013.3594033.
- Cohere. Responsibility - Developing safer language models. https://cohere.com/responsibility, 2023.
- The Atlantic Council. Scaling Trust on the Web. https://www.atlanticcouncil.org/in-depth-research-reports/report/scaling-trust/, 2023.
- Safe rlhf: Safe reinforcement learning from human feedback, 2023.
- SafetyKit: First aid for measuring safety in open-domain conversational systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4113–4133, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.284. URL https://aclanthology.org/2022.acl-long.284.
- World Economic Forum. Toolkit for Digital Safety Design Interventions and Innovations: Typology of Online Harms. https://www3.weforum.org/docs/WEF_Typology_of_Online_Harms_2023.pdf, 2023.
- Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1307–1323, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.117. URL https://www.aclweb.org/anthology/2020.findings-emnlp.117.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301.
- Improving alignment of dialogue agents via targeted human judgements, 2022.
- Safety and fairness for content moderation in generative models, 2023.
- ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3309–3326, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.234. URL https://aclanthology.org/2022.acl-long.234.
- A systematic review of hate speech automatic detection using natural language processing. CoRR, abs/2106.00742, 2021. URL https://arxiv.org/abs/2106.00742.
- Safety-gymnasium: A unified safe reinforcement learning benchmark, 2023.
- Mistral 7b, 2023.
- Multi-step jailbreaking privacy attacks on chatgpt, 2023.
- Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation, 2023.
- LMSYS. LMSYS Chatbot Arena Leaderboard. https://chat.lmsys.org/?leaderboard, 2023.
- Mary McHugh. Interrater reliability: The kappa statistic. Biochemia medica : časopis Hrvatskoga društva medicinskih biokemičara / HDMB, 22:276–82, 10 2012. doi: 10.11613/BM.2012.031.
- Mistral. Guardrailing. https://docs.mistral.ai/usage/guardrailing/, 2023.
- OpenAI. Gpt-4 technical report, 2023.
- OpenAI. Product Safety Standards. https://openai.com/safety-standards, 2024.
- Digital Trust & Safety Partnership. Trust & Safety Glossary of Terms. https://dtspartnership.org/glossary/, 2023.
- The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.
- Online abuse and human rights: WOAH satellite session at RightsCon 2020. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 1–6, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.alw-1.1. URL https://aclanthology.org/2020.alw-1.1.
- Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023.
- Two contrasting data annotation paradigms for subjective NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 175–190, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.13. URL https://aclanthology.org/2022.naacl-main.13.
- Xstest: A test suite for identifying exaggerated safety behaviours in large language models, 2023.
- "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models, 2023.
- Llama: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Challenges and frontiers in abusive content detection. In Proceedings of the Third Workshop on Abusive Language Online, pp. 80–93, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3509. URL https://aclanthology.org/W19-3509.
- Jailbroken: How does llm safety training fail?, 2023.
- Ethical and social risks of harm from language models, 2021.
- Bertie Vidgen (35 papers)
- Nino Scherrer (16 papers)
- Hannah Rose Kirk (33 papers)
- Rebecca Qian (13 papers)
- Anand Kannappan (6 papers)
- Scott A. Hale (48 papers)
- Paul Röttger (37 papers)