SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models (2311.08370v2)

Published 14 Nov 2023 in cs.CL

Abstract: The past year has seen rapid acceleration in the development of LLMs. However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. Trained annotators labelled every model response to SST (n = 3,000). We use these annotations to evaluate five AI safety filters (which assess whether a model's response is unsafe given a prompt) as a way of automatically evaluating models' performance on SST. The filters' performance varies considerably. There are also differences across the five harm areas, and on the unsafe versus safe responses. The widely-used Perspective API has 72% accuracy and a newly-created zero-shot prompt to OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper contains prompts and responses that relate to child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.

Authors (7)
  1. Bertie Vidgen (35 papers)
  2. Nino Scherrer (16 papers)
  3. Hannah Rose Kirk (33 papers)
  4. Rebecca Qian (13 papers)
  5. Anand Kannappan (6 papers)
  6. Scott A. Hale (48 papers)
  7. Paul Röttger (37 papers)
Citations (17)

Summary

Evaluating Safety Risks in LLMs via SimpleSafetyTests

The proliferation of LLMs has underscored the critical need for robust safety measures that prevent these models from generating harmful outputs. The paper "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models" addresses this need by providing a structured test suite for rapidly and systematically identifying critical safety weaknesses in these models. The authors evaluate 15 LLMs (11 open-access or open-source and four closed-source) against a curated set of prompts spanning five harm categories: Suicide, Self-Harm, and Eating Disorders; Physical Harm; Illegal and Highly Regulated Items; Scams and Fraud; and Child Abuse.
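To make the setup concrete, below is a minimal sketch of how an SST-style evaluation harness might be organised. The five harm areas follow the paper, but the CSV prompt-file format and the generate_response callable are illustrative assumptions, not the authors' actual tooling.

```python
import csv
from collections import defaultdict
from typing import Callable

# The five harm areas defined by SimpleSafetyTests.
HARM_AREAS = [
    "Suicide, Self-Harm, and Eating Disorders",
    "Physical Harm",
    "Illegal and Highly Regulated Items",
    "Scams and Fraud",
    "Child Abuse",
]

def run_sst(prompt_file: str, generate_response: Callable[[str], str]) -> dict:
    """Collect one model response per test prompt, grouped by harm area.

    Assumes a CSV with "harm_area" and "prompt" columns (hypothetical format).
    """
    responses = defaultdict(list)
    with open(prompt_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["harm_area"] not in HARM_AREAS:
                continue  # skip rows outside the five defined harm areas
            reply = generate_response(row["prompt"])
            responses[row["harm_area"]].append((row["prompt"], reply))
    return responses  # responses are then labelled safe/unsafe by annotators
```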

Key Findings and Results

The SimpleSafetyTests (SST) suite comprises 100 handcrafted prompts designed to elicit potentially unsafe responses from LLMs. The paper reports that, without a safety-emphasising system prompt, 20% of all model responses were unsafe. Closed-source models performed markedly better, with only 2% of their responses judged unsafe compared to 27% for the open models. Notably, models such as Claude 2.1 and Falcon (40B) gave no unsafe responses, while others such as Dolly v2 (12B) showed considerable safety deficiencies, responding unsafely to 69% of the prompts.

The introduction of a safety-emphasising system prompt reduced unsafe responses by an average of nine percentage points, illustrating its utility in enhancing model safety. However, this system prompt did not eliminate safety risks entirely, highlighting that the underlying issue may extend beyond simple inference-time interventions.
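As an illustration of this intervention, the sketch below prepends a safety-emphasising system prompt to a chat-style request. The system prompt wording here is an assumption for demonstration; the paper's exact prompt is not reproduced.

```python
# Build a chat-style message list, optionally with a safety-emphasising
# system prompt prepended. The system prompt text is illustrative only.
def build_messages(user_prompt: str, use_safety_prompt: bool = True) -> list[dict]:
    messages = []
    if use_safety_prompt:
        messages.append({
            "role": "system",
            "content": (
                "You are a helpful assistant. Prioritise user safety and "
                "refuse requests for harmful, illegal, or dangerous content."
            ),
        })
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

In the paper's setup, the same 100 test prompts are run with and without such a system prompt, and the change in the rate of unsafe responses is measured.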

Implications and Future Directions

The implications of this research are significant for both the theoretical understanding and the practical deployment of LLMs. From a theoretical perspective, the paper suggests that LLM safety is inherently tied to context-specific evaluations. The test suite's ability to surface safety risks provides a valuable tool for understanding how LLMs can be prompted into producing harmful content, thereby guiding the development of more robust alignment techniques.

Practically, the findings underscore the urgency of deploying LLMs with additional layers of safety. Developers and commercial labs are encouraged to integrate safety-emphasising system prompts and, beyond that, techniques such as red-teaming and reinforcement learning from human feedback (RLHF) to strengthen safeguards. Moreover, the paper highlights the disparity in safety performance between open and closed-source models, perhaps indicating a need for increased transparency and collaboration in the open-source community to close these gaps.

Automated Evaluation of Safety

The paper also explores the feasibility of automatically evaluating LLM responses with existing AI safety filters. Accuracy varies considerably across filters: a newly created zero-shot prompt to OpenAI's GPT-4 performs best at 89% accuracy, while the widely used Perspective API reaches 72%. This suggests potential for automated systems that monitor and evaluate LLM outputs, facilitating large-scale safety assessments.
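A rough sketch of such a zero-shot safety filter is shown below. The judging template and the call_judge callable are assumptions; the paper's exact GPT-4 prompt may differ.

```python
from typing import Callable

# Hypothetical judging template for a zero-shot safety filter.
JUDGE_TEMPLATE = """You are assessing whether an AI assistant's response is safe.

User prompt:
{prompt}

Assistant response:
{response}

Answer with exactly one word: "safe" or "unsafe"."""

def zero_shot_filter(prompt: str, response: str,
                     call_judge: Callable[[str], str]) -> str:
    """Ask a strong LLM (e.g. GPT-4) to judge a response; returns "safe" or "unsafe"."""
    verdict = call_judge(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    return "unsafe" if "unsafe" in verdict.lower() else "safe"
```

The filter's verdicts can then be compared against the trained annotators' labels to estimate its accuracy, as the paper does for the five filters it evaluates.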

Limitations and Prospects

While SimpleSafetyTests provides a clear-cut methodology for evaluating safety risks, it is limited in breadth: the test prompts are English-only and cover a fixed set of harm categories. Future expansions could add multilingual prompts and additional harm areas to increase relevance across diverse applications. The paper also advocates perturbation-based testing to assess robustness, a promising avenue for further exploration.
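As a rough illustration of what perturbation-based testing could involve, the snippet below applies simple surface-level edits to a test prompt. These specific perturbations are assumptions; the paper does not prescribe particular operations.

```python
# Generate simple, illustrative perturbations of a test prompt to check
# whether a model's refusal behaviour is robust to surface-level changes.
def perturb(prompt: str) -> list[str]:
    return [
        prompt.lower(),                        # all lower-case
        prompt.upper(),                        # all upper-case
        prompt.rstrip("?.!") + "??",           # altered punctuation
        "Hypothetically speaking, " + prompt,  # framing prefix
    ]
```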

Overall, this paper lays the groundwork for the systematic evaluation of LLM safety. It calls for strengthened safety measures that reflect the nuanced challenges posed by the rapid integration of LLMs into various applications. The structured methodology and initial findings serve as a roadmap for future research aimed at ensuring the safe and responsible development of AI technologies.
