
S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models (2405.14191v3)

Published 23 May 2024 in cs.CR and cs.CL

Abstract: LLMs have gained considerable attention for their revolutionary capabilities. However, there is also growing concern about their safety implications, making a comprehensive safety evaluation for LLMs urgently needed before model deployment. In this work, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM Mt combined with a range of test selection strategies to automatically construct a high-quality test suite for safety evaluation. The key to automating this process is a novel expert safety-critique LLM Mc that can quantify the riskiness score of an LLM's response and additionally produce risk tags and explanations. In addition, the generation process is guided by a carefully designed risk taxonomy with four different levels, covering comprehensive and multi-dimensional safety risks of concern. Based on these, we systematically construct a new and large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and the accompanying safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks and models. S-Eval is extensively evaluated on 20 popular and representative LLMs. The results confirm that S-Eval can better reflect and inform the safety risks of LLMs compared to existing benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.
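
The abstract describes a generate-attack-judge pipeline: the testing LLM Mt produces base risk prompts for each taxonomy category, a set of adversarial instruction attacks expands them into attack prompts, and the critique LLM Mc scores each target model's response with a riskiness score, risk tags, and an explanation. The minimal Python sketch below illustrates that control flow only; every name in it (TestingLLM, CritiqueLLM, Verdict, ATTACKS, evaluate) is a hypothetical placeholder, not the authors' actual interface.

```python
# Hypothetical sketch of the S-Eval evaluation loop described in the abstract.
# All class and function names are placeholders, not the paper's real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    riskiness: float        # quantified riskiness score of a response
    risk_tags: List[str]    # taxonomy categories the response falls under
    explanation: str        # natural-language justification from the critic


class TestingLLM:
    """Expert testing LLM (Mt): generates base risk prompts per taxonomy node."""
    def generate(self, risk_category: str, n: int) -> List[str]:
        raise NotImplementedError  # would call the trained generator model


class CritiqueLLM:
    """Expert safety-critique LLM (Mc): judges a (prompt, response) pair."""
    def judge(self, prompt: str, response: str) -> Verdict:
        raise NotImplementedError  # would call the trained critic model


# Adversarial instruction attacks that turn a base prompt into attack prompts
# (the paper uses 10 popular attacks; left empty here as a placeholder).
Attack = Callable[[str], str]
ATTACKS: List[Attack] = []


def evaluate(target_llm: Callable[[str], str],
             mt: TestingLLM, mc: CritiqueLLM,
             taxonomy: List[str], per_category: int) -> List[Verdict]:
    """Run base and attack prompts against the target model and collect verdicts."""
    verdicts: List[Verdict] = []
    for category in taxonomy:
        for base_prompt in mt.generate(category, per_category):
            # Each base prompt plus its derived attack variants.
            for prompt in [base_prompt] + [atk(base_prompt) for atk in ATTACKS]:
                response = target_llm(prompt)
                verdicts.append(mc.judge(prompt, response))
    return verdicts
```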

Authors (10)
  1. Xiaohan Yuan (7 papers)
  2. Jinfeng Li (40 papers)
  3. Dongxia Wang (18 papers)
  4. Yuefeng Chen (44 papers)
  5. Xiaofeng Mao (35 papers)
  6. Longtao Huang (27 papers)
  7. Hui Xue (109 papers)
  8. Wenhai Wang (123 papers)
  9. Kui Ren (169 papers)
  10. Jingyi Wang (105 papers)
Citations (3)