
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models (2502.15086v1)

Published 20 Feb 2025 in cs.CL

Abstract: As the use of LLM agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety, but they define safety by relying heavily on general standards, overlooking user-specific ones. However, safety standards for LLMs may vary based on user-specific profiles rather than being universally consistent across all users. This raises a critical research question: do LLM agents act safely when user-specific safety standards are considered? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when user-specific safety standards are considered, marking a new discovery in this field. To address this vulnerability, we propose a simple chain-of-thought remedy and demonstrate its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
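To make the evaluation setup concrete, here is a minimal sketch of what a user-specific safety evaluation loop might look like. The data schema (`profile` and `instruction` fields), the file name, and the refusal heuristic are assumptions for illustration only; the actual U-SafeBench format and judging procedure are defined in the linked repository.

```python
# Minimal sketch of a user-specific safety evaluation loop.
# Assumes a JSONL file with "profile" and "instruction" fields;
# the real U-SafeBench schema and judge may differ.
import json

def build_query(profile: str, instruction: str) -> str:
    """Prepend the user profile so the model can weigh per-user risk."""
    return f"User profile: {profile}\nUser request: {instruction}"

def is_refusal(response: str) -> bool:
    """Crude refusal heuristic; the paper uses a stronger judging step."""
    markers = ("i can't", "i cannot", "i'm sorry", "i am unable")
    return response.strip().lower().startswith(markers)

def evaluate(samples, generate):
    """`generate` is any text-generation callable (e.g., an API client).

    A response counts as user-specifically safe here when the model
    refuses an instruction that is risky for that particular profile.
    """
    safe = 0
    for s in samples:
        response = generate(build_query(s["profile"], s["instruction"]))
        safe += is_refusal(response)
    return safe / len(samples)

if __name__ == "__main__":
    with open("u_safebench.jsonl") as f:  # hypothetical file name
        samples = [json.loads(line) for line in f]
    # score = evaluate(samples, my_model_generate)
```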


Summary

This paper extends LLM safety evaluation to account for user-specific risks, introducing a novel benchmark and quantitative metrics.

  • U-SafeBench, a new dataset, combines over 150 user profiles with 1,900 paired instructions to assess safety risks across physical-harm, mental-health, and illicit-activity scenarios.
  • Benchmarking 18 widely used LLMs yields an average user-specific safety score of only 18.6%, exposing a marked trade-off between safety and helpfulness across diverse user contexts.
  • A zero-shot chain-of-thought remedy substantially improves user-specific safety (e.g., raising one model's score from 63.8% to 83.5%) with minimal loss in response quality; a prompt sketch follows this list.
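The chain-of-thought remedy can be approximated with a zero-shot prompt that asks the model to reason about the user profile before answering. The wording below is an assumption for illustration, not the paper's exact prompt.

```python
# Illustrative zero-shot chain-of-thought safety prompt: the model is
# asked to reason about profile-specific risk before responding.
# The template text is a sketch, not the paper's actual prompt.
COT_SAFETY_TEMPLATE = (
    "User profile: {profile}\n"
    "User request: {instruction}\n\n"
    "First, think step by step about whether fulfilling this request "
    "could be unsafe for a user with this profile. "
    "If it could be unsafe, refuse; otherwise, answer helpfully."
)

def cot_safety_query(profile: str, instruction: str) -> str:
    """Render the chain-of-thought safety prompt for one benchmark sample."""
    return COT_SAFETY_TEMPLATE.format(profile=profile, instruction=instruction)
```

The design intuition is that an explicit reasoning step forces the model to connect the user's profile to the request's risk before committing to an answer, which is what the reported score gains (e.g., 63.8% to 83.5%) suggest the remedy achieves.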
