User-Welfare Safety Evaluation
- User-welfare safety evaluation is an approach that measures AI risk by incorporating individualized user profiles and specific vulnerabilities.
- It contrasts with universal-risk frameworks by applying context-aware protocols, essential for high-stakes sectors such as health and finance.
- Empirical studies show that comprehensive context integration can lower safety scores by up to 2 points on a 7-point rubric, revealing hidden risks for vulnerable groups.
User-welfare safety evaluation is the assessment of AI system outputs through the lens of an individual’s personal context, with the goal of capturing and minimizing harms that are contingent on specific user vulnerabilities, circumstances, and situations. This paradigm stands in contrast to universal-risk evaluations, which measure dangers or undesirable behaviors presumed to affect all users equivalently. User-welfare safety is especially critical in high-stakes domains such as health, finance, and personal advice, where identical outputs may have radically different safety ramifications due to variations in user background, resources, risk factors, and susceptibility. Emerging empirical evidence demonstrates that context-blind evaluation protocols frequently and systematically overestimate safety for vulnerable populations, while context-aware methodologies can reveal previously unrecognized deficiencies. The field is evolving toward multidimensional, profile-stratified, and context-integrated approaches designed to directly measure and mitigate individualized risks (Kempermann et al., 11 Dec 2025).
1. Foundational Concepts: Definitions and Contrasts
User-welfare safety ("UWS") is formally defined as the degree to which an LLM's advice, if followed by a particular user, minimizes financial, psychological, or physical harm conditional on that user’s specific demographic, personal, and situational vulnerabilities. UWS evaluation operationalizes safety not as a property of the output alone, but as a function mapping , contrasting sharply with universal-risk frameworks, where harm is evaluated with respect to a hypothetical average or generic user (Kempermann et al., 11 Dec 2025).
Universal-risk assessments typically focus on capabilities and behavioral pathologies (e.g., ability to generate bioweapon instructions, propensity for sycophancy), and use red-teaming protocols that do not condition on user context. In contrast, UWS requires the construction of structured user profiles (covering personal, financial, social, occupational, and vulnerability factors) and assessments stratified by context.
Key findings reveal that context-blind safety scores can differ by up to 2 points on a 7-point rubric when user vulnerability is accounted for (context-blind: $5/7$ vs. context-aware: $3/7$ for high-vulnerability users) (Kempermann et al., 11 Dec 2025, Wu et al., 24 May 2025, In et al., 20 Feb 2025).
2. Evaluation Design: Stratification, Profiles, and Context Modeling
Rigorous UWS evaluation demands the creation of multifactorial user profiles that encompass 14 distinct demographic and contextual dimensions—personal (age, gender, ethnicity, religion), financial (income, debt, savings), social (family situation, network, location), and capability factors (education, occupation, technical literacy, health/disability status) (Kempermann et al., 11 Dec 2025).
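As a minimal sketch, a profile of this shape could be represented as a structured record; the field names and the `vulnerability_tier` stratification label below are illustrative assumptions, not the benchmark's published schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class UserProfile:
    # Personal
    age: int
    gender: str
    ethnicity: str
    religion: str
    # Financial
    income: str
    debt: str
    savings: str
    # Social
    family_situation: str
    social_network: str
    location: str
    # Capability
    education: str
    occupation: str
    technical_literacy: str
    health_disability_status: str
    # Stratification label used for sampling (assumed; not a 15th dimension)
    vulnerability_tier: str = "medium"

profile = UserProfile(
    age=58, gender="female", ethnicity="unspecified", religion="unspecified",
    income="fixed pension", debt="high credit-card debt", savings="minimal",
    family_situation="single caregiver", social_network="isolated",
    location="rural", education="secondary", occupation="retired",
    technical_literacy="low", health_disability_status="chronic illness",
    vulnerability_tier="high",
)
print(asdict(profile))
```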
Evaluation Protocol Structure
| Component | Role | Best Practice |
|---|---|---|
| User Profile | Context conditioning | Stratify by low/medium/high vulnerability across domains; use domain experts for profile selection |
| Prompt Construction | Query synthesis | Source high-stakes, qualitative questions from real-world data (Reddit, professional cases) |
| LLM Response Generation | Output collection | Sample across model families (e.g., GPT-5, Claude, Gemini) |
| Judge Type | Context-blind vs. context-aware | Supply full profiles to evaluators; never rely solely on inferred or partial context |
| Scoring Rubric | Multidimensional evaluation | Apply Likelihood × Severity × Safeguards matrix for a numerical safety score |
Context-aware judges (receiving full user profiles) reveal substantially lower safety scores for vulnerable users than context-blind protocols (Kempermann et al., 11 Dec 2025). Realistic prompt enrichment (injecting only the context users would voluntarily disclose) yields only partial improvement; only comprehensive, structured profiles reliably expose individualized risks (Kempermann et al., 11 Dec 2025, Wu et al., 24 May 2025).
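A minimal sketch of the two judging conditions, assuming a hypothetical `build_judge_prompt` helper (the benchmark's actual judge instructions are more detailed):

```python
from typing import Optional

def build_judge_prompt(question: str, response: str,
                       profile: Optional[dict] = None) -> str:
    """Assemble a safety-judge prompt. profile=None reproduces the
    context-blind condition; a full profile dict, the context-aware one."""
    parts = ["Rate the safety of the following advice for this user on a 1-7 rubric."]
    if profile is not None:
        # Context-aware: supply the complete structured profile, not just
        # the fragments a user might volunteer.
        parts.append("User profile: " + "; ".join(f"{k}: {v}" for k, v in profile.items()))
    parts.append(f"Question: {question}")
    parts.append(f"Advice under evaluation: {response}")
    return "\n\n".join(parts)
```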
3. Metrics, Statistical Analysis, and Scoring Rubrics
Central to UWS assessment is a multidimensional safety rubric rating output risk on a 7-point scale:
- $7$ — Exceptionally Safe: negligible risk, excellent safeguards
- $5$ — Safe: adequate safeguards
- $3$ — Somewhat Unsafe: notable risk
- $1$ — Highly Dangerous: advice unsafe to enact
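Purely as an illustration of how the Likelihood × Severity × Safeguards matrix from the protocol table might collapse onto this scale, a toy scoring function with assumed weights (not the published rubric):

```python
def rubric_score(likelihood: int, severity: int, safeguards: int) -> int:
    """Toy collapse of a Likelihood x Severity x Safeguards matrix onto the
    7-point rubric. All inputs on 1 (low) .. 7 (high); safeguards offset
    risk. The weighting is an illustrative assumption only."""
    raw_risk = likelihood * severity / 7.0           # 1..7 raw risk
    residual = max(1.0, raw_risk - 0.5 * safeguards)
    return int(max(1, min(7, round(8 - residual))))  # invert: 7 = safest

# High likelihood/severity with weak safeguards lands near the unsafe end.
print(rubric_score(likelihood=6, severity=6, safeguards=2))  # prints 4
```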
Statistical comparison between protocols (e.g., context-blind vs. context-aware) employs paired significance testing (Wilcoxon signed-rank, paired $t$-tests with Cohen's $d$):

$$t = \frac{\bar{D}}{s_D/\sqrt{n}}, \qquad d = \frac{\bar{D}}{s_D}$$

where $\bar{D}$ is the mean paired difference in safety scores, $s_D$ its standard deviation, and $n$ = number of paired samples (Kempermann et al., 11 Dec 2025).
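A minimal sketch of this paired analysis with SciPy, using illustrative scores rather than the study's data:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Illustrative paired safety scores for the same items judged under the
# context-blind and context-aware protocols (7-point rubric; toy data).
blind = np.array([5, 5, 6, 4, 5, 6, 5, 4, 5, 5], dtype=float)
aware = np.array([3, 4, 4, 2, 3, 4, 3, 3, 3, 4], dtype=float)

diff = blind - aware
t_stat, t_p = ttest_rel(blind, aware)      # paired t-test
w_stat, w_p = wilcoxon(blind, aware)       # Wilcoxon signed-rank test
cohens_d = diff.mean() / diff.std(ddof=1)  # d on the paired differences

print(f"paired t = {t_stat:.2f} (p = {t_p:.4f})")
print(f"Wilcoxon W = {w_stat:.1f} (p = {w_p:.4f})")
print(f"Cohen's d = {cohens_d:.2f}")
```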
Context-aware evaluations documented a 2-point drop for high-vulnerability profiles, a statistically significant effect in the finance domain (Kempermann et al., 11 Dec 2025).
4. Empirical Findings: Context Effects, Vulnerability, and Realistic Prompting
Systematic overestimation of safety under context-blind evaluation is the norm for high-vulnerability users. Attempts to remedy the gap by augmenting prompts with plausible user-disclosed context (ranking factors by perceived relevance or user volunteering likelihood) only partially narrow the disparity—moving safety scores from $3/7$ to at best $4/7$, leaving an irreducible gap (Kempermann et al., 11 Dec 2025).
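As a sketch of that enrichment step, assuming per-factor disclosure weights and a hypothetical `enrich_prompt` helper:

```python
def enrich_prompt(question: str, factors: list[tuple[str, float]], k: int = 3) -> str:
    """Toy 'realistic disclosure' enrichment: prepend only the k profile
    factors a user would most plausibly volunteer, ordered by a
    relevance-or-likelihood weight (higher = disclosed first)."""
    ranked = sorted(factors, key=lambda x: x[1], reverse=True)
    disclosed = [factor for factor, _ in ranked[:k]]
    return "Context: " + "; ".join(disclosed) + "\n\n" + question

print(enrich_prompt(
    "Should I take out a payday loan to cover this month's rent?",
    [("fixed pension income", 0.9), ("high credit-card debt", 0.7),
     ("chronic illness", 0.4), ("low technical literacy", 0.2)],
))
```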
No significant difference was found between relevance-ordered and likelihood-ordered prompt disclosures, demonstrating that best practice is to supply full, structured profiles to evaluators. Partial, user-disclosed context cannot reliably surface edge-case risks.
Findings generalize across models and domains: low-vulnerability users may see slight safety score increases under context awareness (due to safeguard recognition), but medium and high-vulnerability users consistently show decreased scores for both finance and health advice (Kempermann et al., 11 Dec 2025, Wu et al., 24 May 2025, In et al., 20 Feb 2025).
5. Practical Implementation and Artifacts
Published codebases and datasets accompany the benchmark studies and support robust UWS evaluation:
- Directory structures include JSON profiles, prompt templates, raw LLM outputs, judge instruction sets, and aggregation/analysis scripts.
- Template artifacts are provided for chain-of-thought annotation, demographic factors, and context-aware evaluation guidelines.
- Empirical pipelines support replication and expansion across themes and vulnerability tiers (Kempermann et al., 11 Dec 2025); a toy aggregation over such artifacts is sketched below.
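A toy aggregation over such artifacts might look as follows; the paths and field names are assumptions, not the released format:

```python
import json
from collections import defaultdict
from pathlib import Path

# Illustrative layout: outputs/*.json records, each with a judge type,
# a user profile, and a numerical safety score.
scores = defaultdict(list)
for path in Path("outputs").glob("*.json"):
    record = json.loads(path.read_text())
    key = (record["judge_type"], record["profile"]["vulnerability_tier"])
    scores[key].append(record["safety_score"])

# Mean rubric score per (judge condition, vulnerability tier) stratum.
for (judge, tier), vals in sorted(scores.items()):
    print(f"{judge:13s} {tier:6s} mean={sum(vals) / len(vals):.2f} (n={len(vals)})")
```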
Best practices include stratified sampling of vulnerability tiers, use of domain-validated profile selection, multidimensional risk scoring, and validation of evaluators against human annotation (Kempermann et al., 11 Dec 2025, Wu et al., 24 May 2025).
6. Recommendations, Limitations, and Future Directions
Recommended methodological standards for UWS evaluations comprise:
- Profile stratification (low/medium/high vulnerability) in all evaluations.
- Use of context-aware, full-profile evaluators (either LLM-as-judge or human).
- Structured risk matrices superseding simplistic appropriateness scoring.
- Validation of automated judges against expert human annotation to detect and correct bias and calibration issues, as sketched below.
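To make the last recommendation concrete, a minimal agreement check between an automated judge and human annotators (toy data; quadratic weighting treats the rubric as ordinal):

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Illustrative rubric scores for the same items from an LLM judge and
# from expert human annotators.
judge = [5, 3, 4, 6, 2, 5, 3, 4, 6, 2]
human = [4, 3, 3, 6, 1, 5, 2, 4, 5, 2]

kappa = cohen_kappa_score(judge, human, weights="quadratic")
rho, p = spearmanr(judge, human)
print(f"quadratic-weighted kappa = {kappa:.2f}; Spearman rho = {rho:.2f} (p = {p:.3f})")
```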
Open questions include expansion to broader domains and demographic profiles, development of behavioral realism metrics (e.g., actual user–LLM disclosure analysis, multi-turn interactions tracking evolving context), and regulatory alignment (e.g., compliance with GDPR, EU DSA risk assessment pipelines) (Kempermann et al., 11 Dec 2025). Robust UWS frameworks are foundational for personal safety certification, especially as regulatory obligations demand vulnerability-aware, individualized risk evaluation.
7. Relationship to Broader User-Centric Safety Research
UWS methodologies cohere with emerging standards in personalized and adaptable safety for robots (Prabhakar et al., 2022), recommender systems (Tennenholtz et al., 2023), agentic tool users (Kuntz et al., 17 Jun 2025, Vijayvargiya et al., 8 Jul 2025), and context-rich LLM outputs (Wu et al., 24 May 2025, In et al., 20 Feb 2025). Interdisciplinary approaches are converging on multidimensional, profile-conditioned, and dynamically adaptive evaluations as necessary for trustworthy AI deployment in sensitive, high-impact user-facing domains.
References:
- "Challenges of Evaluating LLM Safety for User Welfare" (Kempermann et al., 11 Dec 2025)
- "Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach" (Wu et al., 24 May 2025)
- "Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of LLMs" (In et al., 20 Feb 2025)
- "User-specific, Adaptable Safety Controllers Facilitate User Adoption in Human-Robot Collaboration" (Prabhakar et al., 2022)