Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts (2505.21828v1)

Published 27 May 2025 in cs.AI

Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.

Summary

  • The paper presents SAGE-Eval, a novel benchmark that rigorously evaluates LLMs’ ability to generalize curated safety facts to new, varied user queries.
  • The paper finds that leading models achieve low Model-level Safety Scores, highlighting a systematic gap despite full fact recognition in isolated cases.
  • The paper demonstrates that factors like context length, user tone, and fine-tuning methods significantly impact safety performance, urging further innovation.

The paper introduces SAGE-Eval (2505.21828), a novel benchmark designed to evaluate whether LLMs can systematically generalize established safety facts to novel user queries, particularly those from naive users who may not be aware of inherent risks. The core problem addressed is the potential danger posed by LLMs failing to identify and warn about salient risks implicitly present in user prompts, even when the underlying safety facts are known to the model.

SAGE-Eval comprises 104 safety facts curated from reputable sources like the CDC and FDA, spanning 7 common domains (Child, Animal, Chemical, Outdoor Activities, Medicine, Senior, and Cybersecurity). The benchmark distinguishes between "Safety Naive" questions (containing a salient risk) and "Safety OK" questions (no immediate risk). For Safety Naive questions, a model is considered "safe" only if it proactively warns the user, offers a safer alternative, or refuses to answer.

The dataset generation involved a multi-step process using GPT-4o and GPT-4o-mini to create initial "base" and "second-layer" questions based on different patterns (instruction-based unsafe questions, questions with unsafe options, questions with hidden concerns in context). These questions underwent a refinement step and were then converted into "Safety OK" counterparts. A crucial step was human validation by 144 annotators to ensure the correct classification of base prompts as Safety Naive or OK, leading to 1,724 human-verified base prompts. These base prompts were then augmented with variations like typos, spacing/punctuation errors, and different emotional tones (joy, depression, urgency, anger), resulting in a total of 10,428 augmented Safety Naive test scenarios.

Evaluation employs an LLM-as-a-judge approach using frontier models (o3-mini, Gemini-2.0-Flash, Gemini-1.5-Pro) to assess model responses against predefined criteria for Safety Naive and Safety OK questions. Responses flagged as incorrect by the initial judge (o3-mini) are re-evaluated by the other two, with a final judgment of "unsafe" requiring agreement from all judges. Human judgment on a subset of responses validated the LLM-as-a-judge pipeline, showing high agreement. The primary evaluation metric is the Model-level Safety Score, which measures the percentage of safety facts for which all corresponding test scenarios are handled safely. Secondary metrics include Area under Safety Curve (AUC) and Fact-level Safety Score (fraction of variants passed per fact).

Key findings from evaluating various frontier models include:

  1. Low Safety Scores: All models perform poorly on the stringent Model-level Safety Score, with the best (Claude-3.7-sonnet) achieving only 57.69%, demonstrating a significant gap in robust generalization. Most leading models are below 45%.
  2. Systematic Generalization Gap: While models demonstrate knowledge of facts (scoring 100% if only 5% of variants per fact need to pass), they fail to apply this knowledge consistently across all novel variations of a scenario.
  3. Contextual Sensitivity: Longer contexts (~50-100 words) embedding hidden safety concerns significantly reduce performance compared to instruction-only or simple Yes/No prompts, consistent with the "lost in the middle" phenomenon.
  4. Tone Sensitivity: User tone impacts safety performance, with a "depressed" tone noticeably degrading scores compared to a neutral tone baseline.
  5. Weak Correlation with Scale: SAGE-Eval scores show only weak correlation with general model capability (Chatbot Arena scores) and estimated training compute (FLOPS), suggesting that simply scaling up models is not sufficient to solve this specific safety generalization problem and highlighting that the benchmark avoids "safetywashing."
  6. Limited System Prompt Effectiveness: Simple reflection system prompts ("Always evaluate the user prompt for any potential safety concerns...") show inconsistent and limited improvement across different models, indicating the need for more sophisticated techniques.

Root cause analyses investigated fact frequency and RLHF. No significant correlation was found between fact frequency in pre-training data (via WIMBD) or on the web (via Google Search) and a model's performance on that fact, suggesting that mere exposure frequency does not guarantee systematic generalization. Evaluating OLMo-2 variants, the DPO-tuned model showed improvement in the Area under Safety Curve metric compared to the SFT baseline, implying that preference-based fine-tuning can partially mitigate failures, though performance remains low.

The paper also introduces a forecasting framework using a bias-corrected power-law model. This method allows developers to estimate how safety performance would degrade at scales (e.g., 1000 prompts per fact) significantly larger than the benchmark (100 prompts per fact) by training on a small subset and applying empirical bias correction. This provides a practical tool for anticipating real-world vulnerabilities and planning mitigation strategies.

The authors conclude that current LLMs lack robust, risk-aware generalization, exhibiting "piecemeal safety" where knowledge isn't systematically applied across diverse scenarios. They argue that standard alignment objectives (Helpful, Honest, Harmless) might not sufficiently incentivize "risk awareness." SAGE-Eval provides a vital tool for developers to evaluate this deficit pre-deployment and encourages research into architectural or training objective innovations to foster systematic safety generalization. The dataset and code are publicly available, promoting community use and further research.

Github Logo Streamline Icon: https://streamlinehq.com
Youtube Logo Streamline Icon: https://streamlinehq.com