- The paper presents SAGE-Eval, a novel benchmark that rigorously evaluates LLMs’ ability to generalize curated safety facts to new, varied user queries.
- The paper finds that leading models achieve low Model-level Safety Scores, highlighting a systematic gap despite full fact recognition in isolated cases.
- The paper demonstrates that factors like context length, user tone, and fine-tuning methods significantly impact safety performance, urging further innovation.
The paper introduces SAGE-Eval (2505.21828), a novel benchmark designed to evaluate whether LLMs can systematically generalize established safety facts to novel user queries, particularly those from naive users who may not be aware of inherent risks. The core problem addressed is the potential danger posed by LLMs failing to identify and warn about salient risks implicitly present in user prompts, even when the underlying safety facts are known to the model.
SAGE-Eval comprises 104 safety facts curated from reputable sources like the CDC and FDA, spanning 7 common domains (Child, Animal, Chemical, Outdoor Activities, Medicine, Senior, and Cybersecurity). The benchmark distinguishes between "Safety Naive" questions (containing a salient risk) and "Safety OK" questions (no immediate risk). For Safety Naive questions, a model is considered "safe" only if it proactively warns the user, offers a safer alternative, or refuses to answer.
The dataset generation involved a multi-step process using GPT-4o and GPT-4o-mini to create initial "base" and "second-layer" questions based on different patterns (instruction-based unsafe questions, questions with unsafe options, questions with hidden concerns in context). These questions underwent a refinement step and were then converted into "Safety OK" counterparts. A crucial step was human validation by 144 annotators to ensure the correct classification of base prompts as Safety Naive or OK, leading to 1,724 human-verified base prompts. These base prompts were then augmented with variations like typos, spacing/punctuation errors, and different emotional tones (joy, depression, urgency, anger), resulting in a total of 10,428 augmented Safety Naive test scenarios.
Evaluation employs an LLM-as-a-judge approach using frontier models (o3-mini, Gemini-2.0-Flash, Gemini-1.5-Pro) to assess model responses against predefined criteria for Safety Naive and Safety OK questions. Responses flagged as incorrect by the initial judge (o3-mini) are re-evaluated by the other two, with a final judgment of "unsafe" requiring agreement from all judges. Human judgment on a subset of responses validated the LLM-as-a-judge pipeline, showing high agreement. The primary evaluation metric is the Model-level Safety Score, which measures the percentage of safety facts for which all corresponding test scenarios are handled safely. Secondary metrics include Area under Safety Curve (AUC) and Fact-level Safety Score (fraction of variants passed per fact).
Key findings from evaluating various frontier models include:
- Low Safety Scores: All models perform poorly on the stringent Model-level Safety Score, with the best (Claude-3.7-sonnet) achieving only 57.69%, demonstrating a significant gap in robust generalization. Most leading models are below 45%.
- Systematic Generalization Gap: While models demonstrate knowledge of facts (scoring 100% if only 5% of variants per fact need to pass), they fail to apply this knowledge consistently across all novel variations of a scenario.
- Contextual Sensitivity: Longer contexts (~50-100 words) embedding hidden safety concerns significantly reduce performance compared to instruction-only or simple Yes/No prompts, consistent with the "lost in the middle" phenomenon.
- Tone Sensitivity: User tone impacts safety performance, with a "depressed" tone noticeably degrading scores compared to a neutral tone baseline.
- Weak Correlation with Scale: SAGE-Eval scores show only weak correlation with general model capability (Chatbot Arena scores) and estimated training compute (FLOPS), suggesting that simply scaling up models is not sufficient to solve this specific safety generalization problem and highlighting that the benchmark avoids "safetywashing."
- Limited System Prompt Effectiveness: Simple reflection system prompts ("Always evaluate the user prompt for any potential safety concerns...") show inconsistent and limited improvement across different models, indicating the need for more sophisticated techniques.
Root cause analyses investigated fact frequency and RLHF. No significant correlation was found between fact frequency in pre-training data (via WIMBD) or on the web (via Google Search) and a model's performance on that fact, suggesting that mere exposure frequency does not guarantee systematic generalization. Evaluating OLMo-2 variants, the DPO-tuned model showed improvement in the Area under Safety Curve metric compared to the SFT baseline, implying that preference-based fine-tuning can partially mitigate failures, though performance remains low.
The paper also introduces a forecasting framework using a bias-corrected power-law model. This method allows developers to estimate how safety performance would degrade at scales (e.g., 1000 prompts per fact) significantly larger than the benchmark (100 prompts per fact) by training on a small subset and applying empirical bias correction. This provides a practical tool for anticipating real-world vulnerabilities and planning mitigation strategies.
The authors conclude that current LLMs lack robust, risk-aware generalization, exhibiting "piecemeal safety" where knowledge isn't systematically applied across diverse scenarios. They argue that standard alignment objectives (Helpful, Honest, Harmless) might not sufficiently incentivize "risk awareness." SAGE-Eval provides a vital tool for developers to evaluate this deficit pre-deployment and encourages research into architectural or training objective innovations to foster systematic safety generalization. The dataset and code are publicly available, promoting community use and further research.