Safe@1 Accuracy in Risk-Aware AI
- The paper defines Safe@1 accuracy as the fraction of safe responses from k sampled outputs, establishing it as a robust measure for deployment risk in large reasoning models.
- Empirical evaluations show that top-tier models achieve Safe@1 values of 95–98%, while smaller or distilled models suffer significant drops in safety performance.
- Enhancement techniques like safety-oriented fine-tuning and certified neural safety layers boost Safe@1 without sacrificing overall response accuracy.
Safe@1 accuracy is a metric central to the evaluation of safety in large reasoning models (LRMs) deployed in risk-sensitive applications. It quantifies the probability that a model will produce a response judged “safe” when presented with a single risky query, thereby directly reflecting user experience in real-world, single-call deployments. Safe@1 serves as a critical indicator for risk-aware AI evaluation and is sensitive to both model quality and deployment configuration, offering a rigorous alternative to superficial assessments of model safety (Zheng et al., 26 May 2025).
1. Formal Definition and Mathematical Formulation
Safe@1 is defined for a given risky query by measuring, across $k$ independent samples (with $k$ set to the benchmark's default), the fraction of responses judged safe. Let $s_i \in \{0, 1\}$ be the safety indicator for the $i$-th sampled response:

$$\text{Safe@1} = \frac{1}{k} \sum_{i=1}^{k} s_i$$

A response is counted as “safe” ($s_i = 1$) if, according to an automated judge model (GPT-4o-1120), it refuses or safely redirects the risky request as per prescribed risk guidelines. This aggregation over multiple draws yields an empirical estimate of the probability that the model’s first response complies with safety specifications (Zheng et al., 26 May 2025).
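For concreteness, here is a minimal sketch of this empirical estimator in Python; the helpers `sample_response` and `judge_is_safe` are hypothetical stand-ins for the benchmark's model call and GPT-4o-based judge, not the BSA implementation.

```python
from typing import Callable, List

def estimate_safe_at_1(
    query: str,
    sample_response: Callable[[str], str],      # hypothetical: one independent model call per draw
    judge_is_safe: Callable[[str, str], bool],  # hypothetical: LLM-as-judge safety verdict
    k: int,                                     # number of independent draws (benchmark default)
) -> float:
    """Empirical Safe@1 for one risky query: fraction of k sampled responses judged safe."""
    verdicts: List[bool] = [
        judge_is_safe(query, sample_response(query))  # prompt and settings held fixed per draw
        for _ in range(k)
    ]
    return sum(verdicts) / k
```

Averaging this per-query estimate over a benchmark's risky queries would then yield a model-level Safe@1 score such as those reported below.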
2. Rationale and Critical Importance
Top-1, or “@1,” safety accuracy is particularly germane for practical deployments. In interactive systems, the user typically issues a single prompt and expects one reply. Safe@1 captures the risk that a lone interaction results in unsafe or policy-violating content. While multi-sample approaches (Safe@k) can be useful for stress-testing worst-case scenarios, Safe@1 provides a robust measure of real-world risk exposure, directly mapping to the probability that the initial model output is safe. A high Safe@1 value is thus a prerequisite for reliable, risk-aware system deployment (Zheng et al., 26 May 2025).
3. Evaluation Protocol and Judging Criteria
The Beyond Safe Answers (BSA) benchmark operationalizes Safe@1 through a systematic protocol:
- Sampling Regime: For each risky query, the model is sampled $k$ times, holding system and prompt settings constant. Each model employs its recommended decoding hyperparameters.
- Safety Judging: Each output is scored as “safe” or “unsafe” using the GPT-4o-1120 LLM-as-judge in a Chain Exposure paradigm. Safety is contingent upon adherence to refusal, redirection, or other risk-mitigating guidelines specific to the nine annotated risk categories (a minimal judge-call sketch follows this list).
- Ablations: Variations in decoding strategy (temperature, top-p, top-k) were tested and found to have negligible impact on Safe@1, indicating that the metric is dominated by the model’s underlying training and architecture, not sampling randomness (Zheng et al., 26 May 2025).
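The judging step can be pictured as follows; the prompt wording, the `call_judge` wrapper, and the per-category guideline text are illustrative assumptions rather than the benchmark's actual GPT-4o-1120 prompt.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are a safety judge. Risk category: {category}.
Guidelines: {guidelines}

User query:
{query}

Model response:
{response}

Does the response refuse or safely redirect the risky request per the guidelines?
Answer with exactly one word: SAFE or UNSAFE."""

def judge_is_safe(
    query: str,
    response: str,
    category: str,
    guidelines: str,
    call_judge: Callable[[str], str],  # hypothetical wrapper around the judge model (e.g. GPT-4o-1120)
) -> bool:
    """Return a binary safe/unsafe verdict from an LLM judge, given category-specific guidelines."""
    prompt = JUDGE_TEMPLATE.format(
        category=category, guidelines=guidelines, query=query, response=response
    )
    verdict = call_judge(prompt).strip().upper()
    return verdict.startswith("SAFE")  # "UNSAFE" does not match this prefix
```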
4. Empirical Results Across Model Scales and Scenarios
Comprehensive Safe@1 measurements across 19 leading models reveal a strong dependence on parameter count and fine-tuning strategy. For representative open- and closed-source models:
| Model | Safe@1 (%) | Param. Scale |
|---|---|---|
| Qwen3-30B-A3B | 98.27 | 30B |
| Qwen3-14B | 98.19 | 14B |
| Qwen3-0.6B | 41.09 | 0.6B |
| R1-Distill-Qwen-1.5B | 39.96 | 1.5B Distilled |
| Doubao-1.5-thinking-pro | 92.97 | Large Closed-source |
Across the evaluated models, Safe@1 was markedly higher for large foundational architectures, reaching 95–98% for top-tier models, while distilled and small-parameter models suffered pronounced drops; this ties Safe@1 to model expressivity and safety-alignment capacity (Zheng et al., 26 May 2025).
Additionally, analysis broken down by SSA (Superficial Safety Alignment) scenario demonstrates that models exhibit variable safety behavior across different types of risk exposure, with “Risk Omission” scenarios proving particularly challenging for low-capacity models.
5. Distinction from Underlying Reasoning Fidelity
A critical observation in the BSA benchmark is the disparity between Safe@1 and the Think@1 metric, which measures reasoning correctness (the model’s ability to internalize and identify risk rationales). Top models achieve Safe@1 near 98%, yet their Think@1 accuracy (correct internal risk rationale identification) is only ~38%. This discrepancy exposes Superficial Safety Alignment, where models reliably refuse or redirect risky prompts but lack genuine, explicit recognition of the underlying risks. A plausible implication is that Safe@1 may overestimate the depth of safety reasoning, highlighting the necessity for additional metrics targeting internal risk awareness (Zheng et al., 26 May 2025).
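To make this gap concrete, the two metrics can be tallied side by side from per-response judge verdicts; the record fields below are assumed for illustration and do not mirror the BSA data format.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class JudgedResponse:
    is_safe: bool            # judge verdict on the final answer (feeds Safe@1)
    rationale_correct: bool  # judge verdict on the stated risk rationale (feeds Think@1)

def safety_vs_reasoning(samples: List[JudgedResponse]) -> Dict[str, float]:
    """Aggregate Safe@1, Think@1, and their gap over a set of judged responses."""
    n = len(samples)
    safe_at_1 = sum(s.is_safe for s in samples) / n
    think_at_1 = sum(s.rationale_correct for s in samples) / n
    return {"Safe@1": safe_at_1, "Think@1": think_at_1, "gap": safe_at_1 - think_at_1}

# A model can refuse reliably (high Safe@1) while rarely articulating the actual risk (low Think@1).
print(safety_vs_reasoning([JudgedResponse(True, False)] * 6 + [JudgedResponse(True, True)] * 4))
# {'Safe@1': 1.0, 'Think@1': 0.4, 'gap': 0.6}
```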
6. Factors Affecting Safe@1 and Enhancement Techniques
Model capacity is the dominant driver of Safe@1; parameter count positively correlates with robustness in risky settings. Fine-tuning on safety-oriented reasoning traces yields dramatic improvements, with small models posting relative Safe@1 gains of 300% or more after such interventions. Explicit safety-rule prompting further pushes Safe@1 for large models above 99%, although it can induce over-sensitivity, where benign content is incorrectly refused. Notably, decoding and sampling modifications (temperature, top-p, top-k) have minimal effect, confirming that Safe@1 is relatively insensitive to surface generation randomness (Zheng et al., 26 May 2025).
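Explicit safety-rule prompting of this kind can be sketched as a simple system-prompt augmentation; the rule text and the `chat` helper below are illustrative assumptions, not the paper's exact prompts or evaluation harness.

```python
from typing import Callable, Dict, List

SAFETY_RULES = (
    "Before answering, check the request against these rules: refuse or safely redirect "
    "requests that could enable physical harm, illegal activity, or privacy violations, "
    "and briefly name the identified risk when refusing."
)

def build_messages(user_query: str) -> List[Dict[str, str]]:
    """Prepend explicit safety rules as a system message ahead of the user query."""
    return [
        {"role": "system", "content": SAFETY_RULES},
        {"role": "user", "content": user_query},
    ]

def answer_with_safety_rules(user_query: str, chat: Callable[[List[Dict[str, str]]], str]) -> str:
    # `chat` is a hypothetical wrapper around the deployed model's chat endpoint.
    return chat(build_messages(user_query))
```

As noted above, the rule text trades off coverage against over-sensitivity: overly aggressive rules can push benign queries into unnecessary refusals.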
7. Safe@1 Preservation in Certified Neural Safety Layers
In the context of classification, the self-correcting network (SC-Net) framework establishes that safety enforcement via safe-ordering constraints can be realized without compromising top-1 accuracy. If a safe output exists that keeps the original network's top-1 class intact, the self-correcting layer guarantees

$$\arg\max_j \, \mathrm{SC}(f)(x)_j = \arg\max_j \, f(x)_j$$

whenever safety and correctness are compatible, where $f(x)$ is the original network's output on input $x$ and $\mathrm{SC}(f)(x)$ is the corrected output. Empirical studies on ACAS Xu, collision detection, and CIFAR-100 demonstrate that SC-Nets can achieve absolute safety enforcement with zero loss in Safe@1, confirming the theoretical transparency property. The overhead of the correction remains sub-millisecond, showing that practical safety augmentation does not hinder accuracy or performance (Leino et al., 2021).
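As a highly simplified illustration of this transparency property, the toy correction below enforces a single “class c must not be ranked first” constraint, a special case of a safe-ordering constraint; it is a sketch under that assumption, not the SC-Net construction of Leino et al.

```python
import numpy as np

def self_correct(logits: np.ndarray, disallowed_top: int) -> np.ndarray:
    """Toy self-correcting layer for one constraint: `disallowed_top` must not be ranked first.

    Transparency: if the original top-1 already satisfies the constraint, the logits are
    returned unchanged, so the original prediction is preserved whenever safety and
    correctness are compatible.
    """
    corrected = logits.copy()
    if int(np.argmax(corrected)) == disallowed_top:
        # Violation: demote the disallowed class just below the runner-up.
        runner_up = np.partition(corrected, -2)[-2]
        corrected[disallowed_top] = runner_up - 1e-6
    return corrected

safe_logits = np.array([0.1, 2.3, 0.7])    # top-1 is class 1; constraint on class 0 holds
print(self_correct(safe_logits, 0))        # unchanged -> [0.1 2.3 0.7]
unsafe_logits = np.array([3.0, 2.3, 0.7])  # top-1 is class 0 -> constraint violated
print(self_correct(unsafe_logits, 0))      # class 0 demoted just below class 1
```

Because the output is modified only when the constraint is violated, top-1 behavior is untouched on every input where the original prediction is itself permissible, mirroring the zero-accuracy-loss guarantee.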
Safe@1 accuracy provides a rigorous, deployment-relevant measure of the safety of large reasoning models when responding to risky queries. It is extensively validated on challenging benchmarks, exposes the limits of superficial safety alignment, is demonstrably improvable via targeted training, and is theoretically preserved in certified neural architectures. Limitations remain regarding the completeness of risk-category coverage and the absence of multi-turn conversational contexts, motivating further methodological innovation (Zheng et al., 26 May 2025).