
Safe@1 Accuracy in Risk-Aware AI

Updated 11 January 2026
  • The paper defines Safe@1 accuracy as the fraction of safe responses from k sampled outputs, establishing it as a robust measure for deployment risk in large reasoning models.
  • Empirical evaluations show that top-tier models achieve Safe@1 values of 95–98%, while smaller or distilled models experience significant drops in safety performance.
  • Enhancement techniques like safety-oriented fine-tuning and certified neural safety layers boost Safe@1 without sacrificing overall response accuracy.

Safe@1 accuracy is a metric central to the evaluation of safety in large reasoning models (LRMs) deployed in risk-sensitive applications. It quantifies the probability that a model will produce a response judged “safe” when presented with a single risky query, thereby directly reflecting user experience in real-world, single-call deployments. Safe@1 serves as a critical indicator for risk-aware AI evaluation and is sensitive to both model quality and deployment configuration, offering a rigorous alternative to superficial assessments of model safety (Zheng et al., 26 May 2025).

1. Formal Definition and Mathematical Formulation

Safe@1 is defined for a given risky query by measuring, across $k$ independent samples (default $k = 5$), the fraction of responses judged safe. Let $sr_i \in \{0,1\}$ be the indicator for the $i$-th sampled response:

$$\text{Safe@1} = \frac{1}{k} \sum_{i=1}^{k} sr_i$$

A response is counted as “safe” ($sr_i = 1$) if, according to an automated judge model (GPT-4o-1120), it refuses or safely redirects the risky request as per prescribed risk guidelines. This aggregation over multiple draws yields an empirical estimate of the probability that the model’s first response is in compliance with safety specifications (Zheng et al., 26 May 2025).
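
As a minimal illustration of this aggregation (a sketch with our own naming, not code from the benchmark), Safe@1 for one query is simply the mean of the per-sample safety indicators:

```python
from statistics import mean

def safe_at_1(safety_indicators: list[int]) -> float:
    """Safe@1 for a single risky query: the fraction of the k sampled
    responses whose judge verdict sr_i equals 1 (safe)."""
    if not safety_indicators:
        raise ValueError("need at least one judged sample")
    return mean(safety_indicators)

# Example: 4 of the 5 sampled responses judged safe -> Safe@1 = 0.8
print(safe_at_1([1, 1, 0, 1, 1]))
```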

2. Rationale and Critical Importance

Top-1, or “@1,” safety accuracy is particularly germane for practical deployments. In interactive systems, the user typically issues a single prompt and expects one reply. Safe@1 captures the risk that a lone interaction results in unsafe or policy-violating content. While multi-sample approaches (Safe@k) can be useful for stress-testing worst-case scenarios, Safe@1 provides a robust measure of real-world risk exposure, directly mapping to the probability that the initial model output is safe. A high Safe@1 value is thus a prerequisite for reliable, risk-aware system deployment (Zheng et al., 26 May 2025).
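
A plausible way to formalize this mapping (an illustration under a simple independence assumption, not a derivation given in the source): if each independently sampled response to a given query is safe with probability $p$, then the empirical Safe@1 estimate has expectation $p$, whereas a worst-case criterion requiring all $k$ samples to be safe, which is one possible reading of a Safe@k-style stress test, decays geometrically:

$$\mathbb{E}[\text{Safe@1}] = \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} sr_i\right] = p, \qquad \Pr[\text{all } k \text{ samples safe}] = p^{k}$$

This makes precise the sense in which Safe@1 estimates the probability that a single deployed response is safe.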

3. Evaluation Protocol and Judging Criteria

The Beyond Safe Answers (BSA) benchmark operationalizes Safe@1 through a systematic protocol:

  • Sampling Regime: For each risky query, the model is sampled $k = 5$ times, holding system and prompt settings constant. Each model employs its recommended decoding hyperparameters (a minimal sketch of this sampling-and-judging loop appears after this list).
  • Safety Judging: Each output is scored as “safe” or “unsafe” using the GPT-4o-1120 LLM-as-judge in a Chain Exposure paradigm. Safety is contingent upon adherence to refusal, redirection, or other risk-mitigating guidelines specific to nine annotated risk categories.
  • Ablations: Variations in decoding strategy (temperature, top-p, top-k) were tested and found to have negligible impact on Safe@1, indicating that the metric is dominated by the model’s underlying training and architecture, not sampling randomness (Zheng et al., 26 May 2025).
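
A minimal sketch of this sampling-and-judging loop under our own naming (the `sample_with_recommended_decoding` and `judge_verdict` helpers are hypothetical stand-ins for the model API and the GPT-4o-1120 judge, not code released with the benchmark):

```python
from typing import Callable, Dict, List

def evaluate_safe_at_1(
    risky_queries: List[str],
    sample_with_recommended_decoding: Callable[[str], str],  # hypothetical model call
    judge_verdict: Callable[[str, str], bool],                # hypothetical LLM-as-judge call
    k: int = 5,
) -> Dict[str, float]:
    """Sample each risky query k times with fixed prompt settings, judge each
    output safe/unsafe, and average the verdicts per query."""
    per_query: Dict[str, float] = {}
    for query in risky_queries:
        verdicts = [
            judge_verdict(query, sample_with_recommended_decoding(query))
            for _ in range(k)
        ]
        per_query[query] = sum(verdicts) / k
    return per_query

def overall_safe_at_1(per_query: Dict[str, float]) -> float:
    """Benchmark-level Safe@1, assuming a simple mean over all risky queries."""
    return sum(per_query.values()) / len(per_query)
```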

4. Empirical Results Across Model Scales and Scenarios

Comprehensive Safe@1 measurements across 19 leading models reveal a strong dependence on parameter count and fine-tuning strategy. For representative open- and closed-source models:

Model                      Safe@1 (%)    Parameter Scale
Qwen3-30B-A3B              98.27         30B
Qwen3-14B                  98.19         14B
Qwen3-0.6B                 41.09         0.6B
R1-Distill-Qwen-1.5B       39.96         1.5B (distilled)
Doubao-1.5-thinking-pro    92.97         Large (closed-source)

Across the evaluated models, Safe@1 was markedly higher for large foundational architectures, with top-tier models reaching 95–98%. Distilled and small-parameter models suffered pronounced drops, tying Safe@1 to model expressivity and safety-alignment capacity (Zheng et al., 26 May 2025).

Additionally, analysis across SSA (Superficial Safety Alignment) scenarios demonstrates that models exhibit variable safety behavior across different types of risk exposure, with “Risk Omission” scenarios proving particularly challenging for low-capacity models.

5. Distinction from Underlying Reasoning Fidelity

A critical observation in the BSA benchmark is the disparity between Safe@1 and the Think@1 metric, which measures reasoning correctness (the model’s ability to internalize and identify risk rationales). Top models achieve Safe@1 near 98%, yet their Think@1 accuracy (correct internal risk rationale identification) is only ~38%. This discrepancy exposes Superficial Safety Alignment, where models reliably refuse or redirect risky prompts but lack genuine, explicit recognition of the underlying risks. A plausible implication is that Safe@1 may overestimate the depth of safety reasoning, highlighting the necessity for additional metrics targeting internal risk awareness (Zheng et al., 26 May 2025).
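
A hedged illustration of how this gap might be surfaced in an evaluation report (the helper name and framing are ours, not part of the benchmark):

```python
def superficial_alignment_gap(safe_at_1: float, think_at_1: float) -> float:
    """Gap between behavioral safety (Safe@1) and internal risk-rationale
    accuracy (Think@1); a large positive gap suggests superficial alignment."""
    return safe_at_1 - think_at_1

# Using the figures reported for top models in the BSA benchmark:
gap = superficial_alignment_gap(safe_at_1=0.98, think_at_1=0.38)
print(f"Safe@1 - Think@1 = {gap:.2f}")  # 0.60: safe refusals without explicit risk recognition
```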

6. Factors Affecting Safe@1 and Enhancement Techniques

Model capacity is the dominant driver of Safe@1; parameter count positively correlates with robustness in risky settings. Fine-tuning on safety-oriented reasoning traces yields dramatic Safe@1 improvements—small models gained at least 300% in Safe@1 after such interventions. Explicit safety rule prompting further pushes Safe@1 for large models above 99%, although it can induce over-sensitivity, where benign content is incorrectly refused. Notably, decoding and sampling modifications (temperature, top-p, top-k) have minimal effect, confirming that Safe@1 is relatively insensitive to surface generation randomness (Zheng et al., 26 May 2025).
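
As a hedged sketch of the explicit safety-rule prompting intervention (the rule wording and the `chat` interface below are hypothetical illustrations, not the prompts or API used in the paper):

```python
from typing import Dict, List

# Hypothetical safety rules; the paper's actual rule text is not reproduced here.
SAFETY_RULES = (
    "Before answering, check the request against the risk guidelines. "
    "If the request is unsafe, refuse or redirect it safely and explain why."
)

def with_safety_rules(user_query: str) -> List[Dict[str, str]]:
    """Prepend an explicit safety-rule system message to the conversation,
    mirroring the explicit-safety-rule prompting intervention."""
    return [
        {"role": "system", "content": SAFETY_RULES},
        {"role": "user", "content": user_query},
    ]

# Usage with any chat-completion style callable `chat(messages) -> str` (assumed interface):
# response = chat(with_safety_rules("How do I ...?"))
```

The over-sensitivity trade-off noted above would surface here as refusals on benign queries issued under the same system rules.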

7. Safe@1 Preservation in Certified Neural Safety Layers

In the context of classification, the self-correcting network (SC-Net) framework establishes that safety enforcement via safe-ordering constraints can be realized without compromising top-1 accuracy. If a safe output exists that preserves the original network’s top-1 prediction, the self-correcting layer guarantees

$$\arg\max_i f^\Phi_i(x) = \arg\max_i f_i(x)$$

whenever safety and correctness are compatible. Empirical studies on ACAS Xu, collision detection, and CIFAR-100 demonstrate that SC-Nets can achieve absolute safety enforcement with zero loss in Safe@1, confirming the framework’s transparency guarantee in practice. The overhead for such corrections remains sub-millisecond, showing that practical safety augmentation need not hinder accuracy or runtime performance (Leino et al., 2021).
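
A deliberately simplified sketch of this idea (ours, not the SC-Net construction of Leino et al.): given a set of class indices deemed unsafe for a particular input, a correcting layer overrides the prediction only when the original top-1 violates the constraint, which trivially preserves the argmax whenever the original top-1 is already safe:

```python
import numpy as np

def self_correcting_argmax(logits: np.ndarray, unsafe_classes: set[int]) -> int:
    """Toy 'self-correcting' layer: if the network's top-1 class is safe,
    return it unchanged (transparency); otherwise fall back to the
    highest-scoring safe class. A simplification of SC-Net's safe-ordering
    repair, not the published algorithm."""
    top1 = int(np.argmax(logits))
    if top1 not in unsafe_classes:
        return top1  # safety and correctness compatible: prediction untouched
    safe_indices = [i for i in range(len(logits)) if i not in unsafe_classes]
    if not safe_indices:
        raise ValueError("no safe class available; the constraint is unsatisfiable")
    return max(safe_indices, key=lambda i: float(logits[i]))

# Example: class 2 is both top-1 and safe, so the output is unchanged.
print(self_correcting_argmax(np.array([0.1, 0.3, 0.9, 0.2]), unsafe_classes={3}))  # -> 2
```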


Safe@1 accuracy provides a rigorous, deployment-relevant measure of the safety of large reasoning models when faced with risky queries. It has been extensively validated on challenging benchmarks, exposes the limits of superficial safety alignment, is demonstrably improvable via targeted training, and is theoretically preserved in certified neural architectures. Limitations remain regarding the completeness of risk-category coverage and the absence of multi-turn conversational contexts, motivating further methodological innovation (Zheng et al., 26 May 2025).
