SafeWork-R1: Coevolving Safety and Intelligence under the AI-45$^{\circ}$ Law (2507.18576v2)

Published 24 Jul 2025 in cs.AI, cs.CL, and cs.CV

Abstract: We introduce SafeWork-R1, a cutting-edge multimodal reasoning model that demonstrates the coevolution of capabilities and safety. It is developed by our proposed SafeLadder framework, which incorporates large-scale, progressive, safety-oriented reinforcement learning post-training, supported by a suite of multi-principled verifiers. Unlike previous alignment methods such as RLHF that simply learn human preferences, SafeLadder enables SafeWork-R1 to develop intrinsic safety reasoning and self-reflection abilities, giving rise to safety `aha' moments. Notably, SafeWork-R1 achieves an average improvement of $46.54\%$ over its base model Qwen2.5-VL-72B on safety-related benchmarks without compromising general capabilities, and delivers state-of-the-art safety performance compared to leading proprietary models such as GPT-4.1 and Claude Opus 4. To further bolster its reliability, we implement two distinct inference-time intervention methods and a deliberative search mechanism, enforcing step-level verification. Finally, we further develop SafeWork-R1-InternVL3-78B, SafeWork-R1-DeepSeek-70B, and SafeWork-R1-Qwen2.5VL-7B. All resulting models demonstrate that safety and capability can co-evolve synergistically, highlighting the generalizability of our framework in building robust, reliable, and trustworthy general-purpose AI.

Summary

  • The paper presents SafeWork-R1, achieving a 46.54% boost on safety benchmarks using a verifier-guided RL approach under the AI-45° Law.
  • It adopts a staged SafeLadder framework—including CoT-SFT, M³-RL, and Safe-and-Efficient RL—with human-in-the-loop corrections for integrated safety and reasoning.
  • Empirical results demonstrate enhanced safety, value alignment, and general task performance, outperforming leading proprietary models in adversarial settings.

SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law

Introduction and Motivation

SafeWork-R1 is a multimodal reasoning model developed to address the persistent trade-off between safety and intelligence in LLMs, particularly in the context of the AI-45° Law, which posits that safety and capability should co-evolve rather than diverge. The model is built upon the SafeLadder framework, which integrates progressive, safety-oriented reinforcement learning (RL) post-training, guided by a suite of neural and rule-based verifiers. This approach is designed to internalize safety as a native capability, enabling the model to exhibit intrinsic safety reasoning and self-reflection, including the emergence of "safety aha moments": spontaneous insights indicative of deeper safety reasoning.

SafeWork-R1 demonstrates substantial improvements over its base model (Qwen2.5-VL-72B), achieving a 46.54% average increase on safety-related benchmarks without compromising general capabilities. It also outperforms leading proprietary models such as GPT-4.1 and Claude Opus 4 in safety metrics, while maintaining competitive performance on general reasoning and multimodal tasks (Figure 1).

Figure 1: Performance comparison on safety and general benchmarks.

SafeLadder Framework: Technical Roadmap

The SafeLadder framework is a staged training pipeline designed to internalize safety and capability in LLMs through the following phases:

  1. CoT-SFT (Chain-of-Thought Supervised Fine-Tuning): Instills structured, human-like reasoning using high-quality, long-chain reasoning data, validated through multi-stage filtering and cognitive diversity analysis.
  2. M³-RL (Multimodal, Multitask, Multiobjective RL): Employs a two-stage curriculum with a custom CPGD algorithm and multiobjective reward functions to jointly optimize safety, value, knowledge, and general capabilities.
  3. Safe-and-Efficient RL: Introduces the CALE (Conditional Advantage for Length-based Estimation) algorithm to promote efficient, concise reasoning, reducing the risk of unsafe intermediate steps (a hedged sketch follows this list).
  4. Deliberative Search RL: Enables the model to interact with external knowledge sources, calibrating confidence and reliability through constrained RL with dynamic reward balancing (Figure 2).

    Figure 2: The roadmap of SafeLadder.
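The exact CALE update is not reproduced in this summary. The snippet below is a minimal, hypothetical sketch of a length-conditioned advantage in a GRPO-style group setting: the length bonus is applied only to responses that already meet the group's reward baseline, so brevity is never rewarded at the expense of correctness. The function name, the `beta` coefficient, and the conditioning rule are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def length_conditioned_advantage(rewards, lengths, beta=0.1, eps=1e-6):
    """Hypothetical sketch of a length-conditioned advantage (not the paper's exact CALE).

    rewards: verifier scores for K sampled responses to the same prompt.
    lengths: token counts of those responses.
    The length bonus is applied conditionally: only responses that already score
    at or above the group mean are rewarded for being shorter, so the policy is
    not pushed to trade correctness for brevity.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Standard group-normalized advantage (GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Length term: positive for shorter-than-average responses, negative otherwise.
    len_term = -(lengths - lengths.mean()) / (lengths.std() + eps)

    # Condition the length term on the response already being "good enough".
    good = rewards >= rewards.mean()
    return adv + beta * np.where(good, len_term, 0.0)

# Example: two correct answers (reward 1.0) of different lengths and one incorrect one.
print(length_conditioned_advantage([1.0, 1.0, 0.0], [120, 480, 300]))
```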

This framework is supported by a scalable RL infrastructure (SafeWork-T1) that enables verifier-agnostic, high-throughput training across thousands of GPUs (Figure 3).

Figure 3: System layer overview of SafeWork-T1, supporting scalable RL with verifiable rewards.

Verifier Suite: Safety, Value, and Knowledge

Safety Verifier

A bilingual, multimodal verifier trained on 45k high-quality samples, covering 10 major categories and 400 subcategories of safety risk. It achieves leading accuracy and F1 scores on public and proprietary safety benchmarks, outperforming both open-source and proprietary baselines.

Value Verifier

An interpretable, multimodal reward model trained on 80k samples across 70+ value-related scenarios. It provides both binary (good/bad) judgments and CoT-style reasoning, achieving state-of-the-art performance on value alignment benchmarks (Figure 4).

Figure 4: Data construction pipeline and value taxonomy visualization for the Value Verifier.
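To make the "binary judgment plus CoT-style rationale" output concrete, here is a hypothetical output schema such a verifier might emit; the field names and scenario label are assumptions for illustration, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class ValueJudgment:
    """Hypothetical output schema for a verifier that returns both a binary
    judgment and a CoT-style rationale (field names are illustrative)."""
    good: bool        # binary good/bad judgment usable as an RL reward signal
    rationale: str    # CoT-style explanation that makes the judgment interpretable
    scenario: str     # which value-related scenario the case falls under

# Example judgment as such a verifier might emit it.
example = ValueJudgment(
    good=False,
    rationale="The response encourages deceptive behavior toward a vulnerable user.",
    scenario="honesty",
)
print(example)
```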

Knowledge Verifier

Designed to penalize speculative, low-confidence correct answers, the knowledge verifier is trained on 120k multimodal STEM questions. It labels responses by correctness and confidence, directly addressing the issue of "lucky guesses" in RLVR paradigms (Figure 5).

Figure 5: The development workflow of the knowledge verifier, penalizing low-confidence correct answers.
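As an illustration of how "lucky guesses" can be penalized, here is a hedged sketch of a confidence-aware reward rule; the specific labels and reward values are assumptions, not the paper's exact scheme.

```python
def confidence_aware_reward(correct: bool, confident: bool) -> float:
    """Hypothetical reward shaping in the spirit of the knowledge verifier
    (the paper's exact labels and values may differ).

    The key idea: a correct answer reached with low confidence -- a "lucky
    guess" -- is not rewarded like a confidently correct one.
    """
    if correct and confident:
        return 1.0    # well-grounded correct answer
    if correct and not confident:
        return 0.2    # lucky guess: correct, but penalized relative to confident answers
    if not correct and confident:
        return -1.0   # confidently wrong: worst case for reliability
    return -0.2       # wrong but appropriately uncertain
```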

M³-RL: Multimodal, Multitask, Multiobjective RL

M³-RL is a two-stage RL framework:

  • Stage 1: Focuses on general capability enhancement.
  • Stage 2: Jointly optimizes safety, value, and general capability using a mixed reward function.

The CPGD algorithm ensures stable policy updates, while the multiobjective reward function balances visual grounding, helpfulness, format, and task-aware objectives (Figure 6).

Figure 6: Overview of the M³-RL training framework.
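As a rough illustration of how per-objective verifier scores might be folded into a single scalar reward, consider the sketch below; the weights and objective names are placeholder assumptions, and the paper's actual mixing (including any task-aware routing) may differ.

```python
# Hypothetical weights for combining verifier signals into a single scalar reward;
# the paper's actual mixing scheme may differ.
WEIGHTS = {"safety": 0.3, "value": 0.2, "knowledge": 0.3, "format": 0.1, "grounding": 0.1}

def mixed_reward(scores: dict) -> float:
    """Weighted combination of per-objective verifier scores in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example: a response that is safe and well-grounded but weak on knowledge confidence.
print(mixed_reward({"safety": 1.0, "value": 0.9, "knowledge": 0.4,
                    "format": 1.0, "grounding": 0.8}))
```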

Data augmentation strategies, including multimodal jailbreaks, further improve robustness against adversarial inputs (Figure 7).

Figure 7: M³-RL data augmentation pipeline.

Inference-Time Interventions

SafeWork-R1 incorporates two inference-time alignment mechanisms:

  1. Automated Intervention via Principled Value Model (PVM) Guidance: At each generation step, candidate continuations are scored by specialized value models (safety, value, knowledge), weighted by a context-specific routing vector. This enables dynamic, step-level alignment with safety and value principles (Figure 8); a minimal sketch of this step-level selection appears after the list.

    Figure 8: PVM guidance mechanism for inference-time alignment.

  2. Human-in-the-Loop CoT Editing: Users can directly edit the model's chain-of-thought, with the system tracking and integrating corrections to improve future responses and generalization (Figure 9).

    Figure 9: Framework of human intervention on CoT.
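The sketch below illustrates the step-level selection described in item 1: candidate continuations are scored by several value models and combined via a routing vector. Function names, the argmax selection rule, and the example weights are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def guided_step(candidates, pvm_scores, routing_weights):
    """Hypothetical sketch of PVM-guided decoding for one generation step.

    candidates:      list of candidate continuations for the current step
    pvm_scores:      array of shape (num_candidates, num_pvms), one column per
                     principled value model (e.g. safety, value, knowledge)
    routing_weights: context-specific weights over the PVMs
    """
    pvm_scores = np.asarray(pvm_scores, dtype=float)
    weights = np.asarray(routing_weights, dtype=float)
    combined = pvm_scores @ weights          # weighted score per candidate
    return candidates[int(np.argmax(combined))]

# Example: three candidate continuations scored by safety, value, and knowledge PVMs,
# with the router emphasizing safety for this context.
candidates = ["refuse and explain the risk", "comply verbatim", "ask a clarifying question"]
scores = [[0.9, 0.8, 0.6], [0.1, 0.3, 0.9], [0.8, 0.7, 0.7]]
print(guided_step(candidates, scores, [0.6, 0.2, 0.2]))
```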

Empirical Results

Safety and Value Alignment

SafeWork-R1 achieves an average safety rate of 89.2% across four multimodal safety benchmarks, outperforming GPT-4.1 and Claude Opus 4. On value alignment (FLAMES), it scores 65.3%, a 26.2% improvement over its base model.

General Capability

On general reasoning benchmarks (MMMU, MathVista, Olympiad, GPQA Diamond, GAOKAO-MM), SafeWork-R1 consistently outperforms its base model and is competitive with leading proprietary models, demonstrating that safety improvements do not compromise general intelligence.

Efficient Reasoning and Safety Aha Moments

Safe-and-Efficient RL not only improves token efficiency but also enhances safety and value alignment. Information-theoretic analysis reveals the emergence of mutual information (MI) peaks at safety-relevant tokens, indicating that safety signals are being internalized in the model's intermediate representations (Figure 10).

Figure 10: (a) Illustration of safety mutual information peaks. (b) Distribution of tokens at MI peaks for SafeWork-R1-Qwen2.5VL-7B.


Figure 11: (a) Token efficiency comparison. (b) Safety and value performance with/without efficient reasoning.


Figure 12: Frequency of tokens at Safety MI peaks under different training regimes.
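This summary does not specify the MI estimator behind the peak analysis. The sketch below shows one plausible way to locate per-token MI peaks between intermediate representations and a binary safety label, assuming aligned token positions and using scikit-learn's mutual_info_classif; the paper's estimator, layer choice, and alignment procedure may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def per_token_mi(hidden_states, safety_labels):
    """Hypothetical sketch of locating safety MI peaks.

    hidden_states: array (num_samples, num_tokens, hidden_dim) of intermediate
                   representations at aligned token positions.
    safety_labels: binary array (num_samples,), e.g. safe vs. unsafe prompt.
    Returns the mean estimated MI between each token position's representation
    and the safety label; peaks suggest positions where safety is encoded.
    """
    n, t, d = hidden_states.shape
    mi_per_position = np.empty(t)
    for pos in range(t):
        mi = mutual_info_classif(hidden_states[:, pos, :], safety_labels)
        mi_per_position[pos] = mi.mean()
    return mi_per_position

# Example with random data just to show the call shape.
rng = np.random.default_rng(0)
print(per_token_mi(rng.normal(size=(64, 8, 16)), rng.integers(0, 2, size=64)).round(3))
```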

Red Teaming and Jailbreak Robustness

SafeWork-R1 demonstrates high harmless response rates (HRR) under both single-turn and multi-turn jailbreak attacks, matching or exceeding the safety of leading proprietary models.

Search with Calibration

In knowledge-intensive tasks requiring external search, SafeWork-R1 maintains high reliability and low false-certain rates, outperforming both open-source and proprietary baselines in reliability metrics.
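For concreteness, a hedged sketch of how reliability and the false-certain rate could be computed from per-question outcomes follows; these metric definitions are illustrative assumptions, not necessarily the paper's.

```python
def reliability_metrics(records):
    """Hypothetical computation of reliability and false-certain rate.

    records: list of (correct: bool, certain: bool) pairs, where `certain`
             means the model expressed confidence rather than deferring or
             flagging uncertainty after its search.
    """
    n = len(records)
    false_certain = sum(1 for correct, certain in records if certain and not correct)
    reliable = sum(1 for correct, certain in records
                   if (certain and correct) or (not certain and not correct))
    return {
        "false_certain_rate": false_certain / n,   # confidently wrong answers
        "reliability": reliable / n,               # confidence matches correctness
    }

# Example: one confidently wrong answer out of four questions.
print(reliability_metrics([(True, True), (False, False), (False, True), (True, True)]))
```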

Human Evaluation

Human studies confirm that SafeWork-R1 is perceived as more trustworthy, rational, and less prone to negative or deceptive communication strategies compared to other models (Figure 13).

Figure 13: Distribution of all models in different dimensions of the human evaluation framework.

RL Infrastructure: SafeWork-T1

The SafeWork-T1 platform enables efficient, scalable RL with verifiable rewards, supporting dynamic colocation of training, rollout, and verification workloads. It achieves over 30% higher throughput than existing RLHF frameworks and supports rapid prototyping of new verifiers and reward models (Figure 14).

Figure 14: RLVR training pipeline of SafeWork-T1, featuring universal colocation and dynamic data balance.

Implications and Future Directions

SafeWork-R1 demonstrates that safety and general capability can be co-optimized in large-scale, multimodal LLMs. The staged, verifier-guided RL approach enables the emergence of intrinsic safety reasoning without sacrificing general intelligence. The integration of inference-time alignment and human-in-the-loop correction further enhances real-world trustworthiness and adaptability.

Key implications include:

  • Scalability: The SafeLadder framework generalizes across model architectures and scales, supporting both language-only and multimodal models.
  • Efficiency-Safety Synergy: Efficient reasoning protocols not only reduce computational cost but also improve safety and value alignment.
  • Internalization of Safety: Information-theoretic analysis provides evidence that safety is being encoded in the model's intermediate representations, opening avenues for explainable safety in LLMs.
  • Infrastructure: The RLVR platform resolves the efficiency-flexibility trade-off in RLHF, enabling rapid development and deployment of safer, more reliable AI systems.

Future work will focus on expanding error correction and generalization capabilities, developing user-aligned adaptation techniques, and further investigating the linguistic and communicative strategies that underpin trustworthy AI interactions.

Conclusion

SafeWork-R1, developed under the SafeLadder framework, establishes a new paradigm for the coevolution of safety and intelligence in large-scale LLMs. Through progressive, verifier-guided RL and advanced inference-time interventions, it achieves state-of-the-art safety without compromising general capability. The empirical and analytical results provide a foundation for future research into scalable, trustworthy, and explainable AI systems, with broad implications for the deployment of safe and reliable general-purpose AI.