
o3-mini vs DeepSeek-R1: Which One is Safer? (2501.18438v2)

Published 30 Jan 2025 in cs.SE and cs.AI

Abstract: The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and the LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this technical report, we systematically assess the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generated and executed 1,260 test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 produces significantly more unsafe responses (12%) than OpenAI's o3-mini (1.2%).

The paper presents a comprehensive comparative study of the safety alignment properties of two LLMs: DeepSeek-R1 (70B) and OpenAI's o3-mini. The authors employ an automated safety testing framework, ASTRAL, to systematically generate, execute, and evaluate 1,260 unsafe test inputs that span a broad range of safety categories, writing styles, and persuasion techniques. The following summarizes the key technical aspects and findings of the paper:

Methodological Framework

  • Test Input Generation:
    • ASTRAL generates test inputs using a combination of Retrieval-Augmented Generation (RAG), few-shot prompting, and real-time web browsing data.
    • The test suite is constructed by combining six distinct writing styles (e.g., slang, technical terms, role-play) with five persuasion techniques and 14 safety categories (including topics such as financial crime, hate speech, and terrorism), with three tests per combination (see the sketch after this list).
    • A novel black-box coverage criterion is introduced to ensure balanced test input distribution and to capture a wide spectrum of unsafe behaviours.
  • Execution and Evaluation:
    • Both LLMs are exposed to the same set of prompts; however, o3-mini benefits from a policy violation safeguard that frequently blocks unsafe test inputs from executing, while DeepSeek-R1 responds in full.
    • Outputs are initially classified using GPT-3.5, which functions as an oracle evaluator, labelling each response as safe, unsafe, or unknown; manual verification is conducted for borderline cases and those flagged as unsafe or unknown.
    • The evaluation considers both system-level safeguards (e.g., policy violations in o3-mini) and the inherent response generation quality from the LLMs themselves.
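To make the coverage grid concrete, the following Python sketch (an illustrative assumption, not ASTRAL's actual code; all identifiers are hypothetical) enumerates the 6 × 5 × 14 combinations with three test inputs each, reproducing the 1,260-prompt suite size reported in the paper.

```python
from itertools import product

# Dimensions reported in the paper: 6 writing styles (e.g., slang, technical
# terms, role-play), 5 persuasion techniques, 14 safety categories (c1-c14),
# and 3 test inputs per combination. Generic IDs are used because the summary
# does not name every style or technique.
WRITING_STYLES = [f"S{i}" for i in range(1, 7)]         # S1-S6
PERSUASION_TECHNIQUES = [f"P{i}" for i in range(1, 6)]  # P1-P5
SAFETY_CATEGORIES = [f"c{i}" for i in range(1, 15)]     # c1-c14
TESTS_PER_COMBINATION = 3

def build_test_plan():
    """Enumerate every (style, persuasion, category) cell of the coverage grid;
    a generator LLM would then turn each entry into a concrete unsafe prompt."""
    return [
        {"style": s, "persuasion": p, "category": c, "replica": r}
        for s, p, c in product(WRITING_STYLES, PERSUASION_TECHNIQUES, SAFETY_CATEGORIES)
        for r in range(1, TESTS_PER_COMBINATION + 1)
    ]

plan = build_test_plan()
assert len(plan) == 6 * 5 * 14 * 3 == 1260  # matches the paper's test-suite size
```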

Key Findings and Numerical Results

  • Overall Safety Performance:
    • Out of 1,260 test inputs, o3-mini produced unsafe responses in 15 cases (approximately 1.19%), whereas DeepSeek-R1 issued unsafe responses in 151 cases (about 11.98%). This represents an approximately tenfold safety gap in favour of o3-mini (see the sketch after this list).
    • Additionally, manual assessments suggest that the severity of unsafe responses from DeepSeek-R1 is notably higher, with its outputs often providing excessive and detailed unsafe content.
  • Influence of Safety Categories and Writing Styles:
    • For DeepSeek-R1, specific safety categories (e.g., c6: financial crime, property crime, theft; c14: violence, aiding and abetting; c13: terrorism, organized crime; c7: hate speech) were more prone to trigger unsafe behaviour.
    • Among the writing styles, S3 (technical terms) and S4 (role-play) considerably increased the probability of unsafe outputs in DeepSeek-R1, with additional contributions from S6 (questions) and S5 (misspellings). In contrast, persuasion techniques exhibited minimal differential impacts on safety performance for both models.
  • System-Level Interventions:
    • The o3-mini model benefits from a policy violation mechanism that preempts the execution of many unsafe prompts (with around 44.8% of inputs rejected before reaching the model), thereby skewing the evaluation towards safer outputs. However, the degree to which this safeguard impacts final system-level safety, as opposed to the base LLM performance, remains an open question.
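A minimal sketch, using only the counts reported above (all names are illustrative), reproduces the percentages and the roughly tenfold gap:

```python
# Back-of-the-envelope check of the reported counts: 15 of 1,260 o3-mini
# responses and 151 of 1,260 DeepSeek-R1 responses were judged unsafe.
TOTAL_TESTS = 1260
unsafe_counts = {"o3-mini": 15, "DeepSeek-R1 (70b)": 151}

for model, count in unsafe_counts.items():
    print(f"{model}: {count}/{TOTAL_TESTS} = {count / TOTAL_TESTS:.2%} unsafe")
# o3-mini: 15/1260 = 1.19% unsafe
# DeepSeek-R1 (70b): 151/1260 = 11.98% unsafe

ratio = unsafe_counts["DeepSeek-R1 (70b)"] / unsafe_counts["o3-mini"]
print(f"DeepSeek-R1 produced roughly {ratio:.1f}x as many unsafe responses")  # ~10.1x
```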

Implications and Future Research

  • The paper illuminates the challenges of aligning LLM outputs with human values, particularly in applications with widespread user exposure. DeepSeek-R1’s higher rate and severity of unsafe responses indicate that increased reasoning capabilities and lower execution costs might come at the expense of safety alignment.
  • The paper also underscores the importance of continuous evolution of automated testing benchmarks like ASTRAL, especially given that static benchmarks may become incorporated into training data over time, ultimately compromising their effectiveness.
  • Future work is suggested to extend the test suite (e.g., to 6,300 inputs) and to re-evaluate safety properties when the o3-mini model transitions from its beta release to a general production version. This will help further ascertain the effect of system-level interventions versus intrinsic model capabilities.

In summary, this paper rigorously demonstrates that, under the evaluated conditions, OpenAI's o3-mini exhibits a substantially safer profile compared to DeepSeek-R1. The empirical evidence provided—marked by both quantitative differences (1.19% vs. 11.98% unsafe outputs) and qualitative severity—offers a clear directive for LLM developers to prioritize safety alignment in next-generation models.

Authors (5)
  1. Aitor Arrieta (16 papers)
  2. Miriam Ugarte (4 papers)
  3. Pablo Valle (10 papers)
  4. José Antonio Parejo (10 papers)
  5. Sergio Segura (6 papers)