o3-mini vs DeepSeek-R1: Which One is Safer?
Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura
2025-01-31

Summary
This paper compares two advanced AI language models, DeepSeek-R1 and OpenAI's o3-mini, to determine which is safer to use. The researchers used an automated tool to test how these AIs respond to potentially unsafe requests.
What's the problem?
As AI language models become more powerful, it's important to make sure they are also safe to use. DeepSeek-R1, for example, performs very well on many tasks, but we need to know whether new models like it will follow the rules and behave safely, especially when compared to well-known models like OpenAI's o3-mini.
What's the solution?
The researchers used a tool they created, called ASTRAL, to test both DeepSeek-R1 and o3-mini. They gave each AI the same 1,260 tricky or unsafe requests and then carefully reviewed the answers to determine which model was safer. They found that DeepSeek-R1 gave unsafe answers about 12% of the time, while o3-mini did so only about 1% of the time.
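The overall testing procedure can be pictured as a simple loop: generate unsafe prompts, send them to each model, and count how often the model's answer is unsafe. Below is a minimal sketch of that idea in Python. It is a hypothetical illustration, not ASTRAL's actual code: the topic list, the mock model, and the keyword-based classifier are all stand-ins (the paper used 1,260 generated prompts and a semi-automated assessment of the outputs).

```python
# Hypothetical sketch of an ASTRAL-style safety-testing loop (NOT the real
# ASTRAL implementation): generate unsafe prompts from topic/style templates,
# query a model, and tally the fraction of unsafe responses.

UNSAFE_TOPICS = ["making weapons", "self-harm", "fraud"]   # illustrative only
STYLES = ["direct question", "role-play scenario"]         # illustrative only

def generate_test_inputs(topics, styles):
    """Combine each unsafe topic with each prompt style into a test input."""
    return [f"({style}) Explain {topic}." for topic in topics for style in styles]

def mock_model(prompt):
    """Stand-in for a real LLM call; refuses most, but not all, prompts."""
    if "fraud" in prompt:
        return "Sure, here is how you could do it..."  # unsafe answer
    return "I can't help with that."                   # safe refusal

def is_unsafe(response):
    """Toy classifier; the actual study used semi-automated human assessment."""
    return not response.lower().startswith("i can't")

def unsafe_rate(model, test_inputs):
    """Fraction of test inputs that produced an unsafe response."""
    unsafe = sum(is_unsafe(model(p)) for p in test_inputs)
    return unsafe / len(test_inputs)

inputs = generate_test_inputs(UNSAFE_TOPICS, STYLES)
print(f"{len(inputs)} test inputs, unsafe rate: {unsafe_rate(mock_model, inputs):.2%}")
```

Running the same set of generated inputs against two different models and comparing the resulting unsafe rates is, at a high level, how the 11.98% versus 1.19% comparison in the paper was produced.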
Why does it matter?
This matters because as AI becomes more common in our daily lives, we need to make sure it's safe to use. The study shows that even though DeepSeek-R1 is really good at many tasks, it might not be as safe as o3-mini. This information can help companies and researchers improve AI safety, and it can also help people decide which AI models to trust and use. It's like a safety check for AI, making sure that as these systems get smarter, they also become more responsible.
Abstract
The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1,260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 responded unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.