Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
2025-04-03
Summary
This paper takes a closer look at how AI models can learn to reason better using reinforcement-learning training similar to that of DeepSeek-R1-Zero, focusing on which parts of the training process matter most.
What's the problem?
It's not entirely clear why the R1-Zero training method works so well for improving AI reasoning, and what specific parts of the process contribute the most to its success.
What's the solution?
The researchers analyzed different AI models trained with the R1-Zero method, looking at how the models' initial characteristics and the training process itself affected performance. They also developed a new, more token-efficient training method (Dr. GRPO) that fixes an optimization bias they found.
Why it matters?
This work matters because it helps us better understand how to train AI models to reason more effectively, which could lead to improvements in many areas, such as problem-solving and decision-making.
Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an ''Aha moment'', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
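The optimization bias the abstract describes can be sketched in NumPy. This is a hedged illustration, not the authors' implementation (see their repository for that): it uses a plain policy-gradient surrogate with no clipping or KL term, and the function names are illustrative. The two differences shown are the ones the abstract points to: GRPO normalizes group advantages by the reward standard deviation and averages the loss per token within each response, which down-weights the per-token penalty on long incorrect answers; Dr. GRPO drops both normalizations.

```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO: normalize group-relative rewards by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dr_grpo_advantages(rewards):
    # Dr. GRPO: drop the std division; keep only the group-mean baseline.
    return rewards - rewards.mean()

def grpo_loss(token_logps, advantages, lengths):
    # GRPO averages token log-probs within each response (divide by |o_i|),
    # so a long wrong answer incurs a smaller per-token penalty -- the
    # length bias the paper identifies.
    per_resp = np.array([lp.sum() / n for lp, n in zip(token_logps, lengths)])
    return -(per_resp * advantages).mean()

def dr_grpo_loss(token_logps, advantages):
    # Dr. GRPO sums token log-probs with no per-response length normalization.
    per_resp = np.array([lp.sum() for lp in token_logps])
    return -(per_resp * advantages).mean()

# Toy group of 4 sampled responses to one question (1 = correct, 0 = wrong)
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))     # std-normalized: roughly [1, -1, 1, -1]
print(dr_grpo_advantages(rewards))  # unbiased:        [0.5, -0.5, 0.5, -0.5]
```

The point of the comparison is that under the GRPO aggregation, lengthening an incorrect (negative-advantage) response reduces its per-token loss contribution, so training drifts toward longer wrong outputs; the Dr. GRPO form removes that incentive.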