
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

2025-04-01


Summary

This paper introduces Open-Reasoner-Zero, an open-source recipe for training base language models to reason via large-scale reinforcement learning, with a focus on scalability, simplicity, and accessibility.

What's the problem?

Training language models to reason well through reinforcement learning is difficult and expensive, and the strongest existing pipelines (such as DeepSeek-R1-Zero) have not been released as open source, making their results hard to reproduce and build on.

What's the solution?

The researchers created Open-Reasoner-Zero, which uses a deliberately minimalist recipe: vanilla PPO with GAE (λ=1, γ=1), straightforward rule-based rewards, and no KL regularization. Starting from the same base model as DeepSeek-R1-Zero-Qwen-32B, this setup scales up both response length and benchmark performance while needing only about a tenth of the training steps of the DeepSeek-R1-Zero pipeline.
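To make "straightforward rule-based rewards" concrete, here is a minimal sketch of a binary correctness reward for math problems. The function name and the exact answer-extraction rule (matching a LaTeX \boxed{...} expression) are assumptions for illustration; the paper's released code defines the actual reward logic.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward sketch: 1.0 if the model's final
    boxed answer exactly matches the reference answer, else 0.0.
    The real extraction and matching rules may be more elaborate."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no answer found in the expected format
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

Rewards like this require no learned reward model, which is part of what keeps the training pipeline simple and scalable.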

Why it matters?

By releasing the source code, parameter settings, training data, and model weights across multiple model sizes, this work makes large-scale reasoning-oriented RL training more accessible, reproducible, and efficient for researchers and developers.

Abstract

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (lambda=1, gamma=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency -- requiring only a tenth of the training steps, compared to DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
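The GAE setting the abstract highlights (λ=1, γ=1) has a clean interpretation: the advantage reduces to the undiscounted Monte Carlo return minus the value baseline. A minimal sketch of the standard GAE recursion, with a zero bootstrap value after the final step assumed for illustration:

```python
def gae_advantages(rewards, values, lam=1.0, gamma=1.0):
    """Generalized Advantage Estimation (Schulman et al.).
    With lam=gamma=1, as reported in the paper, each advantage equals
    the sum of all future rewards minus the state's value estimate.
    Assumes the bootstrap value after the last step is 0."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae  # accumulate backwards
        advantages[t] = gae
    return advantages
```

For a sparse terminal reward of 1 (e.g. a correct final answer), every token in the trajectory receives the same advantage 1 - V(s_t), which matches the simple rule-based reward setup above.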