SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He

2025-03-25

Summary

This paper explores improving AI models' ability to reason and solve problems using a method called zero reinforcement learning, which applies reinforcement learning directly to existing base models rather than requiring an extra supervised fine-tuning stage first.

What's the problem?

Many AI models struggle with complex reasoning tasks, and improving them typically requires expensive and time-consuming additional training stages. Previous successful attempts at zero reinforcement learning were largely limited to a single model family (Qwen2.5), so it wasn't clear whether the method would work for other types of AI models.

What's the solution?

The researchers tested zero reinforcement learning on ten different AI models of various sizes and types. They made some key adjustments to the training process, like changing how the AI is rewarded for good answers and controlling the difficulty of the questions it's asked. They carefully watched how each AI model improved during training and noticed that different models learned in different ways.
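One of the key adjustments mentioned above, changing how the AI is rewarded, refers to the rule-based rewards described in the abstract (a format check plus answer correctness). The sketch below is a hypothetical illustration of that idea, not the authors' actual implementation; the function name, penalty values, and the assumption that answers appear in `\boxed{...}` are all assumptions for the example:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward: a format check plus exact-match correctness.

    This is an illustrative sketch of the general idea, not the paper's
    actual reward function; the penalty/reward values are assumed.
    """
    # Format reward: expect the final answer wrapped in \boxed{...}
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return -0.5  # mild penalty when the expected format is missing
    # Correctness reward: compare the extracted answer to the reference
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

print(rule_based_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(rule_based_reward("the result is 42", "42"))                  # -0.5
```

In practice, how strict this format check is matters: the paper notes that adjusting the format reward (e.g., relaxing penalties for base models that don't yet follow instructions well) was one of the design choices needed to make zero RL training work across different model families.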

Why it matters?

This work matters because it shows that zero reinforcement learning can be used to improve many different types of AI models, making them better at reasoning and problem-solving without the need for expensive training from scratch. It also helps us understand how different AI models learn, which could lead to even better training methods in the future. By sharing their findings and tools, the researchers are helping other scientists continue this important work.

Abstract

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.