JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
2025-12-19
Summary
This paper investigates whether the increasingly complex methods used to train large language models with reinforcement learning are actually necessary to achieve good results.
What's the problem?
Researchers have been making reinforcement learning for large language models more and more complicated, using multiple training stages, constantly changing settings, and carefully planned learning schedules. The core issue is that no one has really stopped to ask if all this extra effort is *actually* improving performance, or if it's just adding complexity for complexity's sake. It's possible these complex methods are trying to fix problems that wouldn't exist with a simpler, more stable approach.
What's the solution?
The authors developed a method called JustRL, which is remarkably simple. It uses a single training stage with fixed hyperparameters, meaning nothing is changed or scheduled during training. Surprisingly, JustRL achieved state-of-the-art performance on two different 1.5B-parameter reasoning models when tested on nine mathematical benchmarks, and it did so using half the compute of more complex methods. The same settings transferred to both models without any tuning. The authors also found that commonly used 'tricks', such as explicit length penalties, could actually *hurt* performance by collapsing exploration, suggesting that a stable, well-scaled baseline is all that's needed.
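To make the "one stage, fixed settings" idea concrete, here is a minimal sketch of what such a recipe can look like. This is an illustration, not the authors' released code: the hyperparameter values are hypothetical, and the group-relative advantage shown (normalizing each sampled solution's reward against its own sampling group, in the style of GRPO-like methods) is an assumed choice of algorithm.

```python
# Minimal sketch of a single-stage RL recipe with fixed hyperparameters.
# NOTE: hypothetical values and a GRPO-style advantage are assumptions
# for illustration; this is not the JustRL implementation.
from statistics import mean, pstdev

# Fixed hyperparameters: set once at the start, never changed mid-run
# (no warmup stages, no schedules, no curriculum switches).
CONFIG = {"learning_rate": 1e-6, "group_size": 8, "clip_eps": 0.2}

def group_relative_advantages(rewards):
    """Normalize each reward against its own sampling group.

    Each math problem gets `group_size` sampled solutions; a solution's
    advantage is how much better its reward is than the group average,
    in units of the group's standard deviation.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # all samples scored the same -> no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: binary correctness rewards for 8 sampled solutions to one problem.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
```

The point of the sketch is what is *absent*: there is no stage counter, no learning-rate decay, and no length penalty in the reward, mirroring the paper's claim that a stable baseline makes those interventions unnecessary.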
Why it matters?
This work is important because it challenges the current trend of increasing complexity in reinforcement learning for language models. It suggests that we might be wasting resources and effort on unnecessary techniques. By providing a simple, effective baseline, JustRL gives researchers a starting point for future work and encourages a focus on fundamental scaling and stability rather than elaborate training procedures.
Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.