Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2025-12-17
Summary
This paper introduces a new method, called Cascade RL, for training powerful AI models that can reason and solve problems across many different areas, like coding and complex problem-solving.
What's the problem?
Training AI to be generally capable is hard because different tasks require the AI to respond in very different ways: sometimes a short answer is enough, while other times it needs to think through a long, detailed process. This variation makes it difficult to build the training infrastructure and to pick good settings for learning, which slows down the whole process and makes it hard to improve the AI's abilities step by step.
What's the solution?
Instead of trying to train the AI on everything at once, the researchers used a step-by-step approach called Cascade RL. They first trained the model to follow instructions well using human feedback (RLHF), and then trained it on specific areas, one at a time, using reinforcement learning with verifiable rewards (RLVR) — the AI earns a reward in each area when its answers can be checked and confirmed correct. By focusing on one area at a time, they simplified the training process and made it more effective; skills learned in earlier stages were rarely lost and sometimes even improved in later ones.
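The sequential idea can be pictured with a small toy sketch. This is not the paper's implementation — the stage names, the trivial "model" dictionary, and the constant rewards are all illustrative assumptions; it only shows the control flow of running an alignment stage first, then one domain at a time, rather than blending all domains into one mixed batch:

```python
# Toy sketch of cascaded domain-wise RL (Cascade RL).
# Assumptions: hypothetical stage names, a dict standing in for a model,
# and dummy reward functions in place of real preference/verifier rewards.

def run_stage(model, stage, steps=3):
    """Run one RL stage: only this stage's reward shapes the updates."""
    for _ in range(steps):
        reward = stage["reward_fn"](model)  # score one rollout (dummy here)
        model["skill"][stage["name"]] = model["skill"].get(stage["name"], 0.0) + reward
    model["history"].append(stage["name"])  # record the order stages ran in
    return model

def cascade_rl(model, stages):
    """Stages run strictly one after another, never mixed in a single batch."""
    for stage in stages:
        model = run_stage(model, stage)
    return model

# Hypothetical stage order: alignment via RLHF first, then domain-wise RLVR.
stages = [
    {"name": "rlhf_alignment", "reward_fn": lambda m: 1.0},  # preference reward
    {"name": "math",           "reward_fn": lambda m: 1.0},  # verifiable reward
    {"name": "code",           "reward_fn": lambda m: 1.0},  # verifiable reward
]

model = cascade_rl({"skill": {}, "history": []}, stages)
print(model["history"])  # → ['rlhf_alignment', 'math', 'code']
```

The key design choice this sketch mirrors is that each stage sees only one domain's prompts and rewards, so response lengths, verification latency, and hyperparameters can be tuned per stage instead of compromising across all domains at once.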
Why it matters?
This research is important because it shows a practical way to build AI models that are genuinely strong across a wide range of tasks. The resulting model, Nemotron-Cascade, performs very well on challenging benchmarks, even beating its original teacher model on coding benchmarks and achieving silver-medal-level results in competitive programming. The training and data recipes are also shared openly, allowing other researchers to build on this work and create even more capable AI systems.
Abstract
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.