
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani

2026-04-30


Summary

This paper focuses on speeding up the process of training large language models using reinforcement learning, specifically addressing a major slowdown caused by generating text samples (rollouts) needed for the learning process.

What's the problem?

When you're trying to improve a powerful language model with reinforcement learning, a major bottleneck is generating the text samples the model learns from. The model produces these samples one token at a time, which is slow, especially for very large models. Existing methods to speed this up often change how the model learns or how the samples are generated, which can affect the quality of the learning.
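To make the cost concrete, here is a toy Python sketch (the "model" is a hypothetical stand-in, not an actual LLM) showing why autoregressive generation is sequential: producing N tokens requires N model calls, one after another, because each token depends on all the previous ones.

```python
# Toy illustration of the rollout bottleneck: each new token requires a
# fresh model call on the sequence so far, so a rollout of N tokens costs
# N sequential calls. `toy_model` is an arbitrary stand-in, not a real LLM.

def toy_model(tokens):
    """Stand-in for an LLM forward pass: returns the next token id."""
    return (sum(tokens) + 1) % 100  # arbitrary deterministic rule

def autoregressive_rollout(prompt, num_tokens):
    tokens = list(prompt)
    calls = 0
    for _ in range(num_tokens):
        tokens.append(toy_model(tokens))  # one model call per new token
        calls += 1
    return tokens, calls

out, calls = autoregressive_rollout([1, 2, 3], 5)
# `calls` grows linearly with rollout length; with a frontier-scale model,
# each of those calls is an expensive forward pass.
```

For long reasoning rollouts this sequential chain of expensive forward passes is exactly what the paper targets.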

What's the solution?

The researchers applied a technique called 'speculative decoding' to accelerate rollout generation. Think of it like having a faster, smaller model quickly *draft* a possible continuation of the text, which the main, powerful model then checks and corrects. Because the main model verifies every token, the final output distribution is unchanged, so learning quality is not compromised. They built this into a system called NeMo-RL and showed it works with different kinds of 'draft' models, even ones designed for other purposes. They demonstrated a 1.8x rollout speedup with an 8 billion parameter model and, using a performance simulator, project up to a 2.5x end-to-end training speedup with a much larger 235 billion parameter model when speculative decoding is combined with asynchronous RL.
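The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal greedy-decoding variant with toy callables standing in for the draft and target models (the names and the toy token rules are illustrative, not from the paper); the property it demonstrates is the lossless one: the output is identical to decoding with the target model alone, because the target decides every kept token.

```python
# Minimal sketch of speculative decoding (greedy variant). The draft model
# proposes k tokens cheaply; the target verifies them and keeps the longest
# agreeing prefix. Output matches pure target-model decoding exactly.

def draft_propose(draft, tokens, k):
    """Draft model proposes k tokens autoregressively (cheap calls)."""
    ctx = list(tokens)
    for _ in range(k):
        ctx.append(draft(ctx))
    return ctx[len(tokens):]

def verify_step(target, tokens, proposal):
    """Target checks the proposal: keep tokens while they agree, emit the
    target's own token at the first disagreement (or one bonus token if
    everything was accepted). In a real system these checks happen in a
    single batched forward pass, not a Python loop."""
    ctx = list(tokens)
    for t in proposal:
        expected = target(ctx)
        ctx.append(expected)          # always keep the target's choice
        if t != expected:             # reject: stop at the correction
            return ctx
    ctx.append(target(ctx))           # all accepted: free bonus token
    return ctx

def speculative_decode(target, draft, prompt, n, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n:
        proposal = draft_propose(draft, tokens, k)
        tokens = verify_step(target, tokens, proposal)
    return tokens[:len(prompt) + n]

def greedy_decode(model, prompt, n):
    """Plain one-token-at-a-time decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens

# Toy deterministic "models": the draft often agrees with the target,
# but not always.
target = lambda ts: (3 * sum(ts) + 1) % 17
draft = lambda ts: target(ts) if sum(ts) % 5 else (sum(ts) + 2) % 17

assert speculative_decode(target, draft, [1, 2], 10) == greedy_decode(target, [1, 2], 10)
```

The speedup comes from the verification: checking k drafted tokens takes roughly one target-model pass instead of k, so whenever the draft guesses well, several tokens are produced per expensive call. (For sampled rather than greedy decoding, the real algorithm uses a rejection-sampling acceptance rule to preserve the target's distribution.)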

Why it matters?

This work is important because it offers a way to significantly speed up the training of the most advanced language models. Faster training means we can iterate on these models more quickly, leading to better AI systems. The fact that it doesn't alter the core learning process and can be combined with other speedup techniques makes it a particularly promising approach for scaling up AI development.

Abstract

RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models, or even techniques such as Eagle3, which are traditionally applied after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to a 2.5x end-to-end training speedup at 235B scale.