Model-Task Alignment Drives Distinct RL Outcomes
Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
2025-09-01
Summary
This paper investigates some surprising but promising results that researchers have seen when using reinforcement learning to improve large language models, like the ones powering chatbots. It tries to figure out *why* these techniques sometimes work and why they sometimes fail.
What's the problem?
Recently, researchers noticed some odd things happening when they trained large language models with reinforcement learning. For example, sometimes just *one* training example was enough to get good results, sometimes the reward signal didn't need to be very accurate, and sometimes training the model only on what it did *wrong* worked just as well as rewarding it for doing things right. The problem was that nobody understood *when* these surprising shortcuts would actually work and when they would fail, making them hard to rely on.
What's the solution?
The researchers found that these unusual results show up only when the language model *already* has a pretty good grasp of the task it is being asked to do. They call this 'Model-Task Alignment'. If the model already knows a lot about what's expected, these shortcuts can be very effective. However, if the task is difficult and the model is essentially starting from scratch, the standard, more reliable reinforcement learning methods are still the best way to go. They tested this idea across different model architectures and task domains to back it up.
Why it matters?
This work is important because it helps us understand the limits of these new, potentially faster and cheaper ways to train language models. It shows that these techniques aren't magic bullets – they only work under specific conditions. Knowing this allows researchers to use them effectively when appropriate, and to stick with more proven methods when the task is too challenging for these shortcuts to work.
Abstract
Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold (and, critically, when they fail) remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
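For context on the pass@k accuracy used above to quantify Model-Task Alignment: pass@k estimates the probability that at least one of k sampled completions solves a given problem. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), assuming n completions are sampled per problem and c of them are verified correct; whether the paper computes pass@k in exactly this way is an assumption.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: number of sampled completions, c: number of correct completions, k: budget.
    Implements pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a stable product.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any subset of size k must contain a correct one.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 16 samples per problem, 4 of them correct, estimate pass@8.
print(pass_at_k(n=16, c=4, k=8))  # ~0.962
```

Averaging this estimate over all problems in a task gives the task-level pass@k score; a high value before any RL training is the kind of strong model-task alignment the paper points to.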