Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

2025-10-31

Summary

This paper introduces a new way to train smaller language models to become better at complex problem-solving, specifically tasks that require multiple steps of reasoning.

What's the problem?

Current methods for training these smaller models have limitations. Reinforcement Learning with Verifiable Rewards (RLVR) often fails because the model rarely samples a correct answer even after many tries, so it receives almost no learning signal. Supervised Fine-Tuning (SFT), while sometimes successful, can cause the model to rigidly copy the training examples token by token without truly understanding *how* to solve the problem, and it struggles with longer, more complex tasks.

What's the solution?

The researchers developed a technique called Supervised Reinforcement Learning (SRL). Instead of training the model to directly output the final answer, SRL reformulates problem solving as generating a sequence of logical 'actions', with the model producing an internal reasoning monologue before committing to each action. The model is then rewarded based on how closely each of its actions resembles the corresponding step in an expert's solution, even if the final answer isn't correct. This step-wise supervision provides more helpful feedback during training and encourages the model to learn a flexible approach to problem-solving rather than rote imitation.
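The step-wise reward idea can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual reward function: here `difflib.SequenceMatcher` stands in for whatever similarity measure SRL uses, and the action strings are invented examples.

```python
import difflib

def stepwise_similarity_reward(model_actions, expert_actions):
    """Toy per-step reward: score each model action by its textual
    similarity to the matching expert action (illustrative only).

    Each reward lies in [0, 1], so the model earns partial credit
    even when no rollout reaches the correct final answer.
    """
    rewards = []
    for model_act, expert_act in zip(model_actions, expert_actions):
        # SequenceMatcher.ratio() returns a similarity score in [0, 1];
        # identical strings score 1.0.
        score = difflib.SequenceMatcher(None, model_act, expert_act).ratio()
        rewards.append(score)
    return rewards

# Hypothetical expert trajectory vs. a model rollout that gets one step wrong.
expert = ["factor the quadratic", "set each factor to zero", "solve for x"]
model = ["factor the quadratic", "divide both sides by x", "solve for x"]
rewards = stepwise_similarity_reward(model, expert)
```

In this sketch the first and third steps earn full reward while the mismatched middle step still earns a partial score, which is the intuition behind SRL's smoother learning signal compared to an all-or-nothing final-answer reward.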

Why it matters?

This work is important because it allows smaller, more accessible language models to tackle challenging reasoning problems that were previously beyond their capabilities. It also shows that initializing training with SRL and then refining with Reinforcement Learning yields even better results, and that the approach isn't limited to reasoning benchmarks: it also transfers to practical domains like agentic software engineering.

Abstract

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.