On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang
2026-01-13
Summary
This paper investigates how well supervised fine-tuning and reinforcement learning work together when improving large language models, specifically those designed for reasoning. It challenges the assumption that these two training stages can be optimized independently of each other.
What's the problem?
Large language models are often improved after their initial training by using supervised fine-tuning, where the model learns from examples of correct answers, and reinforcement learning, where the model learns by receiving rewards for good responses. Researchers noticed that these two methods are usually alternated, but no one had proven *why* they need to be interleaved, or whether they could be done separately without losing performance. The core issue is whether improvements made by one method undo the benefits of the other.
What's the solution?
The researchers mathematically proved that supervised fine-tuning and reinforcement learning cannot be decoupled. They showed that if you first do supervised fine-tuning and *then* reinforcement learning, the reinforcement learning necessarily increases the supervised fine-tuning loss, undoing part of what the model learned from the expert examples. Conversely, if you do reinforcement learning first, the subsequent supervised fine-tuning lowers the reward the reinforcement learning had achieved. They then confirmed these findings with experiments on the Qwen3-0.6B language model, showing that performance degrades when the two methods are decoupled.
Why it matters?
This research is important because it provides a theoretical understanding of why alternating supervised fine-tuning and reinforcement learning is so effective. It explains that the two methods are fundamentally linked, and that trying to optimize them independently will likely lead to a less capable model. This knowledge can help researchers develop more efficient and effective training strategies for large language models, ultimately leading to better AI systems.
Abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under SFT optimality and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance during post-training.
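As a rough illustration of the SFT-then-RL coupling (not the paper's formal proof, and far simpler than the actual setup), consider a toy softmax policy over three candidate responses. SFT has put most of the probability mass on the expert response, but a rule-based verifier rewards a different response. All values and names here are hypothetical; a single REINFORCE-style ascent step on expected reward then raises the cross-entropy loss against the expert response:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy policy over 3 candidate responses, parameterized by logits.
# After "SFT", most mass sits on the expert response (index 0).
logits = np.array([2.0, 0.0, 0.0])
expert = 0                            # response the SFT data demonstrates
rewards = np.array([0.0, 1.0, 0.0])   # verifier rewards response 1 instead

def ce_loss(logits, expert):
    """SFT objective: cross-entropy against the expert response."""
    return -np.log(softmax(logits)[expert])

def rl_grad(logits, rewards):
    """Gradient of expected reward E[r] = p . r w.r.t. the logits.

    For a softmax policy, d E[r] / d logit_k = p_k * (r_k - E[r]).
    """
    p = softmax(logits)
    return p * (rewards - p @ rewards)

ce_before = ce_loss(logits, expert)
logits_after = logits + 1.0 * rl_grad(logits, rewards)  # one RL ascent step
ce_after = ce_loss(logits_after, expert)
print(ce_after > ce_before)  # True: the RL step raised the SFT loss
```

The same mechanism runs in reverse: a cross-entropy gradient step toward the expert response would pull mass away from the rewarded response and lower expected reward, mirroring the paper's RL-then-SFT direction.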