Rethinking Expert Trajectory Utilization in LLM Post-training
Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
2025-12-16
Summary
This paper investigates how best to improve large language models after their initial training, specifically by using examples of how experts would perform certain tasks (expert trajectories). It focuses on how to combine an initial stage of 'supervised fine-tuning' (SFT) with a more advanced technique called 'reinforcement learning' (RL).
What's the problem?
Currently, it's unclear how to make the best use of expert examples when further training these models. There are different approaches: train on the expert examples first and then apply reinforcement learning, or try to do both at the same time. The problem is that doing both at once can be unstable and doesn't always lead to the best results. The paper aims to understand *when* to switch from learning from the expert examples to using reinforcement learning, and how much expert data is needed.
What's the solution?
The researchers developed a theoretical framework called the 'Plasticity-Ceiling Framework' to analyze this process. They found that training on expert examples *first* (SFT) and *then* applying reinforcement learning (RL) consistently works best. They also discovered that the timing of the switch to reinforcement learning is crucial: it's best to switch while the model is still benefiting from the expert data, but before it starts to overfit (memorize) that data too heavily. Finally, they showed that the *amount* of expert data matters more than how difficult the tasks in that data are, and that the model's validation loss during SFT can be used to pick the best expert examples.
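The switching rule described above can be sketched as a simple early-transition heuristic: watch the SFT validation loss and hand off to RL once it stops improving, i.e. once SFT has entered its stable or mild-overfitting sub-phase. This is a minimal illustrative sketch; the function name, `patience`, and `tol` are our own choices for the example, not details from the paper.

```python
def choose_switch_epoch(val_losses, patience=2, tol=1e-3):
    """Pick the epoch at which to transition from SFT to RL.

    Heuristic: switch once the validation loss has failed to improve
    by more than `tol` for `patience` consecutive epochs -- the point
    where SFT gains have plateaued but heavy memorization has not yet
    eroded the model's plasticity for RL.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - tol:
            best = loss   # still improving: keep running SFT
            stale = 0
        else:
            stale += 1    # no meaningful improvement this epoch
            if stale >= patience:
                return epoch  # plateau detected: switch to RL here
    return len(val_losses) - 1  # never plateaued: switch at the end


# Toy SFT validation-loss curve: improves, then flattens out.
losses = [2.0, 1.5, 1.2, 1.21, 1.25, 1.30]
print(choose_switch_epoch(losses))
```

In practice the same quantity (the minimum SFT validation loss) doubles as the paper's indicator for ranking candidate sets of expert trajectories.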
Why it matters?
This research provides practical guidelines for anyone trying to improve large language models using expert data. It helps determine the optimal strategy for combining different training techniques and maximizing the model's performance. This is important because it allows us to get the most out of valuable expert knowledge and build more capable and reliable AI systems.
Abstract
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.