Towards a Unified View of Large Language Model Post-Training

Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou

2025-09-05

Summary

This paper explores how different methods for improving large language models after their initial training – specifically using data generated by the model itself versus data created by humans – are actually connected. It shows that they are not separate approaches but instances of the same underlying optimization process, applied to different kinds of data.

What's the problem?

Currently, there are two main ways to fine-tune language models after they've been pre-trained: Reinforcement Learning (RL), which uses data the model generates by trying things out, and Supervised Fine-Tuning (SFT), which uses examples provided by humans. It wasn't clear whether these methods were fundamentally different or whether a unifying principle connected them. Essentially, researchers were treating them as separate techniques without understanding their deeper relationship.

What's the solution?

The researchers developed a new mathematical framework called the Unified Policy Gradient Estimator. It shows that both RL and SFT compute gradients of a common objective and differ only in the data distribution they assume and the bias-variance tradeoff they accept during learning. Motivated by this, they also created a new algorithm, Hybrid Post-Training (HPT), which dynamically switches between learning from human-provided demonstrations and letting the model explore through its own rollouts, aiming for the best of both worlds: effective use of demonstrations and stable exploration that does not erase reasoning patterns the model has already learned.
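
To make the switching idea concrete, here is a minimal, hypothetical Python sketch of a hybrid training loop. Everything in it – the toy "model", the verifier-style accuracy check, and the 0.5 threshold – is an illustrative assumption rather than the authors' implementation; it only shows the shape of choosing between an SFT-style update and an RL-style update per problem.

```python
import random

# Hypothetical sketch of the dynamic-switching idea behind HPT.
# The toy "model", the accuracy check, and the threshold are assumptions
# made for illustration, not the paper's actual code.

def rollout_accuracy(model, prompt, n_samples=4):
    """Sample a few rollouts and return the fraction a verifier marks correct."""
    return sum(random.random() < model["skill"] for _ in range(n_samples)) / n_samples

def sft_update(model, prompt, demonstration):
    """Offline signal: supervised step that imitates the human demonstration."""
    model["skill"] = min(1.0, model["skill"] + 0.05)

def rl_update(model, prompt):
    """Online signal: policy-gradient-style step reinforcing the model's own rollouts."""
    model["skill"] = min(1.0, model["skill"] + 0.02)

def hybrid_post_training(model, dataset, accuracy_threshold=0.5):
    """For each problem, pick the training signal from current rollout performance:
    weak performance  -> exploit the demonstration (SFT-style update);
    adequate performance -> keep exploring with RL so learned reasoning
    patterns are not overwritten."""
    for prompt, demonstration in dataset:
        if rollout_accuracy(model, prompt) < accuracy_threshold:
            sft_update(model, prompt, demonstration)
        else:
            rl_update(model, prompt)

if __name__ == "__main__":
    toy_model = {"skill": 0.2}  # stand-in for model parameters
    toy_data = [("solve: 12 * 7", "12 * 7 = 84")] * 20
    hybrid_post_training(toy_model, toy_data)
    print(f"final skill: {toy_model['skill']:.2f}")
```

The design intuition matches the paper's framing: demonstrations are most useful when the model cannot yet solve a problem on its own, while rollout-based updates preserve and extend reasoning patterns it has already acquired.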

Why it matters?

This work is important because it provides a more complete understanding of how to improve language models after pre-training. By showing the connection between RL and SFT, it opens the door to more effective and stable training methods. In the paper's experiments, HPT consistently outperforms strong baselines on challenging mathematical reasoning benchmarks across models of different sizes and families, suggesting a practical way to build more capable AI systems.

Abstract

Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
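
For readers who want to see how the four interchangeable parts named above could fit together, the LaTeX sketch below assembles them into one schematic expression. The notation is mine and only illustrative of the structure the abstract describes, not the paper's exact formula.

```latex
% Minimal, compilable sketch (illustrative notation, not the paper's own symbols)
% of a policy-gradient estimator assembled from the four interchangeable parts
% named in the abstract: stabilization mask, reference policy denominator,
% advantage estimate, and likelihood gradient.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\[
  \nabla_{\theta}\,\mathcal{J}(\theta)
  \;\approx\;
  \mathbb{E}_{(x,\,y)\sim\mathcal{D}}\!\left[
    \underbrace{\mathbf{1}_{\mathrm{mask}}}_{\text{stabilization mask}}
    \cdot
    \frac{\overbrace{\hat{A}(x,y)}^{\text{advantage estimate}}}
         {\underbrace{\pi_{\mathrm{ref}}(y\mid x)}_{\text{reference policy denominator}}}
    \cdot
    \underbrace{\nabla_{\theta}\,\pi_{\theta}(y\mid x)}_{\text{likelihood gradient}}
  \right]
\]
% Taking D as human demonstrations with pi_ref = pi_theta and \hat{A} = 1
% collapses the bracket to \nabla_theta log pi_theta(y|x), i.e. the SFT
% gradient; taking D as model rollouts with a learned or group-relative
% advantage recovers familiar RL-style estimators.
\end{document}
```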