GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang
2026-04-21
Summary
This paper investigates problems with how we currently improve large language models after their initial training, specifically focusing on techniques called supervised fine-tuning and reinforcement learning. It proposes a new method, Group Fine-Tuning, to make this improvement process more stable and effective.
What's the problem?
Currently, improving language models with supervised fine-tuning can be unreliable. The process often fixates on just a few successful examples, loses the variety in its responses, and can suffer unstable learning where small changes lead to big swings in performance. This happens because the feedback the model gets from individual examples is very limited and inconsistent: each training example rewards only one exact answer, which makes it hard for the model to generalize well and build on its existing knowledge.
What's the solution?
The researchers developed Group Fine-Tuning, or GFT, which tackles these issues in two main ways. First, it creates groups of good responses to a prompt and uses these groups to provide more balanced and informative feedback, preventing the model from fixating on just one path. Second, it adjusts how much weight the model gives to different examples during learning, preventing the learning process from becoming too erratic and ensuring the model still benefits from its initial training. Essentially, GFT provides more consistent and reliable guidance during the improvement phase.
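The two mechanisms can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: it assumes each response in a group receives a scalar reward, and the function names, normalization, and clipping bound are our assumptions.

```python
import numpy as np

def group_advantages(rewards):
    """Sketch of Group Advantage Learning: normalize rewards within a
    group of responses to the same prompt, producing zero-mean contrastive
    supervision instead of rewarding a single path. The exact
    normalization in the paper may differ."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids divide-by-zero

def rectified_coefficient(prob, cap=10.0):
    """Sketch of Dynamic Coefficient Rectification: bound the
    inverse-probability weight 1/pi(y|x) so that low-probability tokens
    cannot blow up the gradient. The cap value here is illustrative."""
    return min(1.0 / prob, cap)

# Four sampled responses to one prompt, two judged correct:
adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1, -1, -1, 1]

# A rare token would get weight 1/1e-4 = 10000 under plain SFT;
# rectification caps it:
w_safe = rectified_coefficient(1e-4)  # -> 10.0
```

In this toy example the correct responses receive positive advantages and the incorrect ones negative, so every sample in the group contributes a learning signal, while the capped coefficient keeps any single low-probability token from dominating the update.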
Why it matters?
This research is important because it offers a more stable and effective way to refine large language models. By addressing the limitations of current methods, GFT can lead to models that are better at generating diverse, high-quality text and are more easily integrated with further learning processes like reinforcement learning, ultimately making these powerful AI tools more reliable and useful.
Abstract
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
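The abstract's claim that SFT is an implicit policy-gradient update with a sparse reward and inverse-probability weighting follows from a standard identity (the notation below is ours, not necessarily the paper's). For a prompt $x$ with demonstration $y^\star$:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
= -\,\nabla_\theta \log \pi_\theta(y^\star \mid x)
= -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \!\left[\frac{\mathbb{1}[y = y^\star]}{\pi_\theta(y \mid x)}
  \,\nabla_\theta \log \pi_\theta(y \mid x)\right].
```

This is a REINFORCE-style gradient with implicit reward $r(y) = \mathbb{1}[y = y^\star]/\pi_\theta(y \mid x)$: it is nonzero only on the single demonstrated response (extreme sparsity) and scaled by an inverse probability that is unbounded when $\pi_\theta(y^\star \mid x)$ is small (unstable weighting), which is the diagnosis GFT's two mechanisms are designed to address.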