On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou

2025-08-21

Summary

This paper explores how to combine two methods, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), to make Large Language Models (LLMs) better after their initial training. It introduces a new technique called CHORD that carefully blends these methods to avoid common problems like messing up what the model already knows or making it too specialized in one area. CHORD treats SFT as a helpful assistant within the RL process, using smart ways to balance learning from expert examples and exploring new possibilities to improve the model's performance.

What's the problem?

When trying to improve Large Language Models (LLMs) using methods like Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), there's a risk: combining these techniques can disrupt the useful behaviors the model already learned, or make it overfit, meaning it becomes very good at the specific examples it saw during training but worse at anything new. This paper tackles the challenge of keeping the model balanced and able to generalize while still improving it.

What's the solution?

The researchers propose a new framework called CHORD. Instead of treating SFT as a separate stage, CHORD integrates it directly into the Reinforcement Learning (RL) process as a dynamically weighted auxiliary objective. It views the two methods through a shared lens: on-policy learning (the model learning from its own generated outputs) versus off-policy learning (the model imitating external expert data, which is what SFT does). CHORD then applies a dual-control weighting system. First, a global coefficient guides the overall shift from imitating expert data toward exploring on its own. Second, a token-wise weighting function makes fine-grained, token-by-token adjustments, so the model learns from the most informative expert tokens without disrupting its own exploration, reducing the negative impact of off-policy data.
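The dual-control idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the linear decay schedule for the global coefficient and the specific token weight w(p) = p(1 - p) (which down-weights expert tokens the policy already predicts confidently as well as tokens far off-policy) are assumptions for illustration; the function names `global_mu`, `token_weight`, and `chord_loss` are hypothetical.

```python
import math

def global_mu(step, total_steps, mu_max=0.9, mu_min=0.05):
    """Global coefficient: decays from imitation-heavy (mu_max) toward
    exploration-heavy (mu_min) over training. The linear schedule here
    is an illustrative assumption."""
    frac = min(step / total_steps, 1.0)
    return mu_max + frac * (mu_min - mu_max)

def token_weight(p):
    """Token-wise weight w(p) = p * (1 - p): near zero when the policy
    already agrees with the expert token (p ~ 1) or is far off-policy
    (p ~ 0), concentrating learning on informative middle-ground tokens."""
    return p * (1.0 - p)

def chord_loss(rl_loss, expert_token_probs, step, total_steps):
    """Blend the on-policy RL loss with a token-weighted SFT (negative
    log-likelihood) term on expert tokens, mixed by the global coefficient."""
    mu = global_mu(step, total_steps)
    sft = sum(token_weight(p) * (-math.log(max(p, 1e-12)))
              for p in expert_token_probs) / len(expert_token_probs)
    return (1.0 - mu) * rl_loss + mu * sft
```

Early in training `mu` is large, so weighted imitation of expert tokens dominates; later the on-policy RL term takes over, matching the intended transition from off-policy imitation to on-policy exploration.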

Why it matters?

This research is important because it provides a more stable and efficient way to improve LLMs. By carefully blending the two learning strategies, CHORD helps produce models that are not only more capable but also well-aligned, able to adapt to new situations without forgetting what they already know. This could lead to more reliable and capable AI systems in the future.

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.