How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
2026-04-17
Summary
This paper investigates why using data created by a powerful AI model to improve a less capable one doesn't always work, and proposes a new method to make it more effective.
What's the problem?
A common technique to boost AI performance is to have a strong AI model generate training data for a weaker one. However, the researchers found that this often fails to help, and can actually *hurt* the performance of newer, reasoning-focused models like Qwen3-8B. The issue is a stylistic mismatch: the way the strong model writes differs substantially from how the weaker model writes, and this mismatch confuses the weaker model during training.
What's the solution?
The researchers developed a framework called TESSY (Teacher-Student Cooperation Data Synthesis). Instead of letting the strong model write entire responses, TESSY has the teacher and student models generate each response together: the teacher supplies the core reasoning tokens, while the student fills in the stylistic tokens. The synthesized data therefore carries the teacher's reasoning but reads in the student's own voice, making it more natural for the student to learn from.
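The interleaving idea can be sketched as a decoding loop that alternates between two models. This is a toy illustration only: the paper's actual routing rule for deciding which tokens count as "style" is not described here, so the `is_style` predicate, the callable model interfaces, and the token vocabulary below are all assumptions made for the sake of the example.

```python
def synthesize(prompt, teacher, student, is_style, max_tokens=16):
    """Interleaved teacher-student generation (illustrative sketch).

    At each step the student proposes the next token. If the proposal is
    a stylistic token (per the assumed is_style predicate), it is kept;
    otherwise the teacher supplies the reasoning token instead. The result
    mixes teacher reasoning content with student-native style.
    """
    tokens = []
    for _ in range(max_tokens):
        context = prompt + "".join(tokens)
        proposal = student(context)          # student proposes first
        if is_style(proposal):
            tokens.append(proposal)          # keep student's style token
        else:
            tokens.append(teacher(context))  # defer to teacher for reasoning
    return "".join(tokens)


def demo():
    # Mock models with a fixed script so the example is deterministic.
    style_vocab = {" so", " then"}           # assumed style-token set
    script = iter([" so", " x=1", " then", " y=2"])
    student = lambda ctx: next(script)       # scripted student proposals
    teacher = lambda ctx: " <teacher>"       # stand-in for a strong model
    return synthesize("Q:", teacher, student,
                      lambda t: t in style_vocab, max_tokens=4)


print(demo())  # → " so <teacher> then <teacher>"
```

In this sketch, stylistic connectives come from the student while substantive steps are replaced by teacher output; a real implementation would operate on model logits rather than scripted strings.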
Why it matters?
This research is important because it identifies a key limitation of a popular AI training method and offers a practical solution. By addressing the stylistic differences between AI models, TESSY significantly improves the ability of smaller models to learn and perform complex reasoning tasks, like writing code, leading to better overall AI performance.
Abstract
A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.