A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning
Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue
2025-07-15
Summary
This paper presents a practical method for training large language models to solve math problems more effectively. It combines two training stages: supervised fine-tuning on a large set of worked examples, followed by reinforcement learning that makes the model both more efficient and more accurate.
What's the problem?
Models trained only with supervised fine-tuning often fall short of their full potential in mathematical reasoning, while reinforcement learning on its own can be unstable or even reduce accuracy. There was no established recipe for combining the two to get the best of both worlds.
What's the solution?
The solution is a two-stage recipe. First, extended supervised fine-tuning pushes the model's accuracy as high as possible. Then a reinforcement learning method called GRPO (Group Relative Policy Optimization) trains the model to generate shorter, more efficient solutions without sacrificing accuracy. The two-stage process was validated on difficult math benchmarks, including the AI Mathematical Olympiad, a prestigious competition for AI models.
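The core idea behind the GRPO stage can be sketched with a toy example. GRPO scores each sampled solution relative to the other samples for the same problem, rather than using a learned value model. The snippet below is a minimal illustration, not the paper's implementation: the group-normalized advantage follows the standard GRPO formulation, while the per-token length penalty in `reward` is a hypothetical shaping term standing in for whatever efficiency signal the authors used.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled solution is scored
    against the mean and standard deviation of its own sample group,
    so no separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def reward(correct, num_tokens, length_penalty=0.001):
    """Toy reward (assumption, not the paper's): 1 for a correct final
    answer, minus a small per-token penalty to favor shorter solutions."""
    return (1.0 if correct else 0.0) - length_penalty * num_tokens

# Four sampled solutions to one problem: (is_correct, token_count)
group = [(True, 200), (True, 400), (False, 300), (True, 100)]
advs = grpo_advantages([reward(c, n) for c, n in group])
```

Because advantages are centered within each group, a correct answer that is also short receives the largest positive advantage, which is how this setup can trim solution length without rewarding wrong-but-short outputs.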
Why it matters?
This is important because it demonstrates a reliable recipe for training AI models that solve math problems both more accurately and more efficiently. That advances AI capabilities in fields demanding strong reasoning and makes such models more practical for real-world use.
Abstract
A combined training approach using extended supervised fine-tuning and reinforcement learning from online inference enhances the mathematical reasoning of large language models, achieving top performance in benchmarks like the AI Mathematical Olympiad.