Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
2025-05-27
Summary
This paper introduces a training technique called Negative-aware Fine-Tuning (NFT), which helps large language models get better at solving math problems by learning from both correct and incorrect answers.
What's the problem?
Reinforcement learning methods can improve AI math reasoning, but they are often complex to set up and tune. Regular supervised learning, on the other hand, typically trains the model only on correct answers, so the model never learns from its own mistakes.
What's the solution?
The authors introduce NFT, a supervised training scheme in which the model also learns from negative feedback, that is, from its own incorrect answers. This lets the model improve at math reasoning without the complex machinery of reinforcement learning, while matching RL methods in performance.
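To make the idea concrete, here is a minimal toy sketch of a negative-aware objective: reward the likelihood the model assigns to correct answers and penalize the likelihood it assigns to its own wrong answers. The function name, the `neg_weight` parameter, and the weighted-sum form are illustrative assumptions for this summary, not the paper's exact NFT formulation.

```python
def negative_aware_loss(pos_logprobs, neg_logprobs, neg_weight=0.5):
    """Toy negative-aware objective (illustrative, not the paper's exact loss).

    pos_logprobs: average log-probabilities of the model's correct answers
    neg_logprobs: average log-probabilities of the model's incorrect answers
    neg_weight:   how strongly to penalize likelihood on wrong answers
    """
    # Standard supervised term: maximize likelihood of correct answers
    pos_term = -sum(pos_logprobs) / max(len(pos_logprobs), 1)
    # Negative-aware term: push probability mass away from wrong answers
    neg_term = sum(neg_logprobs) / max(len(neg_logprobs), 1)
    return pos_term + neg_weight * neg_term

# Example with made-up per-answer log-probabilities
correct = [-0.2, -0.5]
wrong = [-1.0, -3.0]
loss = negative_aware_loss(correct, wrong)
```

With `neg_weight=0` this reduces to ordinary supervised fine-tuning on correct answers only; the negative term is what lets the model learn from its mistakes.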
Why does it matter?
This matters because it makes training AI models for math reasoning simpler and more efficient, letting more people apply these models to tutoring, homework help, or research without mastering advanced reinforcement learning techniques.
Abstract
Negative-aware Fine-Tuning (NFT) enhances LLMs' math abilities using supervised learning with negative feedback, achieving performance comparable to RL methods.