Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
2025-06-02
Summary
This paper introduces a new technique called Reinforcement Distillation (REDI), which helps large language models get better at reasoning by learning from both correct and incorrect examples produced by a 'teacher' AI.
What's the problem?
Most AI models are trained only on correct answers, so they miss the valuable lessons hidden in mistakes and wrong answers. Discarding those negative examples leaves their reasoning skills weaker than they could be.
What's the solution?
The researchers developed REDI, which lets the AI learn from both positive (correct) and negative (incorrect) examples, entirely offline and with limited data. By paying attention to what not to do as well as what to do, the model reasons much more effectively and reaches top performance with less data than usual.
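The core idea of combining both kinds of examples can be illustrated with a toy training objective. This is only an illustrative sketch, not the paper's actual formula: the function name `redi_style_loss`, the averaging scheme, and the `neg_weight` down-weighting factor are all assumptions made for this example. The intuition it captures is that the model is rewarded for assigning high probability to correct traces and penalized for assigning high probability to incorrect ones.

```python
def redi_style_loss(pos_logprobs, neg_logprobs, neg_weight=0.5):
    """Toy objective mixing positive and negative reasoning traces.

    pos_logprobs: model log-probabilities of correct teacher traces
    neg_logprobs: model log-probabilities of incorrect teacher traces
    neg_weight:   how strongly to push probability away from bad traces
                  (a value < 1 is an assumption of this sketch, meant to
                  keep the negative signal from dominating training)
    """
    # Standard negative log-likelihood on the good traces:
    # lower loss when the model assigns them high probability.
    pos_term = -sum(pos_logprobs) / len(pos_logprobs)
    # Opposite sign for the bad traces: lower loss when the model
    # assigns them LOW probability.
    neg_term = sum(neg_logprobs) / len(neg_logprobs)
    return pos_term + neg_weight * neg_term


# Example: the model already likes the correct traces (log-probs near 0)
# and dislikes the incorrect ones (very negative log-probs), so the
# combined loss is low.
loss = redi_style_loss([-0.1, -0.2], [-2.0, -3.0])
```

In this sketch, training would minimize the loss, simultaneously pulling probability toward correct traces and away from incorrect ones, which is the qualitative behavior the paper attributes to using both signal types.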
Why it matters?
This matters because it shows that teaching AI from both successes and failures makes it smarter and more reliable, which is useful for building better AI assistants, tutors, and any system that needs strong reasoning skills.
Abstract
Reinforcement Distillation (REDI) leverages both positive and negative traces to enhance large language model reasoning performance offline, outperforming traditional methods and achieving state-of-the-art results with limited open data.