The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
2025-05-30
Summary
This paper shows that large language models (LLMs) can still learn to reason well even when the reward signals they receive during post-training are noisy or partially incorrect.
What's the problem?
When LLMs are trained to reason through problems with reinforcement learning, the reward signals they receive, often produced by imperfect reward models or verifiers, can be unreliable or inconsistent, which could steer the model toward the wrong ways of thinking.
What's the solution?
The researchers found that these models are surprisingly robust and can tolerate substantial reward noise, provided training rewards good reasoning behavior. Their method, reasoning pattern rewards (RPR), scores a response on whether it exhibits key reasoning patterns rather than only on final-answer correctness. Used on its own, or used to calibrate the usual noisy reward models, RPR still lets models reach high-level reasoning performance.
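Below is a minimal sketch of what a pattern-based reward of this kind could look like. The phrase list, the scoring scheme, the mixing weight alpha, and the function names are all illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative sketch of a reasoning pattern reward (RPR).
# The patterns and weights below are assumptions for demonstration;
# the paper's actual phrase set and scoring may differ.

REASONING_PATTERNS = [
    "first, i need to",
    "let me break this down",
    "to verify",
    "on the other hand",
    "therefore",
]

def reasoning_pattern_reward(response: str) -> float:
    """Score a response by the fraction of key reasoning phrases it contains."""
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PATTERNS)
    return hits / len(REASONING_PATTERNS)

def calibrated_reward(response: str, noisy_reward: float, alpha: float = 0.5) -> float:
    """Blend a (possibly noisy) reward-model score with RPR.

    A low noisy_reward on a response rich in reasoning patterns may be a
    false negative, so RPR softens it; alpha is a hypothetical mixing weight.
    """
    return alpha * noisy_reward + (1 - alpha) * reasoning_pattern_reward(response)

if __name__ == "__main__":
    resp = ("First, I need to isolate x. Therefore, dividing both sides "
            "by 2 gives x = 3. To verify, 2 * 3 = 6.")
    print(reasoning_pattern_reward(resp))             # pattern-only reward
    print(calibrated_reward(resp, noisy_reward=0.0))  # softens a likely false negative
```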
Why does it matter?
This matters because it means LLMs can be trained to reason effectively even when perfect feedback isn't available, making it cheaper and more practical to build systems that think problems through in messy, real-world settings.
Abstract
LLMs exhibit substantial robustness to reward noise during post-training and achieve high reasoning performance using reasoning pattern rewards (RPR) in conjunction with noisy reward models.