Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas
2025-07-29
Summary
This paper introduces RLCR (Reinforcement Learning with Calibration Rewards), a training method that teaches language models not only to get answers right, but also to express confidence that matches how likely those answers are to be correct.
What's the problem?
Language models often state wrong answers in a confident tone, or hedge on answers that are actually correct. This mismatch between stated confidence and actual accuracy makes their reasoning hard to trust.
What's the solution?
RLCR modifies the reinforcement learning reward: instead of rewarding correctness alone, it also rewards the model when its stated confidence matches its actual accuracy. This calibration reward trains the model to recognize when it knows something well and when it does not.
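To make the idea concrete, here is a minimal sketch of a correctness-plus-calibration reward. It assumes the calibration term is a Brier score on the model's verbalized confidence; the exact reward formula and weighting used in the paper may differ.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Sketch of a reward combining correctness with confidence calibration.

    is_correct: whether the model's answer matched the reference answer.
    confidence: the model's stated probability (0.0-1.0) that it is correct.
    """
    y = 1.0 if is_correct else 0.0
    # Brier-style penalty: confidence far from the actual outcome loses reward,
    # so confident-and-wrong and unsure-and-right are both discouraged.
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty
```

Under this sketch, a correct answer with full confidence earns the maximum reward (1.0), while a wrong answer stated with full confidence is penalized (-1.0), which is exactly the mismatch the method targets.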
Why does it matter?
Models that understand their own uncertainty are more reliable and safer to deploy, especially on hard reasoning tasks, where knowing what you don't know can prevent costly mistakes.
Abstract
RLCR, a reinforcement learning approach with calibration rewards, enhances both accuracy and confidence calibration in language models for reasoning tasks.