Learning to Reason without External Rewards
Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
2025-05-27

Summary
This paper introduces Intuitor, a new way to train large language models to reason better by using their own confidence in their answers as the reward, instead of relying on outside feedback or correct answers.
What's the problem?
Most AI models need large amounts of external rewards or verified correct answers to learn how to reason well, and those signals can be hard to obtain, especially for complicated or new problems where there isn't much training data.
What's the solution?
The researchers developed Intuitor, which has the model judge its own answers and use its level of certainty, called self-certainty, as the guide for learning (see the sketch below). This means the model can teach itself to reason better without large amounts of labeled data, while still performing as well as other top methods on standard benchmarks.
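To make the idea concrete, here is a minimal sketch (not the authors' released code) of the kind of confidence score Intuitor uses. It assumes self-certainty is measured as the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, computed over the tokens of a generated response; the function name and tensor shapes are illustrative.

```python
# Minimal sketch of a self-certainty score: how far the model's next-token
# distribution is from uniform, averaged over a generated response.
# Names and shapes are illustrative, not taken from the paper's codebase.
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) next-token logits for the response tokens.

    Returns a scalar score; higher means the model is more confident
    (its predictive distribution is farther from uniform).
    """
    log_probs = F.log_softmax(logits, dim=-1)   # log p(token | prefix) per position
    vocab_size = logits.size(-1)
    # KL(U || p) at each position = -log(V) - (1/V) * sum_j log p_j
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()                  # average over the sequence
```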
Why it matters?
This is important because it could make AI models much easier and cheaper to train, helping them get better at reasoning on their own and making them more useful for solving new or difficult problems.
Abstract
Intuitor, a Reinforcement Learning from Internal Feedback (RLIF) method, uses a model's own self-certainty as its sole reward signal, enabling unsupervised fine-tuning of large language models; it achieves performance comparable to GRPO on in-domain benchmarks and superior generalization to out-of-domain tasks.
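Since the abstract positions self-certainty as a stand-in for the external reward in GRPO, the following hedged sketch shows one way such a score could be plugged into a GRPO-style update: sample several responses per prompt, score each with a function like the one above, and normalize the scores within the group to obtain advantages. The group normalization follows standard GRPO practice; the function name is illustrative and not from the paper.

```python
import torch


def group_advantages(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """scores: (group_size,) self-certainty of each sampled response to one prompt.

    Returns group-normalized advantages that would weight the policy-gradient
    loss in a GRPO-style objective, with self-certainty standing in for an
    external reward.
    """
    return (scores - scores.mean()) / (scores.std() + eps)
```

Normalizing within the group means a response is rewarded for being relatively more confident than the model's other samples for the same prompt, which is what lets a purely internal signal drive learning.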