Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen

2025-02-11

Summary

This paper introduces OREAL, a new method for teaching AI models to solve complex math problems with reinforcement learning. The researchers found a way to make smaller AI models perform as well as much larger ones on difficult math tasks.

What's the problem?

Current AI models struggle with complex reasoning tasks, especially in math. While some companies (such as OpenAI with its o-series models) have made progress, they haven't shared the full details of how they did it. It is also hard to train AI on math problems with reinforcement learning, because the feedback is just a right-or-wrong signal that says little about which steps in the reasoning were good or bad.

What's the solution?

The researchers created OREAL, a reinforcement learning framework that relies only on binary outcome rewards (whether the final answer is right or wrong). The model learns by imitating correct solutions drawn from best-of-N sampling, and the rewards of incorrect solutions are reshaped so the model learns consistently from both. The researchers also developed a token-level reward model that identifies the most important steps in the problem-solving process. Using OREAL, they were able to make a smaller AI model (7 billion parameters) perform as well as much larger models (32 billion parameters) on the challenging MATH-500 benchmark.
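To make the recipe concrete, here is a minimal sketch of the idea: split best-of-N samples by the binary outcome reward, imitate the correct ones, and apply a reshaped (here simply down-weighted) penalty to the incorrect ones. All names and the `neg_weight` value are illustrative assumptions, not taken from the paper.

```python
def best_of_n_split(sample_solution, is_correct, n=8):
    """Draw n candidate solutions from the policy and split them by the
    binary outcome reward (correct vs. incorrect final answer)."""
    candidates = [sample_solution() for _ in range(n)]
    positives = [c for c in candidates if is_correct(c)]
    negatives = [c for c in candidates if not is_correct(c)]
    return positives, negatives

def oreal_style_loss(logp_correct, logp_incorrect, neg_weight=0.5):
    """Behavior-clone on correct trajectories (maximize their log-likelihood)
    and push down the likelihood of incorrect ones with a reshaped weight.
    `neg_weight` is an illustrative hyperparameter, not a paper value."""
    bc_term = -sum(logp_correct) / max(len(logp_correct), 1)
    penalty = neg_weight * sum(logp_incorrect) / max(len(logp_incorrect), 1)
    return bc_term + penalty
```

For example, with one correct trajectory of log-probability -1.0 and one incorrect one of -2.0, `oreal_style_loss([-1.0], [-2.0])` combines an imitation term of 1.0 with a penalty term of -1.0. In the paper, the reshaping of negative rewards is derived so that gradients from positive and negative samples stay consistent; this sketch only conveys the overall shape of the objective.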

Why it matters?

This matters because it shows we can make AI better at complex reasoning without needing huge, expensive models. It could lead to more efficient and capable AI systems for solving difficult problems in math, science, and other fields. The researchers are releasing their code, models, and data, which could help other scientists improve AI reasoning abilities even further.

Abstract

Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques believed with certainty to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties brought by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we additionally apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research: https://github.com/InternLM/OREAL
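For context, the "KL-regularized optimal policy" referenced in the abstract is the maximizer of the standard KL-regularized RL objective, written here in generic RLHF-style notation rather than the paper's own:

```latex
\max_{\pi}\;
\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\;-\;
\beta \, \mathrm{KL}\!\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),
\qquad
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\big( r(x, y) / \beta \big)
```

Here $\pi_{\mathrm{ref}}$ is the reference (initial) policy and $\beta$ controls how far the trained policy may drift from it. The closed-form optimum on the right is a well-known result; the abstract's claim is that with binary rewards $r \in \{0, 1\}$, behavior cloning on BoN-sampled positive trajectories is sufficient to learn this optimal policy.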