Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
2025-05-23
Summary
This paper introduces Think-RM, a framework that trains generative reward models to reason through long, multi-step judgments, so that large language models can be aligned more closely with human preferences.
What's the problem?
The problem is that most reward models judge responses in a single shallow pass and struggle with tasks that require many steps of reasoning, so their preference judgments become unreliable exactly where careful evaluation matters most, and the models trained against them drift away from what humans actually want.
What's the solution?
The researchers created Think-RM, a framework that equips generative reward models with long-horizon reasoning: instead of emitting a score directly, the model works through an extended internal reasoning trace before judging. They pair this with a pairwise RLHF (Reinforcement Learning from Human Feedback) pipeline, which trains the policy directly from comparisons between pairs of responses rather than from a single scalar score per response. Together, these produce judgments, and ultimately model outputs, that are more in line with what people actually want.
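To make the pairwise idea concrete, here is a minimal sketch of how a pairwise judging loop could look. Everything here is illustrative: `generate_verdict` is a hypothetical stand-in for a real generative reward model call (its length-based heuristic is a toy, not the paper's method), and the win-counting scheme is just one simple way to turn pairwise verdicts into a ranking.

```python
def generate_verdict(prompt: str, response_a: str, response_b: str) -> str:
    """Toy stand-in for a generative reward model: in Think-RM the model
    would first produce a long reasoning trace, then a verdict. Here we
    fake the judgment with a trivial heuristic (prefer the longer, more
    detailed answer) purely for illustration."""
    return "A" if len(response_a) >= len(response_b) else "B"


def pairwise_preferences(prompt: str, responses: list[str]) -> dict[int, int]:
    """Compare every pair of candidate responses and count wins, the way a
    pairwise RLHF pipeline consumes comparisons instead of per-response
    scalar scores."""
    wins = {i: 0 for i in range(len(responses))}
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            verdict = generate_verdict(prompt, responses[i], responses[j])
            winner = i if verdict == "A" else j
            wins[winner] += 1
    return wins
```

For example, `pairwise_preferences("q", ["short", "a much longer and more detailed answer", "medium answer"])` returns `{0: 0, 1: 2, 2: 1}` under the toy heuristic, ranking the second response highest. The point of the pairwise setup is that the reward model only ever has to answer "which of these two is better?", a question it can reason about at length, rather than calibrate an absolute score.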
Why does it matter?
This is important because it makes reward models, and the AI systems trained with them, more reliable on real-world tasks that demand careful, step-by-step judgment rather than quick, surface-level scoring.
Abstract
Think-RM is a framework that enhances generative reward models with long-horizon reasoning and a novel pairwise RLHF pipeline to improve end-policy performance in aligning large language models with human preferences.