Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
2025-05-23
Summary
This paper introduces Think-RM, a framework that trains generative reward models to reason through long, multi-step judgments, so that large language models can be aligned more closely with human preferences.
What's the problem?
The problem is that most reward models judge responses in a single shallow pass and struggle with tasks that require many steps of reasoning, so their preference judgments become unreliable exactly where careful evaluation matters most, and the models trained against them drift away from what humans actually want.
What's the solution?
The researchers created Think-RM, a framework that equips generative reward models with long-horizon reasoning: instead of emitting a score directly, the model works through an extended internal reasoning trace before judging. They pair this with a pairwise RLHF (Reinforcement Learning from Human Feedback) pipeline, which trains the policy directly from comparisons between pairs of responses rather than from a single scalar score per response. Together, these produce judgments, and ultimately model outputs, that are more in line with what people actually want.
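To make the pairwise idea concrete, here is a minimal sketch of how a pairwise judging loop could look. Everything here is illustrative: `generate_verdict` is a hypothetical stand-in for a real generative reward model call (its length-based heuristic is a toy, not the paper's method), and the win-counting scheme is just one simple way to turn pairwise verdicts into a ranking.

```python
def generate_verdict(prompt: str, response_a: str, response_b: str) -> str:
    """Toy stand-in for a generative reward model: in Think-RM the model
    would first produce a long reasoning trace, then a verdict. Here we
    fake the judgment with a trivial heuristic (prefer the longer, more
    detailed answer) purely for illustration."""
    return "A" if len(response_a) >= len(response_b) else "B"


def pairwise_preferences(prompt: str, responses: list[str]) -> dict[int, int]:
    """Compare every pair of candidate responses and count wins, the way a
    pairwise RLHF pipeline consumes comparisons instead of per-response
    scalar scores."""
    wins = {i: 0 for i in range(len(responses))}
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            verdict = generate_verdict(prompt, responses[i], responses[j])
            winner = i if verdict == "A" else j
            wins[winner] += 1
    return wins
```

For example, `pairwise_preferences("q", ["short", "a much longer and more detailed answer", "medium answer"])` returns `{0: 0, 1: 2, 2: 1}` under the toy heuristic, ranking the second response highest. The point of the pairwise setup is that the reward model only ever has to answer "which of these two is better?", a question it can reason about at length, rather than calibrate an absolute score.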
Why does it matter?
This is important because it makes reward models, and the AI systems trained with them, more reliable on real-world tasks that demand careful, step-by-step judgment rather than quick, surface-level scoring.
Abstract
Think-RM is a framework that enhances generative reward models with long-horizon reasoning and a novel pairwise RLHF pipeline to improve end-policy performance in aligning large language models with human preferences.