LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
2025-10-10
Summary
This paper studies how well reward models, which are used to train large language models to align with human preferences, can handle long conversations and complex contexts. It introduces a new benchmark for testing these models and a training method to improve them for such longer interactions.
What's the problem?
Currently, reward models are good at judging short responses from language models, focusing on attributes like safety and helpfulness. However, they struggle when a long conversation history must be taken into account. A response should not just be good on its own; it should also make sense *given* everything that was said before. Existing reward models are poor at checking for this consistency in long contexts and often produce incorrect preference judgments when the context is lengthy.
What's the solution?
The researchers created a new benchmark called Long-RewardBench to specifically test reward models in long-context scenarios, and found that even the best existing models were unreliable. Based on the failure patterns they observed, they developed a multi-stage training strategy for building better long-context reward models, which they call LongRMs. This method makes models of various sizes more robust when dealing with long conversations.
Why it matters?
This work is important because, as language models are used in more complex applications like AI agents with ongoing conversations, it becomes crucial that they remember and respond consistently with what has already been discussed. The LongRM models developed in this paper perform surprisingly well: they outperform much larger models and match the quality of a leading proprietary model, suggesting we can build more reliable and helpful AI systems that handle complex, extended interactions.
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
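To make the two task formats mentioned in the abstract concrete, here is a minimal sketch of Pairwise Comparison (pick the better of two responses) and Best-of-N (pick the best of N candidates) given some reward scorer. The `score` function is a hypothetical stand-in, not the paper's model: a real RM would condition on the full long context, while this placeholder merely rewards word overlap with the context.

```python
# Sketch of the two RM evaluation task shapes: Pairwise Comparison and
# Best-of-N. `score` is a hypothetical placeholder for an actual reward
# model, used only to make the selection logic runnable.

def score(context: str, response: str) -> float:
    """Toy stand-in scorer: fraction of response words that also
    appear in the context (higher = preferred). A real RM would be
    a learned model conditioned on the entire conversation history."""
    ctx_words = set(context.split())
    resp_words = response.split()
    if not resp_words:
        return 0.0
    return sum(1 for w in resp_words if w in ctx_words) / len(resp_words)

def pairwise_comparison(context: str, resp_a: str, resp_b: str) -> str:
    """Return 'A' if resp_a is preferred under the scorer, else 'B'."""
    return "A" if score(context, resp_a) >= score(context, resp_b) else "B"

def best_of_n(context: str, responses: list[str]) -> int:
    """Return the index of the highest-scoring of N candidate responses."""
    return max(range(len(responses)), key=lambda i: score(context, responses[i]))

ctx = "the meeting moved to tuesday and the room is b12"
print(pairwise_comparison(ctx, "meeting is tuesday in room b12", "meeting is friday"))  # A
print(best_of_n(ctx, ["see you friday", "tuesday in b12 then", "no idea"]))  # 1
```

The point of the benchmark is precisely that judgments like these become unreliable when `context` grows to thousands of tokens; the toy scorer above would fail for the same reason real RMs do, since surface overlap says nothing about genuine consistency with a long history.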