MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang

2026-01-21

Summary

This paper studies how well we can automatically judge whether a large language model is managing its 'memory' effectively when it processes very long texts.

What's the problem?

Large language models are getting better at handling long texts, but a key part of this is managing information effectively as they go. We need a reliable way to automatically check *how well* these models remember and use information from earlier in the text, because manual checking is impractical at these lengths. Existing evaluation methods don't offer a systematic way to assess this 'memory quality'.

What's the solution?

The researchers created a new benchmark called MemoryRewardBench. This benchmark tests how well different 'reward models' – models used to automatically score the quality of another model's output – can evaluate a language model's memory management. They tested these reward models across 10 settings that cover different memory management patterns, on long-context tasks ranging from 8,000 to 128,000 tokens. They evaluated 13 reward models in total, including both open-source models and proprietary ones from companies like OpenAI. A rough sketch of this kind of evaluation loop appears below.
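
To make the setup concrete, here is a minimal sketch of the general pattern being evaluated: an LLM reads a long document segment by segment while maintaining a running memory, and a reward model scores each memory update. All function names, prompts, and the chunking scheme here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of segment-wise memory management plus reward-model scoring.
# call_llm and call_reward_model are placeholders for whatever model APIs you use.

def split_into_segments(text: str, segment_words: int = 4096) -> list[str]:
    """Naive whitespace chunking; a real setup would chunk by tokenizer tokens."""
    words = text.split()
    return [" ".join(words[i:i + segment_words]) for i in range(0, len(words), segment_words)]

def manage_memory(document: str, call_llm) -> list[dict]:
    """Process a long document segment by segment, updating a running memory."""
    memory = ""
    trace = []
    for segment in split_into_segments(document):
        memory = call_llm(
            f"Current memory:\n{memory}\n\nNew segment:\n{segment}\n\n"
            "Update the memory so it retains the information needed later."
        )
        trace.append({"segment": segment, "memory": memory})
    return trace

def score_memory_trace(trace: list[dict], call_reward_model) -> list[float]:
    """Ask a reward model to rate each memory update, e.g. on a 1-10 scale."""
    scores = []
    for step in trace:
        rating = call_reward_model(
            f"Segment:\n{step['segment']}\n\nUpdated memory:\n{step['memory']}\n\n"
            "Rate how well the updated memory preserves the information needed downstream."
        )
        scores.append(float(rating))
    return scores
```

The benchmark's question is then whether those reward-model scores track actual memory quality across many such traces and task settings.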

Why it matters?

This work is important because it shows us how well current automatic evaluation tools can actually assess a crucial ability of large language models – their ability to handle long contexts. It reveals that newer reward models are improving, but still have limitations, and provides a standard way to measure progress in this area. Ultimately, better evaluation of memory management will help us build more reliable and capable language models.

Abstract

Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment-wise manner, and effective memory management is one of the key capabilities that enable large language models to propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
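
One plausible way to quantify how well a reward model judges memory management is to give it pairs of traces where one is known to be better than the other and measure how often it prefers the better one. This is only a sketch of that idea, not necessarily the benchmark's exact protocol; the field names and `rm_score` callable are assumptions.

```python
def rm_pairwise_accuracy(pairs: list[dict], rm_score) -> float:
    """pairs: [{'better': trace_a, 'worse': trace_b}, ...]; rm_score maps a trace to a float.

    Returns the fraction of pairs where the reward model scores the
    labeled-better trace above the labeled-worse one.
    """
    if not pairs:
        return 0.0
    correct = sum(1 for p in pairs if rm_score(p["better"]) > rm_score(p["worse"]))
    return correct / len(pairs)
```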