
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen

2026-01-13

Summary

This paper introduces a new way to test how well AI agents, specifically large language models, can remember information over long stretches of time, especially while working on complex, multi-session projects.

What's the problem?

Current tests for AI memory focus on simple conversations or single tasks. They don't really challenge AI to remember details and track progress over a longer time, like when you're working on a multi-step project that takes days or weeks. Existing benchmarks don't accurately reflect how we actually *use* AI in real-world, ongoing projects where goals change and new information constantly comes in.

What's the solution?

The researchers created a new benchmark called RealMem. This benchmark uses over 2,000 simulated conversations across 11 different project scenarios, like planning a trip or writing a report. They built a system to automatically create these conversations, making them feel realistic and dynamic. This system simulates how projects evolve and how memory needs to be updated as things change. They then tested existing AI memory systems with RealMem to see how well they performed.
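To make the idea of cross-session memory evaluation concrete, here is a minimal sketch of what a benchmark harness like this might look like. All names here (`Session`, `MemoryStore`, `evaluate`) are illustrative assumptions, not RealMem's actual API; the real pipeline generates full natural-language dialogues rather than key-value facts.

```python
# Hypothetical sketch of a cross-session memory benchmark harness.
# Not RealMem's actual code: real sessions contain natural-language
# dialogue turns, and the memory system would be an LLM-based agent.
from dataclasses import dataclass, field

@dataclass
class Session:
    """One dialogue session within a longer project, reduced here to
    (fact_key, fact_value) pairs the agent should remember."""
    turns: list

@dataclass
class MemoryStore:
    """A naive stand-in memory system: keeps every stated fact,
    with the latest value overwriting earlier project state."""
    facts: dict = field(default_factory=dict)

    def update(self, key, value):
        self.facts[key] = value  # later sessions overwrite earlier state

    def recall(self, key):
        return self.facts.get(key)

def evaluate(sessions, queries):
    """Replay sessions in order, then score recall accuracy on
    queries about the project's final state."""
    memory = MemoryStore()
    for session in sessions:
        for key, value in session.turns:
            memory.update(key, value)
    correct = sum(memory.recall(k) == expected for k, expected in queries)
    return correct / len(queries)

# A toy project: the trip destination changes between sessions, so the
# memory system must track the *latest* state, not the first mention.
sessions = [
    Session(turns=[("destination", "Kyoto"), ("budget", "$2000")]),
    Session(turns=[("destination", "Osaka")]),  # goal evolves mid-project
]
queries = [("destination", "Osaka"), ("budget", "$2000")]
print(evaluate(sessions, queries))  # → 1.0 for this simple store
```

The interesting failures RealMem targets are exactly the cases this naive store gets wrong in practice: conflicting updates spread across many sessions, facts that depend on other facts, and queries phrased naturally rather than as exact lookups.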

Why it matters?

This work is important because it highlights that current AI memory systems aren't very good at handling the complexities of long-term projects. If we want AI to be truly helpful assistants that can manage ongoing tasks, we need to improve their ability to remember and adapt to changing information over extended periods. This benchmark provides a tool for researchers to develop and test better memory systems for AI agents.

Abstract

As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).