
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

2025-07-08


Summary

This paper introduces MemoryAgentBench, a new benchmark designed to test how well large language model agents remember, update, and use information across multiple conversation turns. It focuses on four core memory skills: retrieving accurate information, learning new information at test time, understanding long conversation histories, and resolving conflicts in stored data.

What's the problem?

Most existing tests for AI agents focus on reasoning and planning, but they do not measure memory well. Current benchmarks either cover only short contexts or evaluate static long documents, neither of which matches the interactive, multi-step memory demands of real agents that must accumulate and manage information over time.

What's the solution?

The researchers built MemoryAgentBench by combining restructured existing datasets with newly constructed ones so that all four memory skills are covered. The benchmark simulates multi-turn conversations in which models receive and process information incrementally rather than all at once. The authors evaluated a range of memory agents, from simple to more advanced systems, and found that current approaches still struggle with several of the memory tasks, highlighting where improvements are needed.
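To make the incremental setup concrete, here is a minimal sketch of what feeding information turn by turn and then querying an agent might look like. This is an illustration only: the class and function names (`ToyMemoryAgent`, `evaluate_incremental`) and the toy key-value memory are assumptions for clarity, not the actual MemoryAgentBench API.

```python
class ToyMemoryAgent:
    """Toy agent that stores facts turn by turn (hypothetical example)."""

    def __init__(self):
        self.memory = {}

    def observe(self, facts):
        # Resolve conflicts by recency: a fact stated in a later turn
        # overwrites an earlier, contradictory one.
        self.memory.update(facts)

    def answer(self, key):
        return self.memory.get(key)


def evaluate_incremental(agent, turns, questions):
    """Feed information one turn at a time, then score the agent's answers."""
    for facts in turns:  # information arrives incrementally, not all at once
        agent.observe(facts)
    correct = sum(agent.answer(k) == v for k, v in questions.items())
    return correct / len(questions)


turns = [
    {"meeting_day": "Tuesday", "project": "Atlas"},
    {"meeting_day": "Friday"},  # a later turn contradicts an earlier one
]
questions = {"meeting_day": "Friday", "project": "Atlas"}
score = evaluate_incremental(ToyMemoryAgent(), turns, questions)  # 1.0
```

The key point the benchmark tests is exactly what this sketch glosses over: a real LLM agent cannot simply keep a perfect dictionary, so it must decide what to store, how to update it, and how to retrieve it later.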

Why it matters?

This matters because good memory is essential for AI agents to handle real-world tasks that require remembering past interactions, learning on the go, and updating knowledge. MemoryAgentBench provides a clear way to measure and improve these memory capabilities in AI models.

Abstract

MemoryAgentBench is a new benchmark designed to evaluate four core competencies of memory agents in Large Language Models, highlighting the need for improved memory mechanisms.