HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li

2025-11-11

Summary

This paper is about the problem of 'memory hallucinations' in AI systems that use memory to learn and interact over time, such as large language models and AI agents. These hallucinations aren't like seeing things; they're fabrications, errors, conflicts, or omissions that creep in when the AI stores, updates, and recalls information.

What's the problem?

Currently, it's hard to pinpoint *where* these memory errors are happening within an AI's memory system. Existing tests just check if the final answer is right or wrong, but don't tell us if the problem occurred when the AI first saved the information, when it updated it, or when it tried to retrieve it. This makes it difficult to improve the system because you don't know what part needs fixing.

What's the solution?

The researchers created a new benchmark called HaluMem. Instead of testing only the final answer, it tests each step of the memory process separately: extracting (saving) information, updating it, and then answering questions based on it. To make the testing realistic, they also built two large datasets of human-AI conversations, HaluMem-Medium and HaluMem-Long, with around 15,000 memory points, 3,500 questions, and dialogues of 1,500 to 2,600 turns per user (over a million tokens of context). Using HaluMem, they found that errors often arise during the extraction and updating stages, and these errors then compound when the AI tries to answer questions.
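To make the idea of stage-by-stage scoring concrete, here is a minimal sketch of operation-level evaluation. The data, the `score_stage` helper, and the precision/recall scoring are illustrative assumptions, not the benchmark's actual implementation: the point is only that extraction and updating can each be graded against their own gold standard, so omissions (missed memory points) and fabrications (invented ones) are caught before the question-answering stage.

```python
# Hypothetical sketch of operation-level memory evaluation, in the spirit
# of HaluMem. All names and data here are invented for illustration.

def score_stage(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Score one memory operation against gold memory points.

    Low recall indicates omissions (gold points the system missed);
    low precision indicates fabrications (points the system invented).
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}

# Stage 1: memory extraction -- what the system saved vs. what it should have.
# "owns a cat" is a fabrication; "lives in Paris" is an error.
extraction = score_stage(
    predicted={"likes hiking", "lives in Paris", "owns a cat"},
    gold={"likes hiking", "lives in Berlin"},
)

# Stage 2: memory updating -- the user corrected the city mid-dialogue,
# but the system failed to apply the update.
updating = score_stage(
    predicted={"likes hiking", "lives in Paris"},
    gold={"likes hiking", "lives in Berlin"},
)

print(extraction)  # extraction errors surface here, not at answer time
print(updating)    # a stale memory point counts against the update stage
```

Because each stage is scored in isolation, a wrong final answer can be traced back to the operation that caused it rather than blamed on retrieval alone.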

Why it matters?

Understanding and fixing these memory hallucinations is crucial for building more reliable and trustworthy AI systems. If an AI consistently makes up facts or forgets important details, it can't be used effectively in real-world applications. This research points to the need for better ways to manage and control how AI systems store and retrieve information, making them more accurate and dependable.

Abstract

Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.