LMEB: Long-horizon Memory Embedding Benchmark
Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang
2026-03-16
Summary
This paper introduces a benchmark for testing how well computer programs retrieve remembered information over long time spans, especially when that information is fragmented, depends on context, and was stored long before it is needed.
What's the problem?
Current tests for text 'embeddings', the numeric representations that programs use to store and retrieve information, are too narrow. They mostly check whether a program can find relevant passages of text, but not whether it can recall things from far back, understand how pieces of information connect, or handle situations where the meaning depends on context and time. This means we don't really know how well these programs can act as a reliable 'memory' for more complex tasks.
What's the solution?
The researchers created a new benchmark called LMEB, which stands for Long-horizon Memory Embedding Benchmark. It includes 22 different datasets and 193 tasks designed to test a program's ability to remember different types of information – like personal experiences, conversations, general knowledge, and step-by-step instructions – across varying lengths of time and levels of detail. They then tested 15 popular embedding models on this benchmark.
Why does it matter?
This work is important because it shows that simply making a model bigger doesn't automatically make it better at long-term memory retrieval. It also suggests that scores on existing passage-retrieval tests do not reliably predict how well a model will perform when complex memory retrieval is required. By providing a more challenging and realistic test, LMEB can help researchers develop better 'memory' systems for computers, which could benefit areas like AI assistants and long-form content understanding.
Abstract
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
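To make the idea of an embedding-based memory retrieval task concrete, the sketch below scores one toy query against a small memory store. It is only an illustration under stated assumptions: the vectors are hand-written stand-ins for model outputs, the helper names (`rank_memories`, `recall_at_k`) are hypothetical, and recall@k is used as an example metric, not necessarily the one LMEB reports. It does not use the LMEB codebase.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_memories(query_vec, memory_vecs):
    """Return memory ids sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), mem_id) for mem_id, vec in memory_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem_id for _, mem_id in scored]


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant memories that appear in the top-k results."""
    hits = sum(1 for mem_id in ranked_ids[:k] if mem_id in relevant_ids)
    return hits / len(relevant_ids)


# Toy embeddings (assumed values; a real run would obtain these from an embedding model).
memory_vecs = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.0, 1.0, 0.0],
    "m3": [0.7, 0.7, 0.0],
    "m4": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]
relevant = {"m1", "m3"}  # assumed ground-truth memories for this query

ranked = rank_memories(query_vec, memory_vecs)
score = recall_at_k(ranked, relevant, k=2)
print(ranked, score)
```

In a zero-shot setting the embedding model is not fine-tuned on the task data, so a score like this reflects only what the model's off-the-shelf representations can retrieve.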