KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen
2026-01-14
Summary
This paper introduces a new way to test how well AI can understand people over long periods of time, going beyond simple question-and-answer interactions.
What's the problem?
Current benchmarks for long-term memory in AI typically rely on multi-turn conversations or artificially generated (synthetic) user histories, which aren't very good at showing whether the AI *actually* understands a person's motivations and how they think. These tests focus too much on recalling facts and not enough on explaining *why* someone did something or what they were feeling at a specific point in time.
What's the solution?
The researchers created a new dataset called KnowMeBench, built from detailed, real-life stories people have written about themselves. They turned each story into a timeline and asked the AI questions that required it to recall facts, infer the person's feelings at specific moments, and identify the underlying principles guiding their actions. They tested retrieval-based systems, which look up relevant passages before answering, and found that while retrieval helps with factual questions, it struggles with time-grounded context and deeper reasoning.
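To make the structure concrete, here is a minimal sketch (in Python) of what a time-anchored narrative event and an evidence-linked question might look like. All class names, fields, and example content below are illustrative assumptions for exposition, not the paper's released schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class QuestionType(Enum):
    FACTUAL_RECALL = "factual_recall"            # what happened, and when
    SUBJECTIVE_STATE = "subjective_state"        # how the person felt at that time
    PRINCIPLE_REASONING = "principle_reasoning"  # underlying motivations / decision principles

@dataclass
class TimelineEvent:
    """One time-anchored item in the reconstructed narrative stream."""
    event_id: str
    timestamp: str              # normalized time anchor, e.g. "2014-06"
    text: str                   # action, context, or inner thought from the narrative
    is_flashback: bool = False  # flagged so out-of-order recollections are not misplaced

@dataclass
class EvidenceLinkedQuestion:
    """A question whose answer must be supported by specific timeline events."""
    question: str
    answer: str
    qtype: QuestionType
    evidence_ids: List[str] = field(default_factory=list)  # event_ids that justify the answer

# Example item (content is invented purely for illustration)
events = [
    TimelineEvent("e1", "2014-06", "Quit a stable job to start a small bakery."),
    TimelineEvent("e2", "2014-07", "Wrote about feeling anxious but free."),
]
question = EvidenceLinkedQuestion(
    question="Why did the author leave their job in mid-2014?",
    answer="They valued independence over stability.",
    qtype=QuestionType.PRINCIPLE_REASONING,
    evidence_ids=["e1", "e2"],
)
```

Linking each answer to explicit event IDs is what lets the benchmark check whether a model's response is grounded in the right part of the timeline, rather than merely plausible.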
Why it matters?
This work is important because it highlights that simply being able to retrieve information isn't enough for AI to truly understand people. It shows that AI needs more sophisticated memory mechanisms that can track context, emotions, and motivations over time, which is crucial for building AI that interacts with us in a meaningful and helpful way.
Abstract
Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMeBench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMeBench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.