MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang

2025-10-08

Summary

This paper introduces a new way to test how well AI agents can remember and use information over long periods, especially for tasks that span different digital tools such as Slack, issue trackers like Linear, and Git repositories.

What's the problem?

Current tests for AI memory mostly focus on conversations, but real-world jobs require agents to track information from many sources over time. It's hard to build AI that can reliably handle this kind of complex, constantly changing information, especially when things are messy, contradictory, or require understanding code and files. Existing benchmarks don't accurately reflect these challenges.

What's the solution?

The researchers created a benchmark called MEMTRACK. It simulates a realistic work environment with events happening on platforms like Slack, Linear, and Git. These events are interleaved into a single chronological timeline and include noisy or conflicting information. The researchers then tested several advanced AI models, including GPT-5, on their ability to track and use this information correctly, measuring not just whether they get the right answer (Correctness) but also how efficiently they work (Efficiency) and how much they repeat themselves (Redundancy). They built the test scenarios both by hand and with the help of other AI agents, making the benchmark both realistic and scalable.
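To make this concrete, here is a minimal, hypothetical sketch of what one MEMTRACK-style instance might look like as data: a chronologically interleaved event timeline containing a conflicting update, plus a question graded against a ground-truth answer. The field names and example content are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a MEMTRACK-style benchmark instance.
# Field names (timestamp, platform, payload, ...) are illustrative assumptions,
# not the paper's actual data format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    timestamp: str   # ISO-8601 time the event occurred
    platform: str    # e.g. "slack", "linear", "git"
    payload: str     # message text, ticket update, or commit summary

@dataclass
class BenchmarkInstance:
    events: List[Event] = field(default_factory=list)  # chronologically interleaved timeline
    question: str = ""                                  # query the agent must answer at the end
    ground_truth: str = ""                              # reference answer for scoring Correctness

# Example: conflicting information spread across platforms.
instance = BenchmarkInstance(
    events=[
        Event("2025-03-01T09:00", "slack", "Team decides the release is scheduled for March 20."),
        Event("2025-03-05T14:30", "linear", "Ticket ENG-142: release milestone moved to March 27."),
        Event("2025-03-06T10:15", "git", "Commit a1b2c3: bump version to 2.4.0-rc1."),
    ],
    question="When is the 2.4.0 release currently scheduled?",
    ground_truth="March 27",  # the later Linear update supersedes the earlier Slack message
)

for event in instance.events:
    print(f"[{event.timestamp}] {event.platform}: {event.payload}")
```

An agent working through such a timeline has to notice that the later Linear update overrides the earlier Slack decision, which is exactly the kind of conflict resolution the benchmark is designed to probe.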

Why it matters?

This work is important because it shows that even the most powerful AI models still struggle with long-term memory and handling complex information in a work setting. It provides a better way to test and improve AI memory capabilities, moving beyond simple chat-based tests and paving the way for AI agents that can truly assist with real-world tasks in organizations.

Abstract

Recent works on context and memory benchmarking have primarily focused on conversational instances, but evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents, beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.
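The abstract names three metric axes but does not spell out their formulas here; as a rough, assumed illustration only, the sketch below treats each as a simple ratio over an agent's run. The function names and definitions are placeholders, not the paper's actual metric formulations.

```python
# Placeholder sketches of the three metric axes named in the abstract.
# These ratio definitions are assumptions for illustration; the paper's
# actual formulas for Correctness, Efficiency, and Redundancy may differ.

def correctness(num_correct: int, num_questions: int) -> float:
    """Fraction of benchmark questions the agent answered correctly."""
    return num_correct / num_questions

def efficiency(useful_steps: int, total_steps: int) -> float:
    """Share of agent steps that actually contributed to the final answer."""
    return useful_steps / total_steps

def redundancy(repeated_ops: int, total_ops: int) -> float:
    """Share of memory operations that duplicated earlier ones (lower is better)."""
    return repeated_ops / total_ops

# Example run: 60% correctness mirrors the GPT-5 figure reported in the abstract.
print(correctness(60, 100), efficiency(45, 80), redundancy(12, 80))
```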