MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen

2026-03-28

Summary

This paper introduces Memory Sparse Attention (MSA), a way to give AI models genuine long-term memory, letting them process and recall information over extremely long histories, up to 100 million tokens, roughly a lifetime of data.

What's the problem?

Current AI models, specifically large language models, struggle to remember and use information from very long texts or histories. Their effective context is typically capped at around one million tokens, and attempts to extend it usually become less accurate or much slower as the amount of information grows. Existing methods also make it hard for the model to modify what it remembers, or cannot be optimized end to end. This hinders tasks that require understanding vast amounts of information, like summarizing huge document collections or creating realistic simulations.

What's the solution?

The researchers developed a system called Memory Sparse Attention (MSA). Instead of attending to everything at once, MSA uses sparse attention to focus on the most relevant parts of long-term memory, keeping cost roughly linear as the memory grows. It also compresses the stored key-value cache so more information fits, and uses a technique called Memory Interleaving that lets the model jump between scattered pieces of information when reasoning. Importantly, the entire system can be trained end to end, leading to better performance.
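The core idea, attending to only the most relevant memory slots rather than the whole history, can be sketched in a few lines. The paper's actual selection mechanism is not detailed in this summary, so the following is a minimal illustrative top-k sparse attention in NumPy; the function name, the top-k heuristic, and the slot layout are assumptions, not the authors' implementation.

```python
import numpy as np

def topk_sparse_attention(query, memory_keys, memory_values, k=8):
    """Attend to only the k highest-scoring memory slots (illustrative sketch).

    Full attention scores the query against every memory token, so cost grows
    with memory size. Here we keep only the top-k slots, so the softmax and
    readout cost depend on k, not on how large the memory is.
    """
    d = query.shape[-1]
    scores = memory_keys @ query / np.sqrt(d)      # (num_slots,) relevance scores
    topk = np.argpartition(scores, -k)[-k:]        # indices of the k best slots
    sel = scores[topk]
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()                       # softmax over the selected slots only
    return weights @ memory_values[topk]           # (d,) weighted memory readout

rng = np.random.default_rng(0)
num_slots, d = 1024, 16
K = rng.normal(size=(num_slots, d))                # toy memory keys
V = rng.normal(size=(num_slots, d))                # toy memory values
q = rng.normal(size=d)                             # toy query
out = topk_sparse_attention(q, K, V, k=8)
print(out.shape)  # (16,)
```

Because only k slots enter the softmax, doubling the memory size leaves the attention cost for the readout unchanged, which is the intuition behind MSA's linear scaling claim.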

Why it matters?

This work is important because it overcomes a major limitation of current AI. By giving AI models truly long-term memory without sacrificing speed or accuracy, it opens the door to more complex and powerful applications, such as creating AI agents that can reason over long conversations, building detailed digital twins of real-world systems, and summarizing massive datasets effectively.

Abstract

Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.
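The abstract's "document-wise RoPE" is not specified in detail here; one plausible reading is that rotary position indices restart at each document boundary, so rotation angles stay bounded no matter how many documents the memory holds. Below is a minimal sketch under that assumption, using the standard RoPE frequency schedule; the helper names are illustrative, not from the paper.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    """Standard RoPE rotation angles for integer positions and head dim d."""
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    return positions[:, None] * inv_freq[None, :]  # (seq_len, d/2) angles

def document_wise_positions(doc_lengths):
    """Restart position indices at every document boundary.

    Keeping positions local to each document (one interpretation of
    'document-wise RoPE') bounds the angles regardless of how many
    documents are concatenated into memory.
    """
    return np.concatenate([np.arange(n) for n in doc_lengths])

# Three toy documents of lengths 3, 2, and 4 packed into one memory sequence.
pos = document_wise_positions([3, 2, 4])
print(pos)            # [0 1 2 0 1 0 1 2 3]
angles = rope_angles(pos, d=8)
print(angles.shape)   # (9, 4)
```

With global positions, token 100,000,000 would get a huge rotation index; with per-document positions, indices never exceed the longest single document, which is one way such a scheme could stay stable from 16K to 100M tokens.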