Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
2025-12-01
Summary
This research focuses on building AI systems with strong long-term memory, which the authors define as the ability to effectively process and understand extremely long pieces of information.
What's the problem?
Current AI models struggle with very long inputs because the way they process information, called 'attention,' becomes too slow and inefficient when dealing with huge amounts of text. It's like trying to remember everything that happened in a very long movie: it gets hard to keep track of all the details and how they relate to each other. Specifically, existing methods lack the ability to focus on only the important parts, to jump directly to relevant information, and to perform well on lengths they haven't specifically been trained on.
What's the solution?
The researchers developed a new attention mechanism called Hierarchical Sparse Attention, or HSA. This method allows the AI to focus on only the most important parts of the long input, access information randomly (like flipping to a specific page in a book), and handle inputs of varying lengths effectively. They then built a large AI model, HSA-UltraLong, using this new attention mechanism and trained it on a massive amount of text data. This model is designed to handle contexts up to 16 million tokens, which is a huge amount of information.
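The core idea of sparse attention can be illustrated with a toy sketch: instead of scoring a query against every token, the context is split into chunks, a cheap chunk-level score picks the top-k most relevant chunks, and full attention runs only inside those chunks. This is a minimal conceptual sketch, not the authors' HSA implementation; the chunk summarization (mean of keys) and the function and parameter names (`chunk_size`, `top_k`) are illustrative assumptions.

```python
import numpy as np

def sparse_chunk_attention(q, K, V, chunk_size=4, top_k=2):
    """Toy sketch of chunk-level sparse attention.

    The query first selects the top-k chunks via coarse chunk
    summaries, then attends only to tokens inside those chunks,
    so per-query cost scales with top_k * chunk_size, not with
    the full context length.
    """
    n, d = K.shape
    n_chunks = n // chunk_size
    # Coarse chunk summaries: mean of the keys in each chunk
    # (a simplifying assumption for this sketch).
    chunk_keys = K[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(axis=1)
    # Score chunks against the query and keep the top-k.
    chunk_scores = chunk_keys @ q
    selected = np.sort(np.argsort(chunk_scores)[-top_k:])
    # Gather the token indices of the selected chunks only.
    idx = np.concatenate(
        [np.arange(c * chunk_size, (c + 1) * chunk_size) for c in selected]
    )
    # Standard scaled dot-product attention, restricted to those tokens.
    scores = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))   # 16 context tokens, dim 8
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
out = sparse_chunk_attention(q, K, V)
print(out.shape)  # (8,)
```

Because each query touches only `top_k * chunk_size` tokens (8 of the 16 here), the cost per query stays constant as the context grows, which is what makes contexts in the millions of tokens tractable.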
Why it matters?
This work is important because it pushes the boundaries of what AI can remember and understand. Being able to process ultra-long contexts opens up possibilities for AI to tackle more complex tasks, like summarizing entire books, understanding lengthy legal documents, or having more coherent and in-depth conversations. It provides a foundation for future advancements in AI memory and reasoning capabilities.
Abstract
This work explores the challenge of building "Machines that Can Remember", framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: sparsity, random-access flexibility, and length generalization. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, an 8B-parameter MoE model trained on over 8 trillion tokens, and rigorously evaluate it on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.