
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

2025-02-18


Summary

This paper introduces Native Sparse Attention (NSA), a new way to help AI language models work efficiently with long pieces of text. It's like teaching a computer to read a long book by focusing on the most important parts instead of every single word.

What's the problem?

Current AI models struggle with long texts because standard attention compares every word with every other word, so the time and computer power needed grow rapidly (quadratically) as the text gets longer. This makes it hard for them to understand and work with long documents or conversations.
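To see where that cost comes from, here is a minimal NumPy sketch of standard scaled dot-product attention (not the paper's code): the intermediate score matrix has one entry per query-key pair, so it grows quadratically with sequence length.

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard scaled dot-product attention: every query attends to every key.

    The score matrix has shape (n, n), so time and memory grow
    quadratically with the sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(Q, K, V)
print(out.shape)  # (1024, 64); the intermediate score matrix was 1024 x 1024
```

Doubling the text length here quadruples the score matrix, which is exactly the scaling problem sparse attention tries to avoid.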

What's the solution?

The researchers created NSA, which works smarter, not harder. It uses a clever system that compresses some parts of the text into summaries, picks out the most important words, and pays special attention to nearby words. They also designed NSA to run efficiently on modern computer hardware and to be trained end-to-end from the start, rather than being bolted onto an already-trained model. This makes NSA much faster than older methods, especially when dealing with really long texts.
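The three ideas above (compression, selection, and nearby-word attention) can be sketched for a single query as below. This is a simplified NumPy illustration under stated assumptions, not the paper's implementation: the mean-pooled block summaries, the plain average of the three branch outputs, and the function name are all illustrative choices (NSA uses learned compression and a learned gate).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nsa_style_attention(q, K, V, block=8, top_k=2, window=16):
    """Illustrative single-query sketch of three sparse-attention branches:
    1. compression: attend to pooled block summaries (coarse global view),
    2. selection: attend fully inside the top-k scoring blocks (fine detail),
    3. sliding window: attend to the most recent tokens (local context).
    Branch outputs are averaged here; the paper learns a gate instead.
    """
    n, d = K.shape
    scale = 1.0 / np.sqrt(d)
    n_blocks = n // block

    # Branch 1: compressed block summaries (mean pooling as a stand-in
    # for the paper's learned compression).
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    w_c = softmax(Kc @ q * scale)
    out_c = w_c @ Vc

    # Branch 2: fine-grained attention inside the top-k scoring blocks.
    top = np.argsort(w_c)[-top_k:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in top])
    w_s = softmax(K[idx] @ q * scale)
    out_s = w_s @ V[idx]

    # Branch 3: sliding window over the most recent tokens.
    w_w = softmax(K[-window:] @ q * scale)
    out_w = w_w @ V[-window:]

    return (out_c + out_s + out_w) / 3.0

rng = np.random.default_rng(1)
n, d = 64, 16
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = nsa_style_attention(q, K, V)
print(out.shape)  # (16,)
```

Note that each branch only touches a small slice of the sequence (8 block summaries, 16 selected tokens, and a 16-token window here), which is why this style of attention stays cheap as the text grows.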

Why it matters?

This matters because it could make AI much better at tasks that involve long texts, like summarizing big documents, answering questions about long stories, or understanding complex conversations. NSA makes AI faster and more efficient without losing accuracy, which means we could use AI for more complex tasks without needing super powerful computers. This could lead to smarter digital assistants, better research tools, and more advanced AI systems in general.

Abstract

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.