
Trainable Dynamic Mask Sparse Attention

Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo

2025-08-04


Summary

This paper introduces Trainable Dynamic Mask Sparse Attention (DMA), an attention method that helps large language models handle very long texts more efficiently by focusing only on the important parts.

What's the problem?

Standard attention compares every part of the text with every other part, so the computing cost grows rapidly as the text gets longer. For very long documents this becomes a serious bottleneck in both speed and memory.

What's the solution?

DMA solves this by learning a dynamic mask that decides, based on both content and position, which parts of the text actually matter. The model then attends only to those parts and skips the unnecessary calculations, cutting the amount of work while keeping the important information, as the sketch below illustrates.
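To make the idea concrete, here is a minimal, simplified sketch of content-based sparse attention in PyTorch. It is not the paper's actual DMA mechanism or kernel: the scoring module (`mask_scorer`), the `keep_ratio` parameter, and the top-k selection rule are assumptions made for illustration, and a real implementation would skip the masked computation entirely rather than compute it and discard it.

```python
# Illustrative sketch only: a simplified single-head "dynamic mask" sparse attention.
# The real DMA scoring network and sparsity pattern are not reproduced here;
# `mask_scorer`, `keep_ratio`, and the top-k rule are assumptions for illustration.
import torch
import torch.nn.functional as F

def dynamic_mask_attention(q, k, v, mask_scorer, keep_ratio=0.25):
    """q, k, v: (seq_len, dim). mask_scorer: module that scores key importance."""
    seq_len, dim = q.shape
    scores = q @ k.T / dim ** 0.5                # (seq_len, seq_len) attention logits

    # Learned importance scores decide which keys are kept for attention.
    importance = mask_scorer(k).squeeze(-1)      # (seq_len,) one score per key
    k_keep = max(1, int(keep_ratio * seq_len))
    kept = importance.topk(k_keep).indices       # indices of keys judged important

    # Mask out everything else: excluded keys get ~zero attention weight.
    # (A real sparse kernel would avoid computing these positions at all.)
    mask = torch.full((seq_len,), float("-inf"))
    mask[kept] = 0.0
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v

# Usage: a tiny linear scorer stands in for whatever importance model is learned.
seq_len, dim = 128, 64
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))
scorer = torch.nn.Linear(dim, 1)
out = dynamic_mask_attention(q, k, v, scorer, keep_ratio=0.25)
print(out.shape)  # torch.Size([128, 64])
```

Because each query only ends up attending to the kept keys, the effective amount of attention work scales with the number of important positions rather than with the full text length, which is the intuition behind the efficiency gain.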

Why it matters?

This matters because it lets language models run faster and use less computing power when reading and understanding long documents, making them better suited for tasks like long-form writing, reasoning, and answering complex questions.

Abstract

A dynamic mask sparse attention mechanism, DMA, improves long-context modeling in large language models by reducing computational complexity while maintaining information fidelity.