SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun
2025-11-26
Summary
This paper tackles a problem with how large language models (LLMs) handle really long pieces of text. It introduces a new method called SSA (Sparse Sparse Attention) to make these models more efficient and accurate when dealing with extensive information.
What's the problem?
LLMs struggle with long texts because the way they normally pay 'attention' to different parts of the text gets extremely slow as the text gets longer: every word is compared with every other word, which takes a huge amount of computing power. Existing attempts to speed things up by having the model focus on only *some* of the words often hurt its performance. Surprisingly, even methods trained to mimic the full attention process end up with *less* sharply focused (less sparse) attention patterns than the full-attention models they try to approximate, which limits how well they can work.
What's the solution?
The researchers traced the problem to training: when certain words are left out of the sparse attention, they receive no gradient updates, so the model never properly learns to suppress them. To fix this, they created SSA, which trains the model with both a focused ('sparse') attention path *and* the full attention path at the same time, and aligns their outputs at every layer. This keeps gradients flowing to all words and pushes the sparse attention to match what full attention would produce, leading to better performance and a genuinely sparser attention pattern.
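To make the idea concrete, here is a minimal sketch of what joint sparse/full training with output alignment could look like. It is not the paper's implementation: the single-head layout, the per-query top-k selection rule, and the MSE alignment term are simplifying assumptions, and all names are illustrative.

```python
# Minimal sketch of joint sparse/full attention training with output alignment.
# Assumptions (not from the paper): single-head attention, per-query top-k key
# selection as the sparse rule, and an MSE term to align the two outputs.
import torch
import torch.nn.functional as F


def full_attention(q, k, v, causal_mask):
    # q, k, v: (batch, seq, dim); causal_mask: (seq, seq) bool, True = may attend
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


def sparse_attention(q, k, v, causal_mask, budget):
    # Keep only the `budget` highest-scoring keys per query (a stand-in for the
    # paper's sparse selection); all other key-value pairs are masked out.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    topk = scores.topk(min(budget, scores.shape[-1]), dim=-1).indices
    keep = torch.zeros_like(scores).scatter_(-1, topk, 1.0).bool()
    return F.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1) @ v


def layer_outputs_and_alignment(q, k, v, causal_mask, budget, align_weight=1.0):
    out_full = full_attention(q, k, v, causal_mask)    # gradients reach every token
    out_sparse = sparse_attention(q, k, v, causal_mask, budget)
    # Alignment term: pull the sparse output toward the full output (both branches
    # receive gradients, since neither side is detached in this sketch).
    align_loss = align_weight * F.mse_loss(out_sparse, out_full)
    return out_full, out_sparse, align_loss
```

In a full training loop one would sum the alignment terms over all layers and add them to the usual language-modeling loss; how the two branches and loss terms are weighted is a detail of the paper, not of this sketch.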
Why it matters?
This work is important because it allows LLMs to process long texts more efficiently without sacrificing accuracy. It also gives more flexibility in balancing computing power against performance at inference time, and it even improves length extrapolation, meaning the model can better handle texts longer than those it was trained on.
Abstract
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
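As an illustration of the compute-performance trade-off mentioned in the abstract, the sparse routine sketched earlier can simply be re-run with different token budgets at inference time; the shapes and budget values below are arbitrary, and `sparse_attention` is the hypothetical helper defined above, not the paper's code.

```python
# Illustration only: varying the per-query token budget at inference.
# Reuses the hypothetical `sparse_attention` from the sketch above.
import torch

torch.manual_seed(0)
batch, seq, dim = 1, 1024, 64
q, k, v = (torch.randn(batch, seq, dim) for _ in range(3))
causal_mask = torch.ones(seq, seq, dtype=torch.bool).tril()

for budget in (64, 128, 256, 512):
    out = sparse_attention(q, k, v, causal_mask, budget)
    # A model trained with SSA is reported to improve smoothly as the budget grows.
    print(f"budget={budget}, output shape={tuple(out.shape)}")
```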