Prism: Spectral-Aware Block-Sparse Attention
Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu
2026-02-11
Summary
This paper focuses on making large language models faster when they first process long pieces of text (the pre-filling stage). It introduces a new method called Prism that helps the model quickly identify which parts of the text are most important to pay attention to.
What's the problem?
Large language models struggle with long texts because attention compares every word with every other word, which is computationally expensive. A common way to speed this up is 'block-sparse attention,' where the model only attends to selected 'blocks' of text. However, deciding which blocks matter is itself slow: current methods approximate block importance with coarse-grained scores, but they often fall back on searching or scoring individual words inside the blocks, which adds significant overhead. The core issue is that the standard way of estimating block importance, averaging the words in each block via 'mean pooling', interferes with how the model encodes word positions, creating a 'blind spot' for important local details.
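To make this concrete, here is a minimal sketch of the coarse-grained approach the paper critiques: mean-pool each block of queries and keys, score block pairs with a single dot product, and keep only the top-scoring key blocks. The block size, top-k rule, and tensor shapes are illustrative assumptions, not taken from any specific method.

```python
# Minimal sketch of block importance estimation via mean pooling (illustrative only).
import torch

def estimate_block_importance(q, k, block_size=64, top_k=8):
    """Score key blocks for each query block using mean-pooled representations.

    q, k: [seq_len, head_dim] queries / keys for a single attention head.
    Returns the indices of the top_k key blocks per query block.
    """
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size

    # Mean-pool tokens inside each block -> one representative vector per block.
    q_blocks = q[: n_blocks * block_size].view(n_blocks, block_size, head_dim).mean(dim=1)
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, head_dim).mean(dim=1)

    # Coarse block-level attention: one score per (query block, key block) pair.
    scores = q_blocks @ k_blocks.T / head_dim ** 0.5          # [n_blocks, n_blocks]

    # Keep only the highest-scoring key blocks; full attention is later computed
    # only inside the selected blocks.
    return scores.topk(top_k, dim=-1).indices

selected = estimate_block_importance(torch.randn(1024, 128), torch.randn(1024, 128))
print(selected.shape)  # [16, 8] -> 8 selected key blocks per query block
```

The blind spot described above arises inside the two `.mean(dim=1)` calls: averaging RoPE-rotated vectors within a block washes out the fast-rotating dimensions that carry fine-grained positional information.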
What's the solution?
The researchers traced the problem to how mean pooling interacts with the model's rotary positional encoding (RoPE). To fix it, they developed Prism, a technique that requires no additional training. Prism splits block selection into two branches: a low-frequency branch that captures broad, general information and a high-frequency branch that captures precise positional details. It then uses an energy-based 'temperature calibration' to restore the positional signal that mean pooling attenuates, allowing the model to assess block importance accurately using only block-level operations, without examining individual words. This makes the selection step much faster.
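The sketch below illustrates the general shape of such a two-branch, spectral-aware block scorer. The split point between low- and high-frequency dimensions, the energy ratio used as a temperature, and the way the two branches are merged are assumptions made for illustration; they are not Prism's exact formulas.

```python
# Illustrative two-branch block scorer: low-frequency dims are scored directly from
# pooled vectors, high-frequency dims are rescaled by an energy-based temperature.
import torch

def spectral_block_scores(q, k, block_size=64, split=64):
    """q, k: [seq_len, head_dim] RoPE-rotated queries / keys for one head.
    `split` separates slowly rotating (low-frequency) dims, which survive mean
    pooling, from fast-rotating (high-frequency) dims, which it attenuates."""
    seq_len, head_dim = q.shape
    n_blocks = seq_len // block_size
    qb = q[: n_blocks * block_size].view(n_blocks, block_size, head_dim)
    kb = k[: n_blocks * block_size].view(n_blocks, block_size, head_dim)

    q_pool, k_pool = qb.mean(dim=1), kb.mean(dim=1)            # [n_blocks, head_dim]

    # Low-frequency branch: pooled vectors still carry the coarse signal.
    low = q_pool[:, :split] @ k_pool[:, :split].T

    # High-frequency branch: pooling shrinks these dims, so rescale them by the
    # ratio of per-token energy to pooled energy (a stand-in for the paper's
    # energy-based temperature calibration).
    tok_energy = kb[:, :, split:].pow(2).mean(dim=(1, 2))       # [n_blocks]
    pool_energy = k_pool[:, split:].pow(2).mean(dim=1) + 1e-6   # [n_blocks]
    temperature = (tok_energy / pool_energy).sqrt()             # >= 1 when attenuated
    high = (q_pool[:, split:] @ k_pool[:, split:].T) * temperature  # broadcast over key blocks

    return (low + high) / head_dim ** 0.5                       # [n_blocks, n_blocks]

scores = spectral_block_scores(torch.randn(1024, 128), torch.randn(1024, 128))
```

Note that everything here operates on pooled, block-level tensors; no token-level searching or scoring is needed, which is where the efficiency gain over existing selection schemes comes from.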
Why it matters?
This work is important because it significantly speeds up the initial processing of long texts for large language models. By making this step more efficient, the model can start generating text much faster, and the researchers demonstrated speedups of up to 5.1 times while maintaining the same level of accuracy as if the model had considered all the words. This could lead to more responsive and practical applications of large language models.
Abstract
Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to 5.1× speedup.
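As a quick numerical check of the abstract's low-pass-filter claim, the snippet below mean-pools RoPE rotations over a block and measures how much of each frequency pair survives. The block size and head dimension are arbitrary demo values; the frequency schedule follows the standard RoPE formula.

```python
# Mean pooling as a low-pass filter on RoPE: averaging a block of RoPE-rotated
# unit vectors nearly cancels fast-rotating (high-frequency) dimension pairs
# while leaving slow-rotating (low-frequency) pairs almost intact.
import torch

head_dim, block_size = 128, 64
inv_freq = 10000.0 ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
pos = torch.arange(block_size, dtype=torch.float32)
angles = pos[:, None] * inv_freq[None, :]        # [block_size, head_dim // 2]

# Rotate a fixed unit vector by RoPE at each position, then mean-pool the block.
cos, sin = angles.cos(), angles.sin()
pooled_magnitude = (cos.mean(dim=0) ** 2 + sin.mean(dim=0) ** 2).sqrt()

print(pooled_magnitude[:4])    # fast-rotating (high-frequency) pairs: near 0
print(pooled_magnitude[-4:])   # slow-rotating (low-frequency) pairs: near 1
```

The near-zero magnitudes in the fast-rotating pairs are the destructive interference the paper refers to: after pooling, those dimensions contribute almost nothing to block scores, which is why local patterns become invisible to standard coarse-grained selection.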