Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu

2025-10-24

Summary

This paper introduces a new method called Adamas to make large language models faster and more efficient when dealing with very long pieces of text.

What's the problem?

Large language models are getting better at processing huge amounts of text, which is great for things like summarizing long documents or having extended conversations. However, a core part of these models called 'self-attention' has a cost that grows with the square of the text length, so generating each new word becomes very slow and expensive as the context gets longer. Existing attempts to speed this up by attending to only parts of the text rely on rough heuristic patterns that often miss important information, hurting accuracy.

What's the solution?

The researchers developed Adamas, a technique that compresses the queries and keys used by attention into compact 2-bit codes: it first applies a Hadamard transform to spread information evenly across dimensions, then sorts each value into one of a few buckets. To decide which parts of the text a query should focus on, it estimates similarity with a cheap Manhattan-distance computation over these codes and keeps only the top-k most relevant tokens. This lets the model process long texts much faster without sacrificing accuracy.
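The three steps above (Hadamard transform, bucketization into 2-bit codes, Manhattan-distance top-k) can be sketched in a few lines. This is a minimal illustration under assumed details, not the paper's implementation: the bucket boundaries, function names, and use of NumPy are all choices made here for clarity.

```python
import numpy as np

def hadamard_transform(x):
    """Normalized Walsh-Hadamard transform along the last axis.
    The last dimension must be a power of two."""
    d = x.shape[-1]
    h = x.astype(np.float64).copy()
    step = 1
    while step < d:
        # Butterfly stage: combine pairs of blocks of width `step`.
        for i in range(0, d, 2 * step):
            a = h[..., i:i + step].copy()
            b = h[..., i + step:i + 2 * step].copy()
            h[..., i:i + step] = a + b
            h[..., i + step:i + 2 * step] = a - b
        step *= 2
    return h / np.sqrt(d)

def bucketize_2bit(x, boundaries=(-0.5, 0.0, 0.5)):
    """Map each coordinate to one of four buckets, i.e. a 2-bit code.
    These boundaries are illustrative, not the paper's."""
    return np.digitize(x, boundaries)  # integer codes in {0, 1, 2, 3}

def topk_keys(q_code, key_codes, k):
    """Estimate query-key similarity via Manhattan distance between
    2-bit codes and return the indices of the k closest keys."""
    dists = np.abs(key_codes - q_code).sum(axis=-1)
    return np.argsort(dists)[:k]
```

A query then attends only over the returned indices, shrinking the attention computation from the full context down to a small token budget.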

Why it matters?

This work is important because it addresses a major bottleneck in large language models. By making these models faster and more efficient, Adamas opens the door to more practical applications of long-context processing, like analyzing entire books or complex codebases, and allows for more responsive and interactive AI systems.

Abstract

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
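Once the top-k indices are selected (e.g. a 64- or 128-token budget, per the results above), attention is computed only over those keys and values rather than the whole context. A minimal sketch of that final step, assuming dense NumPy arrays and a single query vector (not the paper's optimized kernel):

```python
import numpy as np

def sparse_attention(q, K, V, selected):
    """Softmax attention for one query restricted to the key/value rows
    named by `selected` (the sparse token budget)."""
    Ks, Vs = K[selected], V[selected]
    scores = Ks @ q / np.sqrt(q.shape[-1])   # scaled dot-product scores
    w = np.exp(scores - scores.max())        # numerically stable softmax
    w /= w.sum()
    return w @ Vs                            # weighted sum of selected values
```

With a 64-token budget over a 32K-token context, each decoding step computes 64 score/value operations instead of 32,768, which is where the reported self-attention speedup comes from.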