TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen
2026-04-07
Summary
This paper addresses the problem of large language models running out of memory when processing long pieces of text, focusing on the 'KV cache', which stores the context information the model needs for reasoning. The authors introduce a new method called TriAttention that makes these models more efficient without sacrificing accuracy.
What's the problem?
Large language models need to store information about previous words in a conversation or document to understand the context – this is done using something called the KV cache. When dealing with very long texts, this cache can become huge, quickly using up all available memory. Current methods for compressing this cache use attention scores from recent queries to figure out which parts of the text are most important, but these scores become unreliable because of how the model encodes position information (a technique called RoPE): queries are rotated differently at each position, so recent queries don't represent future ones well. This unreliability leads the model to discard the wrong entries and make mistakes when reasoning through long texts.
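To make the position problem concrete, here is a minimal NumPy sketch of the standard RoPE rotation (not code from the paper). It shows that the same pre-RoPE query produces different attention scores against the same cached key depending on where the query sits in the sequence, which is why scores from a few recent queries can misrank keys:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Standard RoPE: rotate each consecutive pair of features
    # by pos * theta_i, with per-pair frequencies theta_i.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)   # one pre-RoPE query, head_dim = 8
k = rng.normal(size=8)   # one pre-RoPE key, cached at position 0
k0 = rope_rotate(k, pos=0)

# Same query, same key -- but the post-RoPE score shifts as the
# query's position changes:
scores = [rope_rotate(q, pos=p) @ k0 for p in (10, 100, 1000)]
```

A useful sanity check is RoPE's relative-position property: the score between a rotated query and a rotated key depends only on their distance, which is the fact the paper exploits in the pre-RoPE space.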
What's the solution?
The researchers noticed that, before position information is applied, the internal vectors the model uses to relate words to each other (the Q and K vectors) cluster tightly around fixed non-zero centers and stay stable across positions. This clustering means the model naturally prefers keys at specific distances from the query, and which distances are preferred is determined by the centers through a trigonometric series. TriAttention uses this insight to identify important words by scoring keys according to their positions, and additionally uses the strength of the signals (norms) from the Q and K vectors as an importance estimate. Essentially, it's a smarter way to decide what to keep in the KV cache.
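The idea can be sketched as follows. This is a hedged illustration, not the paper's actual algorithm: `distance_preference` computes the center-to-center score, which reduces to a trigonometric series in the relative distance, and `score_keys` is a hypothetical way to combine that preference with key norms.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Standard RoPE rotation of consecutive feature pairs.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def distance_preference(mu_q, mu_k, delta):
    # For a query at position t and a key at position t - delta,
    # <R(t) mu_q, R(t - delta) mu_k> = <mu_q, R(-delta) mu_k>:
    # a trigonometric series in delta, independent of absolute position.
    return mu_q @ rope_rotate(mu_k, -delta)

def score_keys(mu_q, mu_k, key_norms, q_pos, key_positions):
    # Hypothetical combination: position-based distance preference,
    # modulated by each key's norm as an extra importance signal.
    pref = np.array([distance_preference(mu_q, mu_k, q_pos - p)
                     for p in key_positions])
    return pref * np.asarray(key_norms)
```

Under a fixed cache budget, one would then keep the top-scoring keys, e.g. `keep = np.argsort(scores)[-budget:]`. The key design point is that the scores depend only on the centers, norms, and positions, never on the position-dependent post-RoPE queries.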
Why it matters?
TriAttention allows large language models to handle much longer texts without needing more powerful (and expensive) hardware. In their tests, it matched the accuracy of the full, uncompressed model while using significantly less memory or processing information much faster. This is a big step towards making these powerful models more accessible and practical for real-world applications like analyzing long documents or having extended conversations.
Abstract
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position under RoPE, so few recent queries remain representative, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- a property we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., the nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers: via the trigonometric series, it uses the distance preference characterized by the centers to score keys according to their positions, and also leverages Q/K norms as an additional importance signal. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory failures with Full Attention.