Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache
Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
2025-06-16
Summary
This paper introduces FourierAttention, a training-free method that makes large language models (LLMs) more memory-efficient. It compresses parts of the model's key-value (KV) cache, which stores information from the context, by projecting them onto orthogonal Fourier bases, keeping important long-range information intact while saving space.
What's the problem?
The problem is that when LLMs process long texts, memory use grows quickly because the KV cache storing context information becomes very large. Existing methods either treat all parts of this cache uniformly or evict entries based on attention scores, which leads to slower processing or less accurate results because they do not distinguish which parts of the cache actually matter for long-context understanding.
What's the solution?
The key observation is that different transformer head dimensions handle context differently: some attend to short-range, local context, while others capture long-range dependencies. FourierAttention compresses the long-context-insensitive dimensions by projecting them onto orthogonal Fourier bases and storing only a fixed number of spectral coefficients, while keeping the long-range-sensitive dimensions unchanged. This shrinks the cache to a fixed size with little loss of accuracy. The authors also built a custom kernel to run this compressed attention efficiently without degrading performance.
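To make the core idea concrete, here is a minimal NumPy sketch of the kind of fixed-size Fourier compression described above: the KV entries along the sequence axis are projected onto the first few vectors of an orthonormal cosine (DCT-II-style) basis, and approximately reconstructed when needed. This is an illustrative assumption of the mechanism, not the paper's actual implementation; the exact basis, which dimensions are compressed, and the fused kernel differ in FourierAttention itself.

```python
import numpy as np

def fourier_basis(n, m):
    """First m orthonormal cosine basis vectors over a length-n sequence.

    Rows are orthonormal (B @ B.T == I_m), so projection and
    reconstruction are simple matrix products.
    """
    t = (np.arange(n) + 0.5) / n                      # sample points in (0, 1)
    B = np.cos(np.pi * np.outer(np.arange(m), t))     # (m, n) cosine rows
    B[0] /= np.sqrt(2.0)                              # DC row normalization
    return B * np.sqrt(2.0 / n)

def compress(kv, m):
    """Project a (seq_len, head_dim) cache slice onto m Fourier coefficients.

    The result has shape (m, head_dim): a fixed-size summary that no
    longer grows with sequence length.
    """
    B = fourier_basis(kv.shape[0], m)
    return B @ kv

def reconstruct(coeffs, n):
    """Approximately recover the (n, head_dim) cache from its coefficients."""
    B = fourier_basis(n, coeffs.shape[0])
    return B.T @ coeffs
```

With m equal to the sequence length the reconstruction is exact (the basis is complete); with m much smaller, smooth, slowly varying cache dimensions are recovered well while memory shrinks from O(seq_len) to O(m) per compressed dimension, which is the trade-off the method exploits.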
Why it matters?
This matters because it lets large AI models run faster and handle longer texts without requiring large amounts of extra memory, making them more practical for real-world applications. It preserves the model's accuracy and long-context understanding while reducing resource costs, improving both AI accessibility and performance.
Abstract
FourierAttention is a training-free framework that enhances memory efficiency in Large Language Models by compressing long-context-insensitive transformer head dimensions using orthogonal Fourier bases, while maintaining accuracy.