
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot

2025-03-05


Summary

This paper introduces Q-Filters, a new method that makes AI language models run faster and use less memory without losing their ability to understand and generate text.

What's the problem?

As AI language models grow larger and handle longer pieces of text, they must store a lot of information in memory about what they have already processed. This slows them down and limits how much text they can handle at once.

What's the solution?

The researchers created Q-Filters, which examines the geometry of the model's internal query and key vectors to estimate which parts of the stored information matter most, keeping those and discarding the rest. This method requires no special training and works well with existing fast attention techniques such as FlashAttention.
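A minimal sketch of this idea: score each cached Key by its projection onto a single fixed direction and keep only the top-scoring Key-Value pairs. The function and variable names below (`compress_kv_cache`, `filter_dir`) are illustrative assumptions; in the paper the filter direction is derived from the geometry of Query/Key vectors, which is simplified away here.

```python
import numpy as np

def compress_kv_cache(keys, values, filter_dir, keep_ratio=0.5):
    """Keep the Key-Value pairs whose Keys score highest when
    projected onto a fixed, context-agnostic filter direction.
    This is a simplified stand-in for a Q-Filter, not the paper's
    exact construction."""
    # Relevance score: projection of each cached Key onto the filter.
    scores = keys @ filter_dir            # shape: (num_tokens,)
    k = max(1, int(len(keys) * keep_ratio))
    # Indices of the k highest-scoring tokens, restored to cache order.
    kept = np.sort(np.argsort(scores)[-k:])
    return keys[kept], values[kept]

# Toy cache of 8 tokens with 4-dimensional heads.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))
values = rng.normal(size=(8, 4))
filter_dir = rng.normal(size=4)

ck, cv = compress_kv_cache(keys, values, filter_dir, keep_ratio=0.5)
print(ck.shape, cv.shape)  # (4, 4) (4, 4)
```

Because the score needs only a dot product per cached Key, the cache can be pruned without ever materializing the attention map, which is what keeps the approach compatible with fused kernels like FlashAttention.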

Why it matters?

This matters because it could make AI language models much more efficient, allowing them to handle longer texts and more complex tasks without needing more powerful computers. In tests, Q-Filters compressed the stored information 32 times while remaining 99% accurate on a demanding retrieval task. This could lead to faster, more capable AI assistants and tools that can work with very long documents or conversations.

Abstract

Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrary to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.