CommVQ: Commutative Vector Quantization for KV Cache Compression
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
2025-06-24
Summary
This paper introduces CommVQ, a new method that reduces the memory needed during large language model (LLM) inference by compressing the key-value (KV) cache, the store of intermediate attention states the model reuses while generating text.
What's the problem?
The KV cache grows with the length of the input, so when LLMs process long texts it can become very large and exceed GPU memory, making long-context inference slow, expensive, or impossible on a single device.
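To get a feel for the scale, here is a rough back-of-the-envelope estimate; the model configuration (32 layers, 8 KV heads, head dimension 128, FP16) is an illustrative assumption for the example, not a figure from the paper.

```python
# Rough KV cache size for an illustrative LLaMA-style model
# (all configuration values are assumptions for this example).
num_layers = 32        # transformer layers
num_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128         # dimension per head
bytes_per_elem = 2     # FP16
seq_len = 128_000      # long-context sequence length

# Factor of 2 accounts for storing both keys and values.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB per sequence")  # ~16.8 GB
```

Even with grouped-query attention, a single 128K-token sequence under these assumptions needs well over 10 GB of cache on top of the model weights.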
What's the solution?
The researchers compress the KV cache with additive quantization, encoding whole key and value vectors as sums of learned codebook entries rather than quantizing each number separately. They also design the codebook to commute with Rotary Position Embedding (RoPE), the model's positional encoding, so that decoding can be folded into the attention computation, keeping it fast and accurate.
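The sketch below illustrates only the additive-quantization idea: a key vector is stored as a handful of small codebook indices instead of full-precision numbers. The random codebooks and greedy residual-style encoding are a simplified stand-in for the learned encoder and the RoPE-commutative codebooks the paper describes, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

head_dim = 128       # dimension of one key/value head (illustrative)
num_codebooks = 4    # number of codebooks whose entries are summed (illustrative)
codebook_size = 256  # entries per codebook -> one byte of index per codebook

# Random codebooks stand in for the learned ones; in the paper they are trained,
# and the key codebooks are additionally constrained to commute with RoPE.
codebooks = rng.standard_normal((num_codebooks, codebook_size, head_dim))

def encode(x):
    """Greedy residual encoding: pick one codeword per codebook so that their
    sum approximates x. The returned indices are the compressed representation."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # nearest codeword to the current residual (Euclidean distance)
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def decode(indices):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

key = rng.standard_normal(head_dim)
codes = encode(key)        # 4 small integers instead of 128 FP16 values
approx = decode(codes)
print("relative error:", np.linalg.norm(key - approx) / np.linalg.norm(key))
```

With random codebooks the reconstruction error is large; the point is the storage layout, where each cached vector becomes a few integer indices, and the fact that the reconstruction is a linear combination of codewords, which is what makes integrating RoPE into decoding attractive.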
Why it matters?
This matters because it lets LLMs handle much longer contexts within a fixed memory budget, making long-context inference practical on commodity GPUs and lowering the cost of serving advanced models.
Abstract
Commutative Vector Quantization (CommVQ) reduces memory usage in long-context LLM inference by compressing the KV cache with additive quantization, using a codebook designed to commute with Rotary Position Embedding (RoPE) so that decoding integrates efficiently with attention.