
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu

2025-02-03


Summary

This paper introduces a new method called ChunkKV that helps large AI language models handle long texts while using less computer memory. It does this by grouping neighboring words into chunks and keeping only the most important chunks.

What's the problem?

Big AI language models need a lot of computer memory to handle long texts, which makes them slow and expensive to use. Current methods try to save memory by compressing information about individual words, but this can make the AI miss important connections between words.

What's the solution?

The researchers created ChunkKV, which groups words into chunks instead of looking at them one by one. It then figures out which chunks matter most for understanding the text and keeps only those, which helps the AI preserve the meaning of the text. They also noticed that different layers of the model tend to pick similar chunks, so the choice of which chunks to keep can be computed once and reused across layers, which makes it even faster (a rough sketch of the chunk-selection idea follows below).
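
To make the chunk idea concrete, here is a minimal, hypothetical sketch in PyTorch of how chunk-based KV cache compression could work. It is not the authors' code: the function name `compress_kv_by_chunks`, the choice of scoring chunks by the attention they receive from the most recent query tokens, and the default chunk size and keep ratio are all assumptions made for illustration.

```python
# Hypothetical sketch of chunk-based KV cache compression (not the paper's code).
# Fixed-size chunks of tokens are scored by the attention mass they receive from
# the most recent query tokens, and only the top-scoring chunks are kept.

import torch


def compress_kv_by_chunks(keys, values, attn_weights, chunk_size=10, keep_ratio=0.3):
    """
    keys, values: [num_tokens, head_dim]  -- KV cache for one head (assumed layout)
    attn_weights: [num_recent_queries, num_tokens] -- attention from recent queries
    Returns the retained keys/values and the indices of the kept tokens.
    """
    num_tokens = keys.shape[0]

    # 1. Score each token by the attention it receives from the recent queries.
    token_scores = attn_weights.sum(dim=0)  # [num_tokens]

    # 2. Aggregate token scores into chunk scores (mean over each chunk).
    num_chunks = (num_tokens + chunk_size - 1) // chunk_size
    pad = num_chunks * chunk_size - num_tokens
    padded = torch.cat([token_scores, token_scores.new_zeros(pad)])
    chunk_scores = padded.view(num_chunks, chunk_size).mean(dim=1)

    # 3. Keep the highest-scoring chunks as whole units.
    num_keep = max(1, int(num_chunks * keep_ratio))
    top_chunks = torch.topk(chunk_scores, num_keep).indices

    # 4. Expand the kept chunk indices back to token indices (clipped to cache length).
    kept = []
    for c in top_chunks.sort().values.tolist():
        start = c * chunk_size
        kept.extend(range(start, min(start + chunk_size, num_tokens)))
    kept = torch.tensor(kept, dtype=torch.long)

    return keys[kept], values[kept], kept


# Tiny usage example with random tensors.
if __name__ == "__main__":
    torch.manual_seed(0)
    K = torch.randn(128, 64)
    V = torch.randn(128, 64)
    A = torch.rand(8, 128)  # attention from the last 8 query tokens
    k, v, idx = compress_kv_by_chunks(K, V, A)
    print(k.shape, v.shape, idx[:12])
```

The key point is that whole chunks are kept or dropped together, so words that only make sense next to their neighbors stay together instead of being pruned one by one.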

Why it matters?

This matters because it could make powerful AI language models work much better with long texts, like books or long conversations, without needing super expensive computers. In tests, ChunkKV worked up to 10% better than other methods, even when it was compressing the information a lot. This could help make advanced AI more accessible and useful for things like summarizing long documents, answering complex questions, or having in-depth conversations.

Abstract

To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens that characterizes real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit, retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs achieve up to a 10% performance improvement under aggressive compression ratios compared to existing methods.
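
As a follow-up to the sketch above, here is a small, hypothetical illustration of the layer-wise index reuse idea mentioned in the abstract: recompute which chunks to keep only every few layers and share those indices with the layers in between. It reuses the `compress_kv_by_chunks` helper from the earlier sketch, and the reuse interval `reuse_every=2` is an arbitrary assumption, not a value from the paper.

```python
# Hypothetical sketch of layer-wise index reuse (builds on compress_kv_by_chunks above).

def compress_all_layers(kv_per_layer, attn_per_layer, reuse_every=2,
                        chunk_size=10, keep_ratio=0.3):
    """Compress every layer's KV cache, recomputing the kept indices only
    every `reuse_every` layers and reusing them for the layers in between.

    kv_per_layer:   list of (keys, values) tensor pairs, one per layer
    attn_per_layer: list of attention-weight tensors, one per layer
    """
    compressed, cached_idx = [], None
    for layer, ((K, V), A) in enumerate(zip(kv_per_layer, attn_per_layer)):
        if cached_idx is None or layer % reuse_every == 0:
            # Recompute the chunk selection at this layer
            # (uses the hypothetical helper sketched earlier).
            _, _, cached_idx = compress_kv_by_chunks(K, V, A, chunk_size, keep_ratio)
        # Reuse the cached token indices for this layer's cache.
        compressed.append((K[cached_idx], V[cached_idx]))
    return compressed
```

Because the kept indices are computed at only a fraction of the layers, the scoring overhead shrinks, while the observation that layers tend to prefer similar chunks keeps the accuracy loss small.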