
Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

2025-02-05


Summary

This paper explores how compressing the KV cache, the memory structure in which large language models (LLMs) store attention keys and values for previously processed tokens, affects their ability to perform tasks like reasoning, generating code, and understanding long texts. It introduces a new method called ShotKV that improves performance under compression.
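To make the idea concrete, here is a minimal sketch of what a KV cache is and what compressing it means. The eviction rule shown (keeping the tokens with the highest attention-style scores under a fixed budget) is a generic illustration of cache compression, not the paper's specific method; all names and sizes are assumptions.

```python
import numpy as np

def attend(q, K, V):
    """Single-query softmax attention over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
# The KV cache holds one (key, value) pair per past token;
# it grows linearly as the model generates, which is the memory cost.
K_cache = rng.standard_normal((100, d))   # keys for 100 past tokens
V_cache = rng.standard_normal((100, d))

# Compression: keep only a budget of cache entries, here the ones the
# current query attends to most, trading memory for possible accuracy loss.
q = rng.standard_normal(d)
keep = np.argsort(np.abs(K_cache @ q))[-32:]      # budget of 32 tokens
out_full = attend(q, K_cache, V_cache)
out_comp = attend(q, K_cache[keep], V_cache[keep])
```

The compressed cache answers the same attention query with 32 entries instead of 100; the paper studies how much this kind of reduction hurts different downstream abilities.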

What's the problem?

LLMs require a lot of memory to store the KV cache while processing long inputs, but compressing this cache can hurt performance on certain tasks, with arithmetic reasoning being especially sensitive. Current compression methods often cause significant drops in accuracy on these tasks.

What's the solution?

The researchers evaluated existing compression methods across many task types and found that some models tolerate compression much better than others. They then developed ShotKV, a new approach that handles the cache differently during the prefill phase (processing the prompt) and the decoding phase (generating output), and that preserves few-shot examples ("shots") as coherent units, maintaining accuracy even under aggressive compression.
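The shot-level idea described above can be sketched as follows: instead of evicting individual prompt tokens, keep or drop each few-shot example as a whole, so every retained example stays semantically intact. This is a hedged illustration, assuming a per-token importance score is already available; the function name, span format, and scoring rule are assumptions for the sketch, not the authors' implementation.

```python
def shotkv_prefill_keep(shot_spans, token_scores, budget):
    """Keep whole few-shot examples ("shots") under a token budget.

    shot_spans:   list of (start, end) token index ranges, one per shot
    token_scores: per-token importance (e.g. accumulated attention mass)
    budget:       maximum number of prefill tokens to retain
    """
    # Rank shots by mean token importance, most important first.
    ranked = sorted(
        shot_spans,
        key=lambda s: sum(token_scores[s[0]:s[1]]) / (s[1] - s[0]),
        reverse=True,
    )
    kept, used = [], 0
    for start, end in ranked:
        size = end - start
        if used + size <= budget:   # a shot is kept only if it fits whole
            kept.append((start, end))
            used += size
    return sorted(kept)

# Three 10-token shots, a 20-token budget: the two highest-scoring
# shots survive intact; the weakest is dropped in its entirety.
spans = [(0, 10), (10, 20), (20, 30)]
scores = [1.0] * 10 + [3.0] * 10 + [2.0] * 10
print(shotkv_prefill_keep(spans, scores, budget=20))  # -> [(10, 20), (20, 30)]
```

Evicting whole shots rather than scattered tokens is what "shot-level semantic coherence" refers to: a partially evicted example could mislead the model more than an absent one.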

Why it matters?

This research matters because it helps make LLMs more efficient without sacrificing their ability to perform complex tasks. By reducing memory requirements while keeping accuracy high, ShotKV enables LLMs to be used more effectively in real-world applications where computational resources are limited.

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4%-43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.