LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
2025-08-07
Summary
This paper introduces LeanK, a method that makes large language models faster and less memory-hungry by learning which channels of the key (K) cache are unimportant and removing them during text generation (decoding).
What's the problem?
During text generation, large language models store past keys and values in a KV cache so they do not have to recompute them for every new token. This cache grows linearly with the context length and can consume many gigabytes of GPU memory, slowing down decoding and making long-context inference hard to run on limited hardware.
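To get a feel for the scale of the problem, here is a back-of-the-envelope calculation of KV cache size. The model configuration below (32 layers, 8 KV heads, head dimension 128, roughly Llama-3-8B-like) is an assumed example, not a number from the paper:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    # K and V each store: layers * kv_heads * head_dim * seq_len * batch
    # entries of dtype_bytes each; the factor 2 accounts for both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed Llama-3-8B-like config, fp16, 128k-token context:
size_gb = kv_cache_bytes(32, 8, 128, 128_000) / 1e9
print(f"{size_gb:.1f} GB")  # roughly 16.8 GB for the cache alone
```

Even with grouped-query attention (8 KV heads instead of 32), the cache alone can rival the model weights in size at long contexts, which is why pruning part of it pays off.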
What's the solution?
The solution is LeanK, which learns a static, per-channel importance score for the key cache and prunes the channels that contribute little to attention. By dropping these unimportant channels, the model stores a smaller K cache and computes attention over fewer dimensions, reducing memory use and speeding up decoding without losing accuracy in the generated text.
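The idea can be sketched in a few lines. This is a toy stand-in, not the paper's method: LeanK learns channel importance through training, whereas the sketch below scores channels with a simple |q·k| contribution heuristic, and all function names are invented for illustration:

```python
import numpy as np

def score_channels(keys, queries):
    # Toy importance heuristic (NOT LeanK's learned scores):
    # average absolute per-channel contribution to q.k dot products.
    # keys: (num_keys, d), queries: (num_queries, d)
    contrib = np.abs(queries[:, None, :] * keys[None, :, :])  # (q, k, d)
    return contrib.mean(axis=(0, 1))                          # (d,)

def prune_k_cache(keys, importance, keep_ratio=0.5):
    # Keep only the top-scoring channels of the key cache.
    d = keys.shape[-1]
    n_keep = max(1, int(d * keep_ratio))
    kept = np.sort(np.argsort(importance)[-n_keep:])
    return keys[:, kept], kept

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))     # cached keys: 128 tokens, 64 channels
queries = rng.standard_normal((16, 64))   # sample queries used for scoring

importance = score_channels(keys, queries)
pruned_keys, kept = prune_k_cache(keys, importance, keep_ratio=0.5)
# The K cache now stores half the channels; at decode time, attention
# scores are computed as queries[:, kept] @ pruned_keys.T.
```

Because the pruning is static (decided once per channel, not per token), the kept-channel indices can be fixed ahead of time and the smaller cache used for every decoding step.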
Why it matters?
This matters because it makes it easier and cheaper to run large language models on devices with limited memory or computing power, allowing more people and applications to benefit from advanced AI technology without sacrificing performance.
Abstract
LeanK, a learning-based method, prunes unimportant key cache channels in large language models to reduce memory usage and accelerate decoding without sacrificing accuracy.