SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou

2024-12-23

Summary

This paper introduces SCOPE, a framework for optimizing the Key-Value (KV) cache that large language models (LLMs) rely on when generating long outputs. It focuses on making the way these models store and reuse attention information during generation more memory-efficient.

What's the problem?

As LLMs generate longer texts, the KV cache grows with every token, consuming memory and slowing generation. Existing compression methods focus almost entirely on the prefill phase (processing the input prompt) and overlook the decoding phase (producing the output), which matters most for long-output reasoning tasks. Worse, compressing the prompt's cache too aggressively discards context the model needs, hurting its ability to comprehend and reason over the full input.

What's the solution?

SCOPE addresses these issues by optimizing the KV cache separately in the two phases: prefill and decoding. During prefill, it preserves the cache intact so no essential context is lost. During decoding, it uses a sliding strategy to select the heavy hitters, meaning tokens that receive disproportionately high attention and are therefore crucial for generating accurate outputs. Adaptive and discontinuous variants of this strategy further reduce memory usage and memory transfer without sacrificing performance.
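To make the decoding-phase idea concrete, here is a minimal toy sketch of a sliding heavy-hitter selection policy. This is an illustration of the general technique, not the authors' implementation: the function name, the cumulative-score input, and the budget/window parameters are all assumptions for the example.

```python
def select_decoding_cache(scores, budget, window):
    """Toy sliding heavy-hitter policy (illustrative, not SCOPE's actual code).

    Always keeps the most recent `window` decoding positions (the sliding
    window), then fills the remaining `budget` with the highest-scoring
    older positions (the "heavy hitters"). `scores` holds one cumulative
    attention score per generated token.
    """
    n = len(scores)
    # the sliding window: the last `window` generated tokens are always kept
    recent = set(range(max(0, n - window), n))
    remaining = budget - len(recent)
    # rank all older positions by their accumulated attention score
    older = sorted(
        (i for i in range(n) if i not in recent),
        key=lambda i: scores[i],
        reverse=True,
    )
    heavy = set(older[:max(0, remaining)])
    # positions whose KV entries survive in the cache
    return sorted(recent | heavy)

# Example: 10 generated tokens, cache budget of 6, sliding window of 3
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.1, 0.2, 0.4]
kept = select_decoding_cache(scores, budget=6, window=3)
# → [0, 2, 5, 7, 8, 9]: the last 3 tokens plus the 3 strongest heavy hitters
```

Because the window slides as generation proceeds, which positions count as "recent" changes over time, so heavy hitters that emerge late in a long output can still be captured, which is the deviation-of-heavy-hitters observation the paper is built on.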

Why it matters?

This research is important because it enhances the performance of LLMs when generating long texts, making them faster and more effective. By optimizing how they store and access information, SCOPE can help improve various applications that rely on LLMs, such as chatbots, content generation, and more complex reasoning tasks.

Abstract

The Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks, based on the following two observations: (i) excessive compression during the prefill phase, which requires the specific full context, impairs the comprehension of the reasoning task; (ii) deviation of heavy hitters occurs in reasoning tasks with long outputs. Therefore, SCOPE, a simple yet efficient framework that separately performs KV cache optimization during the prefill and decoding phases, is introduced. Specifically, the KV cache during the prefill phase is preserved to maintain the essential information, while a novel strategy based on sliding is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE and its compatibility as a plug-in to other prefill-only KV compression methods.