
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

2025-02-04


Summary

This paper introduces FastKV, a new method that makes large language models (LLMs) faster and more efficient when working with long pieces of text. It compresses the memory these models use to store context while keeping their accuracy high.

What's the problem?

LLMs are great at handling long texts, but they need a lot of memory to store contextual information in something called a key-value (KV) cache. This slows them down and makes them less efficient, especially when processing long sequences. Previous methods to compress these caches focused on saving memory but didn’t improve the speed much.

What's the solution?

The researchers created FastKV, which uses a technique called Token-Selective Propagation (TSP). TSP keeps the full context in the model's early layers but forwards only the most important tokens to the deeper layers, even during the prefill stage, which cuts down the work those layers must do. FastKV also adds a compression scheme that is aware of grouped-query attention (GQA), exploiting GQA's structure to save both memory and computation. Together, these improve processing speed while maintaining accuracy, achieving better results than previous methods like HeadKV.
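The core idea of TSP can be sketched in a few lines: score each token's importance at some layer, then pass only the top-scoring tokens' hidden states on to deeper layers. This is a minimal illustrative sketch, not FastKV's actual implementation; the function name, shapes, and the use of summed attention as the importance score are all assumptions.

```python
import numpy as np

def tsp_select(hidden_states, attention_scores, keep_ratio=0.25):
    """Sketch of Token-Selective Propagation: keep only the hidden states
    of the highest-scoring tokens for the deeper layers.

    hidden_states:    (seq_len, hidden_dim) array
    attention_scores: (seq_len, seq_len) array
    keep_ratio:       fraction of tokens to propagate
    """
    seq_len = hidden_states.shape[0]
    # Importance of each token, e.g. total attention it receives (assumption).
    importance = attention_scores.sum(axis=0)
    k = max(1, int(seq_len * keep_ratio))
    # Indices of the top-k tokens, restored to original order.
    kept = np.sort(np.argsort(importance)[-k:])
    return hidden_states[kept], kept

# Toy example: 8 tokens with hidden size 4; keep half of them.
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 4))
attn = rng.random((8, 8))
pruned, idx = tsp_select(h, attn, keep_ratio=0.5)
print(pruned.shape)  # (4, 4)
```

Because only the selected rows flow into deeper layers, both the attention computation and the KV cache for those layers shrink proportionally to `keep_ratio`.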

Why it matters?

This research is important because it makes LLMs faster and more practical for tasks that involve long texts, such as document analysis or generating detailed summaries. By improving both speed and efficiency without losing accuracy, FastKV helps unlock new possibilities for using AI in real-world applications where time and resources are limited.

Abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00× and 1.40× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
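The GQA-aware compression the abstract mentions rests on a simple structural fact: in grouped-query attention, several query heads share one KV head, so a single keep/drop decision per KV head covers the whole group. A hedged sketch of how per-query-head token scores might be pooled into per-KV-head scores (the function name and the sum-pooling choice are assumptions, not FastKV's actual API):

```python
import numpy as np

def gqa_group_scores(attn_scores, num_kv_heads):
    """Pool per-query-head token importance into per-KV-head scores,
    so one compression decision applies to an entire GQA group.

    attn_scores: (num_q_heads, seq_len) importance per query head
    Returns:     (num_kv_heads, seq_len) pooled importance
    """
    num_q_heads, seq_len = attn_scores.shape
    group_size = num_q_heads // num_kv_heads
    # Sum the scores of the query heads that share each KV head (assumption).
    return attn_scores.reshape(num_kv_heads, group_size, seq_len).sum(axis=1)

# Toy example: 8 query heads sharing 2 KV heads, over 4 tokens.
scores = np.arange(32, dtype=float).reshape(8, 4)
grouped = gqa_group_scores(scores, num_kv_heads=2)
print(grouped.shape)  # (2, 4)
```

Pruning the cache per KV head rather than per query head is what lets the compression pay off in both memory and compute under GQA, since each pruned KV entry is shared by the whole query group.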