SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
2024-12-16
Summary
This paper introduces SCBench, a benchmark designed to evaluate how well long-context language models (LLMs) and their efficiency optimizations use the KV cache, the memory structure that stores intermediate attention states during inference.
What's the problem?
Long-context LLMs are powerful tools for processing and generating text, but they often struggle with efficiency when handling large amounts of information. Current benchmarks usually evaluate only single requests, ignoring how these models actually work in real-world deployments, where the KV cache is reused across multiple requests that share the same context. This oversight can lead to inaccurate assessments of their performance.
What's the solution?
SCBench addresses this issue with a comprehensive evaluation framework that covers the entire lifecycle of the KV cache: how the cache is generated, compressed, retrieved, and loaded. The benchmark includes 12 tasks that test models in scenarios where multiple requests share a context, providing a clearer picture of their capabilities across varied tasks and conditions.
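The shared-context setting can be illustrated with a toy sketch (hypothetical code, not from SCBench): when many requests repeat the same long prefix, the expensive KV cache generation (prefill) can run once and be reused on later turns. Here the cache is modeled as a simple dictionary and prefill as a counted function call.

```python
# Hypothetical illustration of shared-context KV cache reuse
# (not actual SCBench or inference-framework code).

generation_calls = 0

def build_kv_cache(prefix: str) -> list:
    """Stand-in for prefilling the KV cache over a long shared context."""
    global generation_calls
    generation_calls += 1
    return [ord(c) for c in prefix]  # toy "cache": one entry per character

kv_store = {}  # maps a shared context prefix -> its cached KV entries

def answer(prefix: str, question: str) -> str:
    # Reuse the cached prefix KV if this context was seen before
    # (multi-turn mode); otherwise generate and store it.
    if prefix not in kv_store:
        kv_store[prefix] = build_kv_cache(prefix)
    cache = kv_store[prefix]
    return f"answered {question!r} using {len(cache)} cached KV entries"

shared_doc = "a long document " * 4
for q in ["q1", "q2", "q3"]:  # three turns over the same shared context
    answer(shared_doc, q)

print(generation_calls)  # → 1: the expensive prefill ran only once
```

Benchmarks that test each request in isolation would rerun the prefill every time, which is exactly the gap in KV-cache-reuse behavior that SCBench's shared-context tasks are designed to expose.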
Why it matters?
This research is important because it improves our understanding of how long-context LLMs function in practical applications. By focusing on KV cache performance, SCBench can help developers optimize these models for better efficiency and effectiveness, ultimately leading to more advanced AI applications in fields like natural language processing and machine learning.
Abstract
Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request scenarios, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading. Specifically, SCBench uses test examples with shared context, comprising 12 tasks with two shared-context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. https://aka.ms/SCBench.