Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
2025-03-11

Summary
This paper presents a smarter way for AI models to use outside information: squeezing knowledge into a compact format, like a student condensing notes for an open-book exam, so they can answer questions faster and more accurately.
What's the problem?
Current methods like Retrieval-Augmented Generation (RAG) miss important details when they fall outside the top search results, while long-context models that read many documents at once are slow and limited by how much text they can hold.
What's the solution?
The new method compresses all relevant information into a smaller, task-focused package, much like study notes, letting the AI reason over everything it needs without sifting through piles of raw text.
Why does it matter?
This makes AI assistants like chatbots faster and more accurate on complex tasks (such as research or homework), saving time and compute while improving results.
Abstract
Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
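
To make the idea concrete, below is a minimal sketch of the kind of task-aware KV cache compression the abstract describes: key/value entries cached from prefilling the external documents are scored by how much attention the tokens of a task description pay to them, and only the top-scoring fraction is kept. The function name `compress_kv_cache`, the `keep_ratio` parameter, and the single-head tensor shapes are illustrative assumptions, not the authors' actual implementation.

```python
import torch

def compress_kv_cache(keys, values, task_query, keep_ratio=1 / 30):
    """Sketch of task-aware KV cache compression (hypothetical API).

    keys, values: [num_tokens, head_dim] cached states from prefilling
        the external documents (one attention head, for simplicity).
    task_query: [num_task_tokens, head_dim] query states derived from a
        task description (zero-shot) or a few examples (few-shot).
    keep_ratio: fraction of the cache to retain; ~1/30 mirrors the 30x
        compression rate reported in the abstract.
    """
    # Score each cached token by the attention the task tokens give it.
    scale = keys.shape[-1] ** -0.5
    attn = torch.softmax(task_query @ keys.T * scale, dim=-1)  # [task, tokens]
    importance = attn.sum(dim=0)  # aggregate relevance per cached token

    # Keep the highest-scoring entries, restoring original order so the
    # compacted cache preserves the documents' positional structure.
    k = max(1, int(keys.shape[0] * keep_ratio))
    idx = importance.topk(k).indices.sort().values
    return keys[idx], values[idx]


# Usage: prefill the documents once, compress per task, then decode
# against the compact cache instead of retrieving or re-reading raw text.
keys = torch.randn(4096, 64)
values = torch.randn(4096, 64)
task_query = torch.randn(16, 64)
small_k, small_v = compress_kv_cache(keys, values, task_query)
print(small_k.shape)  # roughly 30x fewer cached tokens
```

The design intuition matches the open-book-exam analogy: the task description acts like an exam prompt that tells the student which notes are worth keeping before the test begins.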