SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He
2024-10-08

Summary
This paper presents SwiftKV, a new method designed to make large language models (LLMs) faster and cheaper at processing input prompts (the prefill step), while still maintaining high-quality output.
What's the problem?
When using LLMs for tasks like summarization or code generation, the initial step of processing input data (called 'prefill') can take a lot of time and computing power. This is especially true because the input prompts are often much longer than the actual output, leading to delays and increased costs.
What's the solution?
The authors developed SwiftKV, which uses three main techniques to improve efficiency. First, SingleInputKV lets the later layers of the model fill their KV cache from a much earlier layer's output, so prompt tokens can skip most of the model's computation. Second, AcrossKV merges the KV caches of neighboring layers to save memory, which allows larger batches to be processed at once. Lastly, a knowledge-preserving distillation procedure adapts existing models to work with SwiftKV with minimal loss of accuracy. As a result, SwiftKV reduces the computation needed for prefill by 50% and the KV cache memory by 62.5%, all while producing high-quality results.
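To make the first two ideas concrete, below is a minimal, self-contained sketch of how prefill could look with SingleInputKV and AcrossKV. It is illustrative only: the layer counts, the names swift_layer and kv_group, and the toy projections are assumptions made for this example, not the paper's actual implementation or its vLLM integration.

```python
# Illustrative sketch of SingleInputKV + AcrossKV during prefill (not the paper's code).
# swift_layer, kv_group, W_k/W_v, and layer_forward are hypothetical placeholders.
import numpy as np

num_layers = 8       # toy transformer depth
swift_layer = 4      # SingleInputKV: layers >= swift_layer take their KV input from here
kv_group = 2         # AcrossKV: this many neighboring skipped layers share one KV entry
d_model, seq_len = 16, 6

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))  # prompt hidden states after embedding
W_k = [rng.normal(size=(d_model, d_model)) for _ in range(num_layers)]
W_v = [rng.normal(size=(d_model, d_model)) for _ in range(num_layers)]

def layer_forward(h):
    # stand-in for one transformer layer's attention + MLP compute
    return np.tanh(h @ rng.normal(size=(d_model, d_model)) * 0.1) + h

kv_cache = {}
h = x
for i in range(swift_layer):
    # Normal prefill for the early layers: full compute, per-layer KV cache.
    kv_cache[i] = (h @ W_k[i], h @ W_v[i])
    h = layer_forward(h)

# SingleInputKV: the output of layer swift_layer - 1 (h) is the single input used to
# fill the KV cache of every remaining layer, so prompt tokens skip the attention/MLP
# compute of layers swift_layer..num_layers-1. During decoding, generated tokens still
# run all layers and attend to this cache.
for i in range(swift_layer, num_layers, kv_group):
    # AcrossKV: kv_group neighboring skipped layers share one merged KV entry,
    # shrinking their KV-cache memory by roughly a factor of kv_group.
    shared = (h @ W_k[i], h @ W_v[i])
    for j in range(i, min(i + kv_group, num_layers)):
        kv_cache[j] = shared

print({layer: kv[0].shape for layer, kv in kv_cache.items()})
```

The point of the sketch is that prompt tokens only run the first swift_layer layers at full compute, the remaining layers still receive a usable KV cache, and neighboring skipped layers can share a single cache entry to cut memory.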
Why it matters?
This research is significant because it enables faster and cheaper use of powerful AI models in many applications. By cutting the cost of processing long prompts, SwiftKV could help make advanced AI technologies more accessible and practical in real-world scenarios.
Abstract
LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers' KV cache using a much earlier layer's output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs.
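As a rough sanity check on the headline throughput figure, assuming the common dense-transformer approximation of about 2 FLOPs per parameter per token (the paper's exact normalization may differ), the reported 16K tokens/s for Llama-3.1-70B on 4 GPUs is consistent with 560 TFlops/GPU:

```python
# Back-of-the-envelope check (assumes ~2 FLOPs per parameter per token;
# the paper's normalization may differ in detail).
params = 70e9          # Llama-3.1-70B parameters
tokens_per_s = 16_000  # reported aggregate throughput
gpus = 4               # 4x H100
tflops_per_gpu = 2 * params * tokens_per_s / gpus / 1e12
print(tflops_per_gpu)  # 560.0
```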