MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
2024-06-14

Summary
This paper presents a new method called Multi-Layer Key-Value (MLKV) sharing, designed to make transformer models more memory-efficient during text generation. It focuses on shrinking the cache these models use to store keys and values while generating text.
What's the problem?
Transformers, the type of AI model behind most modern language systems, use Key-Value (KV) caching to store the key and value vectors of tokens they have already processed, so they don't have to recompute them for every new token. However, as models grow larger and handle longer sequences and bigger batches, the memory needed for this cache becomes a significant bottleneck, making it hard to run them efficiently on devices with limited resources.
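To get a feel for the scale of the problem, here is a rough back-of-the-envelope estimate of KV cache size. All model dimensions below are illustrative assumptions, not figures from the paper:

```python
# Rough KV cache size for a hypothetical decoder-only transformer.
# Every number here is an illustrative assumption, not from the paper.
n_layers = 32          # each decoder layer keeps its own KV cache by default
n_kv_heads = 32        # KV heads per layer (equals query heads in vanilla attention)
head_dim = 128         # dimension of each head
seq_len = 4096         # tokens kept in the cache
batch_size = 8         # concurrent sequences
bytes_per_value = 2    # fp16/bf16

# Factor of 2 accounts for storing both keys and values
cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value
print(f"KV cache: {cache_bytes / 1e9:.1f} GB")  # ~17.2 GB for these settings
```

Because the cache scales with the number of layers and KV heads, sharing KV heads (within a layer, or across layers as MLKV does) directly shrinks this figure.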
What's the solution?
To address this, the authors developed MLKV sharing, which lets the model share KV heads not just within a single layer (as MQA and GQA do) but across multiple layers of the transformer. Instead of each layer keeping its own separate cache, a group of layers reuses the same cache, which greatly reduces the total memory required. Their experiments show that this can shrink the KV cache to as little as one sixth the size needed by MQA, the most aggressive previous method, while maintaining similar performance.
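The sketch below illustrates the idea in PyTorch, assuming simplified single-head attention and an illustrative grouping of three layers per KV head. The class names and structure are hypothetical and do not mirror the authors' implementation; only the first layer in each group computes keys and values, and the later layers in the group attend to them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayer(nn.Module):
    """Simplified single-head attention layer. Only the first layer in each KV
    group owns K/V projections; later layers reuse that group's K/V."""
    def __init__(self, d_model, head_dim, owns_kv):
        super().__init__()
        self.q_proj = nn.Linear(d_model, head_dim, bias=False)
        self.out_proj = nn.Linear(head_dim, d_model, bias=False)
        if owns_kv:
            self.k_proj = nn.Linear(d_model, head_dim, bias=False)
            self.v_proj = nn.Linear(d_model, head_dim, bias=False)

    def forward(self, x, kv=None):
        q = self.q_proj(x)
        if kv is None:  # this layer computes (and, at inference, would cache) K/V
            kv = (self.k_proj(x), self.v_proj(x))
        k, v = kv       # later layers in the group attend to the shared K/V
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return x + self.out_proj(attn), kv

# Illustrative sizes: 6 layers, one KV head shared by every 3 consecutive layers.
d_model, head_dim, n_layers, layers_per_kv = 64, 16, 6, 3
layers = [DecoderLayer(d_model, head_dim, owns_kv=(i % layers_per_kv == 0))
          for i in range(n_layers)]

x = torch.randn(2, 10, d_model)  # (batch, seq, d_model)
kv = None
for i, layer in enumerate(layers):
    if i % layers_per_kv == 0:
        kv = None  # start of a new group: compute fresh K/V
    x, kv = layer(x, kv)
# At inference only n_layers // layers_per_kv = 2 KV caches need to be stored, not 6.
```

With the per-layer caches collapsed into per-group caches, the cache-size formula above scales with the number of KV groups rather than the number of layers.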
Why it matters?
This research is important because it helps make powerful AI models more accessible by reducing their memory requirements. By allowing these models to run more efficiently, MLKV could enable their use in a wider range of applications, including on devices that previously couldn't support such large models, making it a useful step toward deploying large language models more broadly.
Abstract
Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv