ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
2024-10-17

Summary
This paper introduces ZipVL, a method that makes large vision-language models (LVLMs) more efficient by processing only the most important visual tokens and compressing the key-value (KV) cache that stores past context.
What's the problem?
Large vision-language models, which combine visual and textual information, face two main efficiency bottlenecks: the attention mechanism becomes computationally expensive during the prefill phase when processing the long token sequences produced by high-resolution images or videos, and memory becomes a bottleneck during decoding because the stored key-value (KV) cache must be fetched at every generation step. Both problems limit the models' speed and scalability.
What's the solution?
To address these challenges, the authors developed ZipVL, which dynamically focuses computation on the most important tokens during processing. Instead of treating every token equally, ZipVL decides per layer which tokens matter most and allocates resources accordingly. It also compresses the KV cache by storing important tokens at higher numerical precision and less important ones at lower precision. With these two mechanisms, ZipVL speeds up the prefill phase by 2.6 times and cuts GPU memory use by about 50% while keeping accuracy essentially unchanged.
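To make the token-selection idea concrete, here is a minimal PyTorch sketch of how a layer's important-token ratio could be chosen adaptively from its attention scores rather than fixed as a hyper-parameter. The function name, the coverage threshold `tau`, and the per-token score aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def select_important_tokens(attn_scores: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """Pick the smallest set of tokens whose summed attention covers a
    fraction `tau` of the total attention mass, so the retained ratio
    adapts to each layer's score distribution instead of being fixed.

    attn_scores: [seq_len] non-negative importance per token, e.g. the
                 attention each token receives, averaged over heads/queries.
    Returns the indices of the retained ("important") tokens.
    """
    sorted_scores, sorted_idx = attn_scores.sort(descending=True)
    cum_mass = sorted_scores.cumsum(dim=0) / sorted_scores.sum()
    # Count how many top-ranked tokens are needed to reach the threshold.
    k = int((cum_mass < tau).sum().item()) + 1
    return sorted_idx[:k]
```

In this sketch, attention during prefill would then be computed only over the returned indices, and the same indices would later decide which tokens' KV-cache entries are kept at higher precision.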
Why it matters?
This research is significant because it enhances the efficiency of AI systems that need to understand both images and text. By making these models faster and less memory-intensive, ZipVL can help improve applications in areas like real-time video analysis, automated content creation, and more, making powerful AI tools accessible even on devices with limited resources.
Abstract
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs that resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform the attention mechanism solely on those important tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we employ mixed-precision quantization of the KV cache, where high-bit quantization is used for caches of important tokens, while low-bit quantization is applied to those of less importance. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on the Video-MME benchmark over the LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
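As an illustration of the mixed-precision KV-cache idea described in the abstract, the sketch below quantizes the cache entries of important tokens at a higher bit-width and the remaining entries at a lower bit-width. The per-token asymmetric quantizer, the 8-bit/2-bit split, and the function names are assumptions made for illustration; the paper only specifies that higher precision is reserved for important tokens.

```python
import torch

def quantize_per_token(x: torch.Tensor, bits: int):
    """Uniform asymmetric quantization of a [num_tokens, head_dim] tensor,
    with one scale and zero-point per token (a common KV-cache scheme)."""
    qmax = 2 ** bits - 1
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax)
    return q, scale, x_min  # dequantize as q * scale + x_min

def compress_kv(kv: torch.Tensor, important_idx: torch.Tensor,
                hi_bits: int = 8, lo_bits: int = 2):
    """Store the KV cache of important tokens at high precision and the
    rest at low precision; the bit-widths here are illustrative only."""
    mask = torch.zeros(kv.shape[0], dtype=torch.bool)
    mask[important_idx] = True
    hi = quantize_per_token(kv[mask], hi_bits)    # important tokens
    lo = quantize_per_token(kv[~mask], lo_bits)   # remaining tokens
    return hi, lo, mask
```

Splitting the cache this way keeps the few tokens that dominate attention nearly lossless while aggressively shrinking the memory footprint of everything else, which is where the reported memory savings come from.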