VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
2024-12-06
Summary
This paper introduces VisionZip, a method that improves the efficiency of vision-language models by reducing the number of visual tokens passed to the language model without sacrificing performance.
What's the problem?
Vision-language models (VLMs) have been improved by feeding them ever longer sequences of visual tokens, but this significantly increases computational cost. Many of these tokens carry redundant information, so the models spend more compute and memory than they need to, which slows inference and reduces efficiency.
What's the solution?
VisionZip addresses this issue by selecting only the most informative visual tokens as input to the language model, reducing redundancy and improving efficiency while preserving overall performance. The method applies broadly to image and video understanding tasks and is especially effective in multi-turn dialogues, where previous token-reduction methods often struggle. Experiments show that VisionZip outperforms existing methods while substantially speeding up inference.
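The summary does not spell out how "informative" tokens are chosen, but the core idea, keeping only the highest-scoring visual tokens before they reach the language model, can be sketched as follows. This is a minimal illustration, assuming per-token importance scores (e.g., attention received inside the vision encoder) are available; the names `visual_tokens`, `attn_scores`, and `keep_ratio` are illustrative, not the authors' released API.

```python
import torch

def select_informative_tokens(visual_tokens: torch.Tensor,
                              attn_scores: torch.Tensor,
                              keep_ratio: float = 0.125) -> torch.Tensor:
    """Keep only the highest-scoring visual tokens.

    visual_tokens: (batch, num_tokens, dim) patch embeddings from the vision encoder.
    attn_scores:   (batch, num_tokens) per-token importance, e.g. attention
                   received from the [CLS] token (an assumption for this sketch).
    keep_ratio:    fraction of tokens passed on to the language model.
    """
    batch, num_tokens, dim = visual_tokens.shape
    k = max(1, int(num_tokens * keep_ratio))

    # Indices of the top-k most "informative" tokens per image.
    topk_idx = attn_scores.topk(k, dim=1).indices            # (batch, k)

    # Gather the selected tokens; the rest are simply dropped in this sketch
    # (merging the remainder into a few summary tokens is omitted here).
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)          # (batch, k, dim)
    return visual_tokens.gather(1, idx)                       # (batch, k, dim)


# Example: 576 patch tokens reduced to 72 before the language model sees them.
tokens = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
reduced = select_informative_tokens(tokens, scores)
print(reduced.shape)  # torch.Size([2, 72, 1024])
```

Because the selection happens in the vision encoder's output space, the language model itself needs no retraining in principle; it simply receives a shorter visual prefix.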
Why it matters?
This research is important because it helps make vision-language models faster and more efficient, which is crucial for real-world applications like chatbots, virtual assistants, and interactive systems. By optimizing how these models handle visual information, VisionZip can lead to better user experiences and more effective AI systems.
Abstract
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
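The reported 8x prefill speed-up is plausible from a back-of-the-envelope view: self-attention cost during prefilling grows quadratically with prompt length, so shrinking the visual portion of the prompt shrinks the dominant term. The sketch below uses illustrative assumptions (576 visual tokens for a CLIP-style 24x24 patch grid, a reduction to 64 tokens, and a short text prompt); it is a rough cost model, not a measurement from the paper.

```python
# Rough prefill-cost model: attention FLOPs scale with the square of the
# prompt length, so cutting visual tokens cuts the dominant cost term.
# All numbers below are illustrative assumptions, not figures from the paper.

def attention_cost(num_tokens: int) -> int:
    """Relative self-attention cost for prefilling a prompt of num_tokens."""
    return num_tokens ** 2

text_tokens = 64        # assumed short user prompt
full_visual = 576       # e.g. a 24x24 patch grid from a CLIP-style encoder
reduced_visual = 64     # after aggressive token selection (assumption)

baseline = attention_cost(text_tokens + full_visual)
visionzip = attention_cost(text_tokens + reduced_visual)

print(f"relative attention cost during prefill: {baseline / visionzip:.1f}x")
# -> 25.0x for the attention term alone; the end-to-end speed-up is smaller
#    (the abstract reports 8x) because MLP layers scale only linearly with
#    token count and other overheads remain.
```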