PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin

2024-10-23

Summary

This paper introduces PyramidDrop, a new method for improving the efficiency of large vision-language models (LVLMs) by reducing the number of image tokens used without losing important information.

What's the problem?

Large vision-language models must represent each image with many tokens (small pieces of data), often hundreds or even thousands per image. Processing all of these tokens is computationally expensive, and the cost grows rapidly as image resolution increases, making these models slow and costly to train and use.

What's the solution?

The researchers found that while all image tokens matter in the shallow layers of the model, many become redundant in the deeper layers. PyramidDrop exploits this by dividing the model into stages and dropping a fixed fraction of the least relevant image tokens at the end of each stage, so deeper layers process fewer tokens while performance is maintained. This significantly speeds up both training and inference.
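
The stage-wise schedule can be illustrated in a few lines of code. The sketch below is not the authors' implementation; the function name, number of stages, and per-stage keep ratio are assumptions chosen only to show how the retained image tokens shrink into a pyramid across the model's depth.

```python
# Hypothetical sketch of a pyramid drop schedule (names and defaults assumed).

def pyramid_token_counts(num_image_tokens: int, num_stages: int = 4,
                         keep_ratio: float = 0.5) -> list[int]:
    """Number of image tokens that survive after each stage boundary."""
    counts = []
    remaining = num_image_tokens
    for _ in range(num_stages):
        # Drop a fixed fraction of the image tokens at the end of each stage.
        remaining = max(1, int(remaining * keep_ratio))
        counts.append(remaining)
    return counts

# Example: 576 image tokens (a 24x24 patch grid), four stages, 50% kept per stage.
print(pyramid_token_counts(576))  # -> [288, 144, 72, 36]
```

Because most of the model's layers operate on these smaller token sets, the overall attention and feed-forward cost drops sharply even though the shallow layers still see every image token.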

Why it matters?

PyramidDrop is important because it makes models that understand both text and images faster and cheaper to train and run. This can lead to quicker responses in applications that combine vision and language, such as answering questions about images, making the technology more practical and accessible.

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers of the model. To this end, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that boosts their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop achieves a 40% training-time and 55% inference-FLOPs acceleration of LLaVA-NeXT with comparable performance. Moreover, PyramidDrop can also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than comparable methods. We hope that the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.
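
The abstract describes the dropping as a lightweight similarity calculation performed at the end of each stage. The sketch below shows one plausible form of such a criterion: ranking image tokens by the dot-product similarity between their hidden states and the hidden state of the last instruction token, then keeping the top-ranked fraction. The tensor names, shapes, and exact ranking rule are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of similarity-based image-token selection at a stage boundary.
import torch

def drop_image_tokens(hidden: torch.Tensor, image_mask: torch.Tensor,
                      keep_ratio: float = 0.5) -> torch.Tensor:
    """
    hidden:     (seq_len, dim) hidden states at the end of a stage
    image_mask: (seq_len,) boolean mask marking image-token positions
    Returns a boolean mask over the sequence indicating which tokens to keep.
    """
    query = hidden[-1]                       # last (instruction) token state
    image_states = hidden[image_mask]        # (num_image_tokens, dim)
    scores = image_states @ query            # one similarity score per image token

    num_keep = max(1, int(keep_ratio * scores.numel()))
    top = scores.topk(num_keep).indices      # most relevant image tokens

    keep = ~image_mask                       # always keep text tokens
    image_positions = image_mask.nonzero(as_tuple=True)[0]
    keep[image_positions[top]] = True        # re-admit the selected image tokens
    return keep
```

In this sketch, a model would call `drop_image_tokens` at the end of each stage and pass only the kept tokens to the next block of layers, which is what produces the pyramid-shaped token counts across depth.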