
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Mark Endo, Xiaohan Wang, Serena Yeung-Levy

2024-12-18


Summary

This paper introduces FEATHER, a method that improves how visual tokens are pruned in Vision-Language Models (VLMs), making them faster and more efficient while maintaining performance.

What's the problem?

Vision-Language Models are powerful AI systems that combine visual and text information, but they are slow when processing images because they must handle a large number of visual tokens. A common way to accelerate them is to remove (or 'prune') unnecessary visual tokens, but existing pruning criteria can discard important information. In particular, the paper shows they tend to prune away most tokens from the top of the image, which hurts performance on tasks that need fine-grained visual understanding, such as locating objects in images.
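The failure mode can be seen with a toy example. Below is a hypothetical sketch (not the authors' code) of a FastV-style criterion that keeps the top-k tokens by an importance score; if, as the paper observes for early-layer attention, the scores happen to grow with token position, every token from the upper rows of the image is pruned.

```python
def topk_prune(scores, k):
    """Keep the k highest-scoring token indices (simple top-k criterion)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# A 4x4 grid of visual tokens in row-major order. The scores increase
# with position, mimicking the reported bias toward later tokens,
# which correspond to lower regions of the image.
scores = [i * 0.1 for i in range(16)]

kept = topk_prune(scores, 4)
print(kept)  # [12, 13, 14, 15]: only the bottom row of the image survives
```

Because the criterion never looks at spatial position, nothing prevents it from concentrating the entire token budget in one region.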

What's the solution?

The authors propose FEATHER, which improves the token pruning process by addressing the issues found in earlier methods. FEATHER resolves the identified flaw in early-layer pruning, incorporates uniform sampling so that every region of the image stays represented, and applies pruning in two stages: a rough early-layer pruning that delivers most of the speedup, followed by a second pruning at a later layer where the selection criteria are more reliable. This yields significant speed improvements while enhancing accuracy, particularly on tasks like localization.
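The coverage idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 50/50 split between uniformly sampled and score-selected tokens, and the use of evenly strided indices as the "uniform sampling" are all assumptions made for the example.

```python
def feather_style_prune(scores, keep_ratio=0.25, uniform_frac=0.5):
    """Sketch of coverage-aware pruning: keep a mix of uniformly
    sampled token indices and the highest-scoring remaining ones.

    scores: per-token importance scores for N visual tokens
            (row-major over the image grid).
    Returns the sorted indices of the tokens to keep.
    """
    n = len(scores)
    n_keep = max(1, round(n * keep_ratio))
    n_uniform = max(1, int(n_keep * uniform_frac))
    n_criteria = n_keep - n_uniform

    # Evenly strided indices guarantee every image region keeps some
    # tokens, including the top rows that score-based criteria drop.
    stride = n / n_uniform
    uniform_idx = {int(i * stride) for i in range(n_uniform)}

    # Spend the rest of the budget on the highest-scoring tokens
    # that were not already selected.
    remaining = [i for i in range(n) if i not in uniform_idx]
    remaining.sort(key=lambda i: scores[i], reverse=True)
    criteria_idx = remaining[:n_criteria]

    return sorted(uniform_idx | set(criteria_idx))

# 16 tokens with position-biased scores: half the kept tokens now come
# from strided positions, so the top of the image is still covered.
print(feather_style_prune([float(i) for i in range(16)]))  # [0, 8, 14, 15]
```

In FEATHER this coverage guarantee is combined with pruning at two depths, so the early stage buys speed while the later stage applies its criteria where they are more effective.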

Why it matters?

This research is important because it helps make AI models that understand both images and text more efficient without sacrificing their ability to perform complex tasks. By improving how these models process visual information, FEATHER can lead to faster and more reliable AI applications in areas such as robotics, autonomous vehicles, and augmented reality.

Abstract

Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER has more than 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.