A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
2025-08-05
Summary
This paper introduces GlimpsePrune, a system that helps large vision-language AI models run faster and use less memory by quickly discarding unimportant parts of an image before answering questions.
What's the problem?
The problem is that these big AI models have to process a huge number of tiny image pieces called visual tokens, which takes a lot of computing power and slows things down, especially when many of those tokens carry information that is not actually needed for the question being asked.
What's the solution?
GlimpsePrune solves this by taking a quick, data-driven 'glimpse' at the image to figure out which visual tokens matter for the question. It then removes the unimportant tokens all at once, before the model generates its answer, keeping the relevant information while discarding the rest.
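To make the idea concrete, here is a minimal PyTorch sketch of score-based visual token pruning. The function name, the `keep_ratio` value, and the use of a generic relevance score are illustrative assumptions, not the paper's actual implementation, which learns its own "glimpse" mechanism.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        importance: torch.Tensor,
                        keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep only the highest-scoring visual tokens.

    visual_tokens: (batch, num_tokens, dim) image-patch embeddings.
    importance:    (batch, num_tokens) relevance score per token,
                   e.g. attention from the question to each patch.
    keep_ratio:    fraction of tokens to retain (hypothetical value).
    """
    num_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
    # Indices of the top-scoring tokens for each sample in the batch.
    top_idx = importance.topk(num_keep, dim=1).indices          # (batch, num_keep)
    # Gather the surviving tokens; the rest are dropped in one step.
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, top_idx)                     # (batch, num_keep, dim)


# Toy usage: 576 patch tokens pruned down before the language model runs.
tokens = torch.randn(1, 576, 1024)
scores = torch.rand(1, 576)   # stand-in for a learned "glimpse" score
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.3)
print(pruned.shape)           # torch.Size([1, 172, 1024])
```

Because pruning happens once, before answer generation, every later decoding step attends over a much shorter visual sequence, which is where the speed and memory savings come from.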
Why it matters?
This matters because it makes vision-language AI systems faster and more efficient without sacrificing accuracy, which benefits applications such as image-based question answering and other tasks that require understanding pictures and text together.
Abstract
GlimpsePrune, a dynamic pruning framework, improves the efficiency of Large Vision-Language Models by adaptively removing irrelevant visual tokens without degrading performance.