
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

2025-08-26


Summary

This paper focuses on making Vision-Language Models, which are good at understanding images and text together, run faster and more efficiently. The authors do this by figuring out which parts of the image information are actually important and discarding the rest.

What's the problem?

Vision-Language Models process images by breaking them down into smaller pieces called 'vision tokens'. However, many of these tokens are redundant, meaning they don't add much new information and slow down the model. Previous methods for reducing these tokens only looked at the image *or* the text separately, ignoring the fact that they work best *together*. Also, there wasn't a clear, general rule for deciding which tokens to remove across both image and text.

What's the solution?

The researchers developed a method called MMTok that uses both image and text information to select the most important vision tokens. They framed the problem as finding the smallest set of image tokens that 'cover' the information in the text and the original image tokens. To refine this process, they even used the Vision-Language Model itself to improve the quality of the text information, which then helps guide the removal of unnecessary image tokens. Essentially, they're using the model to help itself become more efficient.
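The paper's exact optimization procedure isn't spelled out in this summary, but the coverage idea can be sketched with a standard greedy maximum-coverage heuristic. Everything below (the function name, the `alpha` weight, and the clipped-similarity notion of coverage) is illustrative and not the authors' implementation:

```python
import numpy as np

def greedy_max_coverage(vision, text, k, alpha=0.5):
    """Pick k vision tokens that jointly cover the text tokens and the
    full vision-token set, in the spirit of MMTok's coverage criterion.

    vision: (N, d) vision-token embeddings
    text:   (M, d) text-token embeddings
    alpha:  hypothetical weight trading off text coverage vs. vision
            self-coverage (not from the paper)
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    v, t = unit(vision), unit(text)
    sim_text = v @ t.T  # (N, M): each vision token's similarity to each text token
    sim_vis = v @ v.T   # (N, N): similarity to every original vision token

    # Coverage of a target token = best similarity to any selected token
    # (clipped at 0). This objective is monotone submodular, so greedy
    # selection enjoys the classic (1 - 1/e) approximation guarantee.
    cov_text = np.zeros(t.shape[0])
    cov_vis = np.zeros(v.shape[0])
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate token to the subset
        gain = alpha * np.maximum(sim_text - cov_text, 0.0).sum(axis=1) \
             + (1 - alpha) * np.maximum(sim_vis - cov_vis, 0.0).sum(axis=1)
        gain[selected] = -np.inf  # never re-pick a token
        best = int(np.argmax(gain))
        selected.append(best)
        cov_text = np.maximum(cov_text, sim_text[best])
        cov_vis = np.maximum(cov_vis, sim_vis[best])
    return sorted(selected)
```

The greedy loop repeatedly adds whichever vision token most increases the combined coverage of the text tokens and the original vision tokens, which is one simple way to approximate the maximum coverage formulation the abstract describes.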

Why it matters?

This work is important because it significantly speeds up Vision-Language Models without sacrificing much accuracy. By intelligently pruning the image tokens, they achieved substantial speedups, up to 1.87x, while maintaining almost all of the original performance. This means these models can be used more easily in real-world applications where speed and efficiency are crucial, and even with very few image tokens, the model still performs well.

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, existing methods lack a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.