Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui
2025-11-05
Summary
This paper is about making large AI models that can understand both text and images work faster and more efficiently, specifically when processing images.
What's the problem?
These powerful AI models are slow because each image is broken down into hundreds or even thousands of pieces (tokens) for the AI to analyze. Existing methods for reducing the number of these image pieces, such as removing unimportant ones or merging similar ones, haven't been evaluated in a consistent way, making it hard to know which methods actually work best.
What's the solution?
The researchers created a testing platform called UniPruneBench. This platform provides a standard way to evaluate different methods for reducing the number of image pieces the AI needs to process. They tested ten different methods on three popular AI model families across ten datasets, measuring not just how accurate the AI is but also how quickly it responds. They found that simply removing pieces at random is surprisingly effective, that no single method is best in all situations, and that tasks like reading text from images (OCR) are the most sensitive to compression.
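To make the "random removal" baseline concrete, here is a minimal sketch of what random visual-token pruning looks like in code. This is an illustration only, not the paper's implementation: the function name `random_prune` and the `keep_ratio` parameter are invented for this example, and the tokens are stand-in strings rather than real embeddings.

```python
import random

def random_prune(visual_tokens, keep_ratio, seed=None):
    """Keep a random subset of visual tokens (illustrative sketch).

    visual_tokens: a sequence of token embeddings (any objects here)
    keep_ratio: fraction of tokens to keep, in (0, 1]
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(visual_tokens) * keep_ratio))
    # Sample which indices to keep, then sort them so the surviving
    # tokens stay in their original spatial order, which the language
    # model's position handling may rely on.
    kept = sorted(rng.sample(range(len(visual_tokens)), n_keep))
    return [visual_tokens[i] for i in kept]

# Example: a 24x24 patch grid yields 576 visual tokens (LLaVA-style);
# keeping 25% of them leaves 144.
tokens = [f"tok{i}" for i in range(576)]
pruned = random_prune(tokens, keep_ratio=0.25, seed=0)
print(len(pruned))  # 144
```

The pruning ratio (here, the 75% of tokens dropped) is exactly the knob the benchmark identifies as the dominant factor in how much accuracy degrades.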
Why it matters?
This work is important because it provides a reliable way to compare and improve the efficiency of AI models that work with both images and text. By understanding which methods work best and why, researchers can build faster and more practical AI systems for a variety of applications.
Abstract
Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.