When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

2025-07-28

Summary

This paper surveys methods for reducing the number of tokens (the pieces of information) that multimodal large language models must process when working with images, videos, and audio, making the models more efficient.

What's the problem?

Multimodal large language models must process enormous numbers of tokens derived from images, videos, and audio, which makes inference slow and computationally expensive and limits their practical use.

What's the solution?

The researchers categorize and review existing token compression techniques along two axes: the modality they target (image, video, or audio) and the mechanism they use to compress, clarifying how models decide which parts of the input are important enough to keep and which can be safely discarded.
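To make the keep-or-discard idea concrete, here is a minimal sketch of importance-based token pruning, one common mechanism among the surveyed methods. The scoring values and the `prune_tokens` helper are illustrative assumptions, not an implementation from the paper; in practice, scores often come from attention weights inside the model.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of tokens, preserving order.

    This is a toy illustration: real methods derive `scores` from the
    model itself (e.g., attention received by each visual token).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return [tokens[i] for i in keep]

# Toy example: 8 visual "tokens" with made-up importance scores.
tokens = [f"tok{i}" for i in range(8)]
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.4])
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)  # → ['tok1', 'tok3', 'tok5', 'tok7']
```

Halving the token count this way roughly halves the cost of each attention layer downstream, which is why such pruning can speed up multimodal models substantially when most tokens are redundant.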

Why it matters?

This matters because better token compression helps AI systems process complex multimodal data faster and more efficiently without losing important information, enabling smarter applications in areas like video analysis, speech recognition, and image understanding.

Abstract

A survey of token compression techniques for multimodal large language models, focusing on image, video, and audio data, categorizing methods by modality and mechanism.