LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
2025-06-30
Summary
This paper introduces LLaVA-Scissor, a token compression strategy that lets large language models which understand video run faster and more efficiently by compressing the visual information they process without losing important details.
What's the problem?
Videos contain many frames, and each frame produces many visual tokens. Processing all of this information overwhelms AI models, which have a limited capacity for long streams of data, making video understanding slow and inefficient.
What's the solution?
The researchers introduce a method called Semantic Connected Components (SCC), which groups visual tokens that carry similar meaning into connected regions and compresses each region into a smaller set of tokens. This preserves the video's meaning and important details while sharply reducing the amount of data the model has to process, making video understanding both more effective and faster than previous compression methods. A rough sketch of the idea appears below.
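To make the idea concrete, here is a minimal sketch of connected-component-based token compression. This is not the authors' implementation: the cosine-similarity measure, the threshold `tau`, and mean-pooling each component into one token are illustrative assumptions.

```python
import numpy as np

def scc_compress(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Illustrative Semantic Connected Components compression.

    tokens: (N, D) array of visual token features (assumed already extracted).
    tau:    hypothetical similarity threshold above which two tokens are
            treated as semantically connected.
    Returns one merged token per connected component.
    """
    # Cosine similarity between all pairs of tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Adjacency graph: tokens are connected if their similarity exceeds tau.
    adj = sim > tau

    # Label connected components with a simple graph traversal.
    n = tokens.shape[0]
    labels = -np.ones(n, dtype=int)
    comp = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = comp
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1

    # Represent each component by the mean of its member tokens.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(comp)])

# Example: 100 tokens of dimension 64 compressed to far fewer representatives.
compressed = scc_compress(np.random.randn(100, 64).astype(np.float32))
print(compressed.shape)
```

The key property this sketch illustrates is that every original token belongs to some component, so the compressed set still covers all the semantic regions of the input rather than just the most salient ones.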
Why it matters?
This matters because it lets AI systems understand and analyze long or complex videos more quickly, improving applications such as video search, automatic video summarization, and interactive video tasks.
Abstract
LLaVA-Scissor is a token compression strategy for video multimodal large language models that uses Semantic Connected Components to compress tokens effectively while maintaining semantic coverage, outperforming other compression methods.