
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou

2025-06-30


Summary

This paper introduces LLaVA-Scissor, a method that makes video-understanding large language models faster and more efficient by compressing the visual information they process without losing important details.

What's the problem?

The problem is that videos contain so many frames and details that representing them produces a very large number of visual tokens. Processing all of these tokens overwhelms AI models, which can only handle limited-length streams of data, making them slow and inefficient.

What's the solution?

The researchers introduce a method called Semantic Connected Components (SCC), which groups visual tokens that share the same meaning into connected regions and compresses each group into fewer representative tokens. This preserves the video's semantic content and important details while greatly reducing the amount of data the model has to process, making video understanding faster and more effective than previous compression methods. A simplified sketch of this idea is shown below.
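
To make the idea concrete, here is a minimal sketch of compressing tokens via connected components of a semantic-similarity graph. It assumes visual tokens are given as a NumPy array of embeddings and uses a cosine-similarity threshold to decide which tokens are connected; the function name, the threshold value, and the choice to merge each component into its mean token are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: token compression with semantic connected components.
# Assumptions (not taken from the paper's code): `tokens` is an (N, D) array of
# visual token embeddings, `threshold` is a cosine-similarity cutoff for linking
# two tokens, and each connected component is merged into its mean embedding.
import numpy as np


def compress_tokens(tokens: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Merge semantically connected tokens into one representative per component."""
    n = tokens.shape[0]

    # Cosine similarity between every pair of tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Union-find over the graph whose edges are pairs with similarity >= threshold.
    parent = np.arange(n)

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    # One compressed token per connected component: the mean of its members.
    roots = np.array([find(i) for i in range(n)])
    return np.stack([tokens[roots == r].mean(axis=0) for r in np.unique(roots)])


if __name__ == "__main__":
    # Toy example: 200 random "frame tokens" compressed to fewer representatives.
    rng = np.random.default_rng(0)
    toks = rng.normal(size=(200, 64))
    print(toks.shape, "->", compress_tokens(toks, threshold=0.3).shape)
```

In this toy version, raising the threshold keeps more distinct tokens, while lowering it merges more of them, which is the basic accuracy-versus-compression trade-off the paper's method is designed to manage.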

Why it matters?

This matters because it allows AI models to understand and analyze long or complex videos more quickly, which can improve applications such as video search, automatic video summarization, and interactive video tasks.

Abstract

LLaVA-Scissor is a token compression strategy for video multimodal large language models. It uses Semantic Connected Components to compress tokens effectively while maintaining semantic coverage, and it outperforms other compression methods.