LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
2025-06-30
Summary
This paper introduces LLaVA-Scissor, a token compression strategy that lets large language models which understand video run faster and more efficiently by compressing the visual information they process without losing important details.
What's the problem?
Videos contain many frames, and each frame produces many visual tokens. Processing all of this information overwhelms AI models, which have a limited capacity for long streams of data, making video understanding slow and inefficient.
What's the solution?
The researchers introduce a method called Semantic Connected Components (SCC), which groups visual tokens that carry similar meaning into connected regions and compresses each region into a smaller set of tokens. This preserves the video's meaning and important details while sharply reducing the amount of data the model has to process, making video understanding both more effective and faster than previous compression methods. A rough sketch of the idea appears below.
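To make the idea concrete, here is a minimal sketch of connected-component-based token compression. This is not the authors' implementation: the cosine-similarity measure, the threshold `tau`, and mean-pooling each component into one token are illustrative assumptions.

```python
import numpy as np

def scc_compress(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Illustrative Semantic Connected Components compression.

    tokens: (N, D) array of visual token features (assumed already extracted).
    tau:    hypothetical similarity threshold above which two tokens are
            treated as semantically connected.
    Returns one merged token per connected component.
    """
    # Cosine similarity between all pairs of tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Adjacency graph: tokens are connected if their similarity exceeds tau.
    adj = sim > tau

    # Label connected components with a simple graph traversal.
    n = tokens.shape[0]
    labels = -np.ones(n, dtype=int)
    comp = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = comp
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1

    # Represent each component by the mean of its member tokens.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(comp)])

# Example: 100 tokens of dimension 64 compressed to far fewer representatives.
compressed = scc_compress(np.random.randn(100, 64).astype(np.float32))
print(compressed.shape)
```

The key property this sketch illustrates is that every original token belongs to some component, so the compressed set still covers all the semantic regions of the input rather than just the most salient ones.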
Why it matters?
This matters because it lets AI systems understand and analyze long or complex videos more quickly, improving applications such as video search, automatic video summarization, and interactive video tasks.
Abstract
LLaVA-Scissor is a token compression strategy for video multimodal large language models that uses Semantic Connected Components to compress tokens effectively while maintaining semantic coverage, outperforming other compression methods.