Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
2025-07-11
Summary
This paper introduces a training-free way to make video large language models faster by reducing the number of visual tokens they have to process, without retraining the model.
What's the problem?
Video LLMs represent each frame as many visual tokens, so processing a full video takes a lot of computing power and time. Speeding them up without losing accuracy is hard, because removing too many tokens can make the model miss important details.
What's the solution?
The researchers designed a multi-granular merging method that combines visually similar tokens across both space (within a frame) and time (across frames). Merging shrinks the number of visual tokens the language model must process, cutting computation while keeping the important information, and it works on pretrained models with no retraining, as sketched below.
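To make the idea concrete, here is a minimal sketch of similarity-based token merging in PyTorch. The function names (`bipartite_merge`, `spatio_temporal_merge`), the greedy even/odd matching rule, and the threshold values are illustrative assumptions, not the paper's actual multi-granular algorithm; the sketch only shows how merging similar tokens shrinks the sequence the language model sees.

```python
import torch
import torch.nn.functional as F


def bipartite_merge(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge each odd-indexed token into its most similar
    even-indexed token when their cosine similarity exceeds `threshold`.

    tokens: (N, D) tensor of visual tokens; returns (M, D) with M <= N.
    """
    src, dst = tokens[1::2], tokens[0::2]           # split into two sets
    sims = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    best_sim, best_dst = sims.max(dim=-1)           # best match per src token

    keep = best_sim < threshold                     # dissimilar tokens survive
    merged = dst.clone()
    counts = torch.ones(len(dst), 1)
    for s_idx in (~keep).nonzero(as_tuple=True)[0]:
        d_idx = best_dst[s_idx]
        merged[d_idx] += src[s_idx]                 # accumulate for averaging
        counts[d_idx] += 1
    merged = merged / counts                        # mean of each merged group
    return torch.cat([merged, src[keep]], dim=0)


def spatio_temporal_merge(video_tokens: torch.Tensor,
                          spatial_thr: float = 0.9,
                          temporal_thr: float = 0.95) -> torch.Tensor:
    """video_tokens: (T, N, D) tokens for T frames of N tokens each.
    Merges within each frame first (space), then across the
    concatenated sequence (time). Returns a flat (M, D) tensor.
    """
    per_frame = [bipartite_merge(f, spatial_thr) for f in video_tokens]
    return bipartite_merge(torch.cat(per_frame, dim=0), temporal_thr)
```

Running a spatial pass per frame followed by a temporal pass over the concatenated sequence mirrors the paper's two-axis reduction only at a very coarse level; the actual method operates at multiple granularities, but the effect is the same in spirit: fewer tokens reach the language model, so inference gets cheaper.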
Why does it matter?
This matters because it lets powerful video AI models run faster and more efficiently on existing hardware, without any extra training, making advanced video understanding more accessible and practical.
Abstract
A training-free method merges spatio-temporal tokens in video large language models to reduce computational cost while maintaining accuracy.