HoliTom: Holistic Token Merging for Fast Video Large Language Models

Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

2025-05-28

Summary

This paper introduces HoliTom, a method that speeds up video large language models by cutting down the number of video tokens they have to process.

What's the problem?

Video language models have to process a huge number of tokens, which makes them slow and computationally expensive, especially when every detail of every frame is passed to the model.

What's the solution?

The researchers created HoliTom, which combines two steps. First, before the tokens reach the language model, it segments the video over time and prunes redundant information (outer-LLM pruning). Then, inside the language model, it merges tokens that are similar to each other (inner-LLM merging), so the model doesn't spend computation on content that's basically the same.
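To make the idea of merging similar tokens concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: the greedy grouping rule, the function names `cosine` and `merge_similar_tokens`, and the `threshold` value are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_similar_tokens(tokens, threshold=0.9):
    """Hypothetical greedy merge (an illustrative assumption, not HoliTom itself).

    Each token joins the first existing group whose centroid has cosine
    similarity >= threshold; otherwise it starts a new group. The output
    is one averaged vector per group.
    """
    groups = []  # each group keeps a running sum and a member count
    for tok in tokens:
        for g in groups:
            centroid = [s / g["count"] for s in g["sum"]]
            if cosine(tok, centroid) >= threshold:
                g["sum"] = [s + t for s, t in zip(g["sum"], tok)]
                g["count"] += 1
                break
        else:
            # No sufficiently similar group was found.
            groups.append({"sum": list(tok), "count": 1})
    return [[s / g["count"] for s in g["sum"]] for g in groups]

# Two nearly identical tokens collapse into one; the third stays separate.
merged = merge_similar_tokens([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

In this toy input, three tokens become two, which is the kind of saving that token merging aims for at much larger scale.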

Why does it matter?

This matters because it means AI can analyze and understand videos much more quickly and efficiently, making it easier to use these models for things like video search, content creation, and even real-time video analysis.

Abstract

HoliTom combines outer-LLM pruning through global temporal segmentation with inner-LLM token similarity-based merging to significantly reduce computational inefficiency in video LLMs without sacrificing performance.
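As a rough illustration of what temporal segmentation can look like, the sketch below splits a sequence of per-frame feature vectors wherever adjacent frames stop resembling each other. This simple threshold rule is a stand-in assumption, not the segmentation procedure from the paper, and the names `segment_frames` and `threshold` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_frames(frame_feats, threshold=0.8):
    """Hypothetical segmentation rule (an assumption for illustration).

    Starts a new segment whenever the similarity between consecutive
    frames falls below threshold. Returns half-open (start, end) index
    ranges covering all frames.
    """
    if not frame_feats:
        return []
    segments = []
    start = 0
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(frame_feats)))
    return segments

# Frames 0-1 look alike, frames 2-3 look alike, and there is a cut between them.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
segments = segment_frames(frames)
```

Once frames are grouped this way, redundant tokens within a segment become candidates for pruning before the sequence is handed to the language model.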
