Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang
2025-12-02
Summary
This paper focuses on making video-understanding AI models, specifically those that process videos in real time, much faster and more efficient.
What's the problem?
Current video AI models struggle with live video feeds because analyzing each frame takes a lot of computing power. Two bottlenecks cause most of the slowdown: the model repeatedly re-analyzes frames that are nearly identical to their neighbors, and the amount of visual information it must carry grows with the stream, leading to long delays and heavy memory use.
What's the solution?
The researchers developed a system called Streaming Token Compression (STC), which works in two main ways. First, a component called STC-Cacher remembers and reuses information from similar frames to avoid re-analyzing them. Second, a component called STC-Pruner filters out less important visual details before feeding the information to the main AI model, keeping only the most crucial parts of each frame. The system can be added to existing video AI models without major changes.
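The frame-caching idea can be illustrated with a short sketch. This is not the paper's implementation: the `encode_frame` stand-in, the cosine-similarity test, and the `threshold` value are all our own simplifications of the general "skip re-encoding near-identical frames" idea.

```python
import numpy as np

def encode_frame(frame):
    # Stand-in for an expensive ViT encoder: a fixed random projection
    # (illustrative only; the real encoder is a Vision Transformer).
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((frame.size, 8))
    return frame.reshape(-1) @ proj

class FrameFeatureCache:
    """Sketch of the caching idea: if the incoming frame is nearly
    identical to the last encoded one, reuse the cached features
    instead of running the encoder again."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # similarity above this counts as "same frame"
        self.last_frame = None
        self.last_features = None
        self.encodes = 0             # how many times the encoder actually ran

    def similarity(self, a, b):
        a, b = a.reshape(-1), b.reshape(-1)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def get_features(self, frame):
        if (self.last_frame is not None
                and self.similarity(frame, self.last_frame) >= self.threshold):
            return self.last_features            # cache hit: skip re-encoding
        self.last_frame = frame
        self.last_features = encode_frame(frame)
        self.encodes += 1
        return self.last_features
```

Feeding two identical frames triggers only one encoder call; a sufficiently different frame forces a fresh encode. The real system operates on token-level features rather than whole frames, but the control flow is the same.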
Why it matters?
This research is important because it allows for faster and more practical real-time video analysis. By significantly reducing processing time without sacrificing accuracy, it opens the door for applications like quicker video editing, more responsive security systems, and faster analysis of live events.
Abstract
Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
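The pruning step described above (keeping only the most salient tokens by spatial and temporal relevance) can be sketched as follows. The scoring functions here are our own proxies, not the paper's formulation: spatial saliency is approximated by token norm and temporal relevance by the change from the previous frame's token at the same position.

```python
import numpy as np

def prune_tokens(tokens, prev_tokens, keep_ratio=0.5, alpha=0.5):
    """Illustrative token pruner: score each visual token by a blend of
    a spatial saliency proxy (token norm) and a temporal novelty proxy
    (distance from the same position in the previous frame), then keep
    only the highest-scoring fraction before LLM pre-filling."""
    spatial = np.linalg.norm(tokens, axis=1)                 # spatial saliency proxy
    temporal = np.linalg.norm(tokens - prev_tokens, axis=1)  # temporal novelty proxy

    def norm01(x):
        # Rescale each score to [0, 1] so the two terms are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    score = alpha * norm01(spatial) + (1 - alpha) * norm01(temporal)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(score)[-k:])   # top-k indices, in original order
    return tokens[keep], keep
```

With `keep_ratio=0.5`, half the visual tokens are dropped before the LLM sees them, which is where the pre-filling latency reduction comes from; the surviving tokens keep their original spatial order.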