Small Vision-Language Models are Smart Compressors for Long Video Understanding
Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou, Wei Wen, Junlin Han, Mingchen Zhuge, Saksham Suri, Qi Qian, Shuming Liu, Lemeng Wu, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Chenchen Zhu
2026-04-10
Summary
This paper introduces Tempo, a new method for helping AI models understand very long videos, such as those lasting an hour or more.
What's the problem?
Current AI models struggle with long videos because they have a limited 'memory': they can only process a certain amount of information at once. Videos contain a huge amount of visual information, and simply showing the AI everything quickly fills up that memory. Existing methods for shortening videos often throw away important details or waste space on unimportant parts, making it hard for the AI to understand what's happening.
What's the solution?
Tempo solves this by using a smaller AI model to first 'compress' the video, focusing on the parts that matter most for the user's question. It acts like a smart editor that highlights key moments and summarizes the rest. The compression isn't random: it allocates more 'attention' to segments that are relevant to the question being asked, while quickly summarizing the less important sections. The whole process runs in a single pass and requires no extra training.
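To make the idea concrete, here is a minimal sketch of query-aware token budgeting in Python. It is not Tempo's actual implementation: the `Segment.relevance` field stands in for the small model's relevance estimate for the question, and `allocate_tokens` is a hypothetical allocation rule that keeps a tiny 'anchor' for every segment while spending the rest of the budget on the most relevant ones.

```python
# Minimal sketch of the "smart editor" idea: score each video segment against
# the user's question, then spend more of a fixed token budget on relevant
# segments and summarize the rest. Scores and the allocation rule are
# illustrative stand-ins, not the paper's method.

from dataclasses import dataclass

@dataclass
class Segment:
    index: int          # position of the segment in the video
    relevance: float    # hypothetical relevance score in [0, 1]

def allocate_tokens(segments, total_budget, anchor_tokens=1):
    """Split a fixed token budget between query-relevant segments and
    tiny 'anchor' summaries of everything else (hypothetical rule)."""
    # Reserve a small anchor for every segment so the global storyline survives.
    reserved = anchor_tokens * len(segments)
    spare = max(total_budget - reserved, 0)

    # Distribute the remaining budget in proportion to relevance.
    total_rel = sum(s.relevance for s in segments) or 1.0
    budgets = {}
    for s in segments:
        budgets[s.index] = anchor_tokens + int(spare * s.relevance / total_rel)
    return budgets

# Toy example: five segments, the fourth one matters for the question.
segments = [Segment(0, 0.05), Segment(1, 0.10), Segment(2, 0.05),
            Segment(3, 0.70), Segment(4, 0.10)]
print(allocate_tokens(segments, total_budget=64))
```

In this toy run, the query-relevant segment 3 receives most of the 64-token budget, while the background segments keep only a few tokens each as anchors.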
Why it matters?
This work is important because it shows that AI can understand very long videos without massive amounts of computing power or an ever-larger context 'memory'. It demonstrates that focusing on *what* the AI needs to know, rather than feeding it everything, is the key to understanding long-form video content, and on long-video benchmarks the approach outperforms much larger models such as GPT-4o and Gemini 1.5 Pro.
Abstract
Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
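As a rough illustration of what enforcing a strict visual budget with a dynamic tokens-per-frame rate could look like, the sketch below maps per-segment relevance scores to rates clamped to the 0.5-16 tokens/frame range mentioned in the abstract and rescales them if the total would exceed an 8K-token budget. The allocation rule itself is an assumption for illustration only, not the paper's ATA algorithm.

```python
# Rough sketch of strict-budget token allocation. The clamp range
# (0.5-16 tokens/frame) and the 8K total come from the abstract; the
# relevance-to-rate mapping is a hypothetical rule, not Tempo's ATA.

def per_frame_rates(relevance, frames_per_segment, total_budget=8192,
                    min_rate=0.5, max_rate=16.0):
    """Map per-segment relevance scores to tokens-per-frame rates, clamped to
    [min_rate, max_rate] and rescaled so the total respects the budget."""
    # Start with rates proportional to relevance, spanning the allowed range.
    peak = max(relevance) or 1.0
    rates = [min_rate + (max_rate - min_rate) * r / peak for r in relevance]

    # If the implied total exceeds the budget, shrink rates toward the floor.
    total = sum(rate * n for rate, n in zip(rates, frames_per_segment))
    if total > total_budget:
        floor = sum(min_rate * n for n in frames_per_segment)
        scale = (total_budget - floor) / max(total - floor, 1e-9)
        rates = [min_rate + (rate - min_rate) * scale for rate in rates]
    return rates

# Toy example: four segments of 256 frames each, one highly query-relevant.
rates = per_frame_rates([0.1, 0.05, 0.9, 0.1], [256, 256, 256, 256])
print([round(r, 2) for r in rates], "->",
      round(sum(r * 256 for r in rates)), "tokens")
```

Under these assumptions, the query-relevant segment keeps a dense 16 tokens/frame while the others drop to low single-digit rates, and the total stays well under the 8K visual budget.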