Token-Efficient Long Video Understanding for Multimodal LLMs
Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
2025-03-07
Summary
This paper introduces STORM, a new method that helps AI understand long videos more accurately and efficiently by using a dedicated step to process video frames together.
What's the problem?
Current AI systems that work with videos often treat each frame separately, which makes it hard for them to understand how things change over time in long videos. This approach also uses a lot of computing power, making long videos slow and expensive to process.
What's the solution?
The researchers created STORM, which adds a new component to the AI system that looks at how things change between video frames. It uses something called a Mamba State Space Model to combine information from different frames. STORM also finds ways to reduce the amount of data the system needs to process without losing important information about what's happening in the video.
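The idea above can be sketched with plain array operations. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the running average stands in for the Mamba temporal layer, and the shapes and pooling factors (2x temporal, 2x2 spatial) are assumptions chosen to show how token reduction works once each frame's tokens already carry temporal context.

```python
import numpy as np

# Hypothetical shapes: 8 frames, a 16x16 grid of patch tokens per frame, 64-dim tokens.
T, H, W, D = 8, 16, 16, 64
rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((T, H, W, D))  # per-frame image-encoder tokens

# Temporal mixing (a stand-in for the Mamba state-space layer): a causal running
# average, so each frame's tokens absorb history from all earlier frames.
mixed = np.cumsum(frame_tokens, axis=0) / np.arange(1, T + 1).reshape(T, 1, 1, 1)

# Temporal pooling: average every 2 consecutive frames -> T/2 "frames" of tokens.
t_pooled = mixed.reshape(T // 2, 2, H, W, D).mean(axis=1)

# Spatial pooling: average 2x2 patch neighborhoods -> (H/2) x (W/2) tokens per frame.
s_pooled = t_pooled.reshape(T // 2, H // 2, 2, W // 2, 2, D).mean(axis=(2, 4))

tokens_in = T * H * W
tokens_out = s_pooled.shape[0] * s_pooled.shape[1] * s_pooled.shape[2]
print(tokens_in, tokens_out, tokens_in // tokens_out)  # 2048 256 8
```

Because the pooled tokens were enriched with inter-frame information first, averaging them discards far less temporal detail than pooling raw per-frame tokens would; the LLM then sees 8x fewer tokens in this toy configuration.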
Why it matters?
This matters because it allows AI to understand long videos much better and faster than before. It could help improve things like video search, content moderation, or even helping robots understand their surroundings. By making the process more efficient, it also means AI video understanding can be used in more places without needing extremely powerful computers.
Abstract
Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4×–2.9× for a fixed number of input frames. The project page is available at https://research.nvidia.com/labs/lpr/storm.
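The test-time sampling strategy mentioned in the abstract can also be sketched in a few lines. This is an illustrative assumption of how such sampling might look, not the paper's code: since the temporal encoder has already propagated inter-frame information into every frame's tokens, a strided subset of frames can be handed to the LLM at inference time; the stride and shapes below are hypothetical.

```python
import numpy as np

# Hypothetical token tensor after the temporal encoder:
# 32 frames, 256 tokens per frame, 64-dim tokens.
T, N, D = 32, 256, 64
enriched = np.zeros((T, N, D))  # temporally enriched tokens (placeholder values)

# Test-time temporal sampling: keep every 4th frame. No retraining is needed in
# this sketch because dropped frames' information was already mixed into the
# frames that remain.
stride = 4
sampled = enriched[::stride]

print(sampled.shape, sampled.shape[0] * N)  # (8, 256, 64) 2048
```

The LLM decodes over 2048 tokens instead of 8192 in this toy setting, which is where the latency savings for a fixed number of input frames come from.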