
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra

2024-10-24


Summary

This paper introduces LongVU, a new method designed to help multimodal large language models (MLLMs) better understand long videos by compressing the video data while keeping important visual details.

What's the problem?

Processing long videos is challenging for these models because their context window can only hold a limited number of tokens at once. When a long video is sampled into many frames, the visual tokens either overflow that limit or must be cut down so aggressively that important details are lost, leading to poor understanding of the video content.

What's the solution?

LongVU uses a technique called spatiotemporal adaptive compression, which reduces the number of video tokens (pieces of data representing video frames) without losing key visual information. It does this by detecting and removing nearly identical frames (using DINOv2 features) and by using text-guided cross-modal queries to keep full detail only for the frames most relevant to the question, while reducing the spatial tokens of the remaining frames based on how much they change over time. This allows the model to process many frames efficiently while maintaining high-quality understanding.
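As a rough illustration of the temporal side of this idea (not the authors' code), the sketch below drops frames whose pooled features are nearly identical to the last frame that was kept. The function name, similarity threshold, and use of random features are all hypothetical; the paper uses DINOv2 features for this step.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9):
    """Drop frames that are nearly identical to the last frame we kept.

    frame_feats: (T, D) tensor of pooled per-frame features
                 (e.g. DINOv2-style embeddings).
    Returns the indices of frames that survive temporal pruning.
    """
    feats = F.normalize(frame_feats, dim=-1)     # unit-length features
    kept = [0]                                   # always keep the first frame
    for t in range(1, feats.shape[0]):
        sim = torch.dot(feats[t], feats[kept[-1]])
        if sim < sim_threshold:                  # frame adds new visual content
            kept.append(t)
    return kept

# Example: 64 sampled frames with 768-dim features
frames = torch.randn(64, 768)
print(prune_redundant_frames(frames))
```

In practice the surviving frames would then go through the text-guided and spatial reduction steps, so the total token count fits the model's context length.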

Why it matters?

This research is significant because it improves how AI models analyze long videos, making them more effective for tasks like video summarization and content analysis. By enhancing video understanding capabilities, LongVU can be applied in various fields such as entertainment, education, and surveillance, where analyzing long video content is essential.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
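To make the text-guided selection step concrete, here is a minimal, hypothetical sketch of how a cross-modal query could decide which frames keep their full token grids and which get pooled down to a single token. The scoring rule, keep ratio, and mean pooling are simplifications for illustration, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def query_guided_reduction(frame_tokens: torch.Tensor,
                           text_query: torch.Tensor,
                           keep_ratio: float = 0.25):
    """Keep full token grids only for frames most relevant to the text query.

    frame_tokens: (T, N, D) visual tokens for T frames.
    text_query:   (D,) pooled text embedding of the user question.
    Returns a list of per-frame token tensors of varying length.
    """
    T, N, D = frame_tokens.shape
    frame_repr = F.normalize(frame_tokens.mean(dim=1), dim=-1)   # (T, D) frame summaries
    query = F.normalize(text_query, dim=-1)
    relevance = frame_repr @ query                               # (T,) similarity to query
    top = set(relevance.topk(max(1, int(T * keep_ratio))).indices.tolist())

    reduced = []
    for t in range(T):
        if t in top:
            reduced.append(frame_tokens[t])                              # keep all N tokens
        else:
            reduced.append(frame_tokens[t].mean(dim=0, keepdim=True))    # compress to 1 token
    return reduced

# Example: 32 frames, 576 tokens each, 1024-dim, with a random "question" embedding
out = query_guided_reduction(torch.randn(32, 576, 1024), torch.randn(1024))
print([x.shape[0] for x in out])  # mix of 576-token and 1-token frames
```

The design intent is the same as in the abstract: frames that matter to the query keep their detail, everything else is compressed, so many more frames fit inside a fixed context length.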