Visual Context Window Extension: A New Perspective for Long Video Understanding

Hongchen Wei, Zhenzhong Chen

2024-10-02

Summary

This paper introduces a method called Visual Context Window Extension, which helps large multimodal models (LMMs) understand long videos better without retraining them on long video data.

What's the problem?

While LMMs perform well on short videos, they struggle with long ones because they can only process a limited amount of information at once, within what is called a context window. This makes it hard for them to keep track of what happens over time in longer videos, and current solutions require large amounts of data and computing power to train on long video datasets.

What's the solution?

The authors propose extending the visual context window of LMMs so they can understand longer videos without retraining on large long video datasets. They analyze why LMMs have difficulty with long videos and find that visual and language tokens effectively have different context windows, so visual tokens cannot simply be extended to match the language context window; instead, the visual context window itself is extended. To keep memory use manageable, they introduce a progressive pooling inference strategy that lowers the spatial resolution of frame embeddings, reducing the number of visual tokens while keeping important spatial details. This allows the model to handle many more frames from long videos.
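To make the progressive pooling idea concrete, here is a minimal sketch in PyTorch. The summary does not spell out the paper's exact pooling schedule, so the segment split, target grid sizes, and the `progressive_pool` function below are illustrative assumptions: later frames are pooled to a coarser grid so they contribute fewer visual tokens.

```python
import torch
import torch.nn.functional as F

def progressive_pool(frame_embeddings, pool_sizes=(24, 16, 12, 8)):
    """Illustrative sketch: pool each frame's patch grid to a progressively
    smaller spatial resolution, so later frames contribute fewer visual tokens.

    frame_embeddings: tensor of shape (num_frames, H, W, D)
    pool_sizes: hypothetical target grid sizes, one per video segment
    Returns a single sequence of visual tokens of shape (num_tokens, D).
    """
    num_frames = frame_embeddings.shape[0]
    # Split the video into as many segments as there are pooling levels.
    segments = torch.chunk(torch.arange(num_frames), len(pool_sizes))
    tokens = []
    for seg, size in zip(segments, pool_sizes):
        if seg.numel() == 0:
            continue
        # (n, H, W, D) -> (n, D, H, W) for adaptive average pooling.
        x = frame_embeddings[seg].permute(0, 3, 1, 2)
        pooled = F.adaptive_avg_pool2d(x, output_size=size)  # (n, D, size, size)
        seq = pooled.flatten(2).transpose(1, 2)               # (n, size*size, D)
        tokens.append(seq.reshape(-1, seq.shape[-1]))
    return torch.cat(tokens, dim=0)

# Example: 64 frames of 24x24 patch embeddings with dimension 1024.
frames = torch.randn(64, 24, 24, 1024)
visual_tokens = progressive_pool(frames)
print(visual_tokens.shape)  # far fewer tokens than the unpooled 64 * 24 * 24
```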

Why it matters?

This research is significant because it improves how AI models can analyze and understand long videos, which is important for applications like video surveillance, content creation, and autonomous driving. By making it easier for these models to work with longer content, we can enhance their performance in real-world tasks.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves the performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
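As a companion sketch, the snippet below illustrates the general idea of fitting a long visual-token sequence into a fixed context window by interpolating position indices, a standard context-window extension trick. The abstract does not describe the paper's exact mechanism, so the window size, token counts, and the `interpolated_positions` function here are hypothetical.

```python
import torch

def interpolated_positions(num_visual_tokens: int, visual_context_window: int) -> torch.Tensor:
    """Hypothetical sketch of context-window extension via position interpolation:
    rescale the positions of an over-long visual-token sequence so they stay
    inside the window the model was trained with, instead of running past it."""
    positions = torch.arange(num_visual_tokens, dtype=torch.float32)
    if num_visual_tokens <= visual_context_window:
        return positions
    # Linearly compress positions so the last token lands at the window edge.
    return positions * (visual_context_window / num_visual_tokens)

# Example: 256 frames x 144 tokens per frame, squeezed into an assumed
# 8,192-token visual context window.
pos = interpolated_positions(256 * 144, 8192)
print(pos[0].item(), pos[-1].item())  # 0.0 ... roughly 8191.8
```

In practice, this kind of rescaling is applied inside the model's positional encoding; the point of the sketch is only to show how visual-token positions can be kept within the trained range without any retraining.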