VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
2024-12-03

Summary
This paper introduces VISTA, a new framework designed to enhance the understanding of long-duration and high-resolution videos by creating synthetic video data from existing datasets.
What's the problem?
Large multimodal models (LMMs) struggle to process and understand long or high-resolution videos effectively. This is mainly due to a lack of high-quality datasets that provide enough varied examples for training. Without sufficient data, these models cannot learn to interpret complex video content accurately.
What's the solution?
VISTA addresses this issue by using a method called Video Spatiotemporal Augmentation to create new, synthetic videos from existing video-caption datasets. It combines existing videos spatially (arranging clips side by side within a single frame) and temporally (concatenating clips one after another) to generate longer and higher-resolution videos, as sketched in the example below. VISTA then produces question-answer pairs about these newly synthesized videos, giving models instruction-following data to learn from. The researchers developed seven augmentation methods and curated a dataset called VISTA-400K specifically for training models on long-duration and high-resolution video tasks. They also introduced a new benchmark called HRVideoBench to evaluate how well models understand high-resolution videos.
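To make the spatial and temporal combination concrete, here is a minimal sketch of the general idea, not the paper's actual implementation: it assumes videos are represented as NumPy arrays of shape (frames, height, width, channels), concatenates clips along the time axis to extend duration, and tiles clips into a grid to increase resolution.

```python
import numpy as np

def temporal_concat(clips):
    """Concatenate short clips along the time axis to form a longer video.

    Each clip is an array of shape (T, H, W, C); clips are assumed to share
    the same spatial resolution.
    """
    return np.concatenate(clips, axis=0)

def spatial_grid(clips, rows=2, cols=2):
    """Tile clips into a rows x cols grid per frame, yielding a higher-resolution video.

    All clips are assumed to have the same shape (T, H, W, C); the result has
    shape (T, rows*H, cols*W, C).
    """
    T, H, W, C = clips[0].shape
    grid = np.zeros((T, rows * H, cols * W, C), dtype=clips[0].dtype)
    for idx, clip in enumerate(clips[: rows * cols]):
        r, c = divmod(idx, cols)
        grid[:, r * H:(r + 1) * H, c * W:(c + 1) * W, :] = clip
    return grid

# Example: four 8-frame 112x112 RGB clips become one 32-frame video (longer duration)
# and one 224x224 video (higher resolution).
clips = [np.random.randint(0, 256, (8, 112, 112, 3), dtype=np.uint8) for _ in range(4)]
long_video = temporal_concat(clips)   # shape (32, 112, 112, 3)
hires_video = spatial_grid(clips)     # shape (8, 224, 224, 3)
```

The actual framework applies seven different augmentation methods of this kind and pairs each synthesized video with generated question-answer annotations.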
Why it matters?
This research is significant because it provides a way to improve AI's ability to understand complex video content, which is increasingly important in fields like entertainment, education, and security. By enhancing the quality and quantity of training data available for long-duration and high-resolution videos, VISTA can help develop more accurate and capable AI systems for analyzing video information.
Abstract
Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
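The abstract describes pairing each synthesized video with question-answer annotations to form instruction-following data. Purely as an illustration, a single training record might be organized as below; the field names and values are assumptions for exposition, not the actual VISTA-400K schema.

```python
# Hypothetical structure of one synthesized instruction-following example.
# Field names and contents are illustrative assumptions, not the real VISTA-400K format.
example = {
    "video": "synthetic/long_video_000123.mp4",    # spatiotemporally augmented video
    "source_clips": ["clip_a.mp4", "clip_b.mp4"],  # original videos that were combined
    "question": "What happens in the second half of the video?",
    "answer": "A person finishes assembling the shelf and places books on it.",
}
```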