VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng
2024-09-04

Summary
This paper introduces VideoLLaMB, a new framework that helps video-language models understand long videos by using a memory system that keeps track of important information over time.
What's the problem?
Current video-language models struggle with understanding long videos because they require a lot of computing power and often lack enough labeled data for training. This makes it hard for researchers to develop effective tools for analyzing and interacting with video content.
What's the solution?
VideoLLaMB addresses these challenges with temporal memory tokens, which let the model carry important details forward as a video progresses. It also splits videos into smaller semantic parts using a technique called SceneTilling, which makes the content easier to manage and analyze; a minimal sketch of this kind of segmentation is shown below. The model performs significantly better than existing models, improving by 5.5 points across three video question answering benchmarks and by 2.06 points on egocentric planning.
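To make the segmentation idea concrete, here is a minimal sketch of a SceneTilling-style split into semantic segments, assuming per-frame embeddings from a visual encoder. The function name, depth score, and threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def segment_by_semantics(frame_feats: np.ndarray, depth_threshold: float = 0.5):
    """Split a video into semantic segments.

    frame_feats: (num_frames, dim) array of per-frame embeddings
    (e.g., from a frozen visual encoder). Boundaries are placed where
    the cosine similarity between adjacent frames dips sharply.
    """
    # Cosine similarity between consecutive frames.
    normed = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)  # shape: (num_frames - 1,)

    # "Depth" score: how far each similarity dips below its neighbors,
    # in the spirit of TextTiling-style boundary detection.
    left = np.maximum.accumulate(sims)
    right = np.maximum.accumulate(sims[::-1])[::-1]
    depth = (left - sims) + (right - sims)

    # Cut wherever the depth score exceeds the (illustrative) threshold.
    cut_points = [i + 1 for i, d in enumerate(depth) if d > depth_threshold]

    # Convert cut points into (start, end) frame-index pairs.
    starts = [0] + cut_points
    ends = cut_points + [len(frame_feats)]
    return list(zip(starts, ends))

# Example: 16 random frame features of dimension 512.
segments = segment_by_semantics(np.random.randn(16, 512))
print(segments)
```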
Why it matters?
This research is important because it enhances our ability to analyze and interact with videos in real-time, which can be useful in many areas like education, entertainment, and surveillance. By improving how AI understands video content, VideoLLaMB sets the stage for more advanced applications that can benefit from detailed video analysis.
Abstract
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains performance as robust as PLLaVA's even as video length increases up to 8 times. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without requiring additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness and setting a new foundation for long-form video-language models in both academic and practical applications.
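To illustrate how recurrent memory tokens within bridge layers could propagate history across segments, the sketch below prepends a fixed set of memory tokens to each segment, refines them with a transformer layer, and carries them forward. Because each segment is processed with a fixed-size memory, GPU memory grows roughly linearly with the number of segments. The class name, layer choice, and token count are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentMemoryBridge(nn.Module):
    """Hypothetical bridge layer with recurrent memory tokens.

    A fixed set of learnable memory tokens is concatenated with each
    video segment's features, refined by a transformer encoder layer,
    and carried over to the next segment.
    """

    def __init__(self, dim: int = 768, num_memory_tokens: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(1, num_memory_tokens, dim))
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.num_memory_tokens = num_memory_tokens

    def forward(self, segments):
        """segments: list of (batch, seg_len, dim) tensors, one per
        semantic segment, in temporal order."""
        batch = segments[0].size(0)
        memory = self.memory.expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            # Prepend the current memory tokens to this segment.
            x = self.layer(torch.cat([memory, seg], dim=1))
            # Updated memory tokens carry history into the next segment.
            memory = x[:, : self.num_memory_tokens]
            outputs.append(x[:, self.num_memory_tokens :])
        return outputs, memory

# Example: 4 segments of 16 frame tokens each, batch size 1.
bridge = RecurrentMemoryBridge()
segs = [torch.randn(1, 16, 768) for _ in range(4)]
outs, mem = bridge(segs)
print(mem.shape)  # torch.Size([1, 8, 768])
```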