Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu

2024-06-25

Summary

This paper introduces a technique called long context transfer, which helps large multimodal models (LMMs) understand much longer video sequences. It also presents Long Video Assistant (LongVA), a model that can process thousands of video frames (over 200,000 visual tokens) in a single pass.

What's the problem?

Current LMMs struggle with extremely long videos because every frame is converted into visual tokens (pieces of information extracted from the image), and a long video produces far more tokens than the language model's context window can hold. The common workaround is to reduce the number of visual tokens, for example with visual resamplers, but that often discards important information.
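A rough back-of-the-envelope count makes the scale of the problem concrete. The sketch below assumes a per-frame budget of 144 visual tokens, which is an illustrative figure rather than a number reported in the paper:

```python
# Rough token budget for a long video (illustrative numbers only).
tokens_per_frame = 144   # assumed: typical for CLIP-style patch encoders
frames_sampled = 2000    # the frame count LongVA reports handling

visual_tokens = tokens_per_frame * frames_sampled
print(f"{visual_tokens:,} visual tokens")  # 288,000
# Most language backbones ship with context windows of a few thousand to a
# few tens of thousands of tokens, so a sequence this long simply does not
# fit without extending the context length.
```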

What's the solution?

Instead of shrinking the number of visual tokens, the researchers extended the context length of the language model at the core of the multimodal system. Once the language backbone is trained to handle much longer text, the full model can accept orders of magnitude more visual tokens without any video-specific training; the authors call this effect long context transfer. To measure how well LMMs retrieve information from long visual contexts, they also built V-NIAH (Visual Needle-In-A-Haystack), a synthetic benchmark modeled on the language-model needle-in-a-haystack test. The resulting model, LongVA, can process 2000 frames (over 200,000 visual tokens) and, by densely sampling more input frames, achieves state-of-the-art results on Video-MME among 7B-scale models.
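As a rough picture of how such a pipeline is wired together, here is a minimal sketch; the helper names, the stub encoder, and the per-frame token budget are placeholders for illustration, not LongVA's actual API:

```python
from typing import List

TOKENS_PER_FRAME = 144  # assumed per-frame budget (placeholder, as above)

def encode_frame(frame: object) -> List[int]:
    """Placeholder vision encoder: one frame -> a fixed run of visual tokens."""
    return [0] * TOKENS_PER_FRAME

def build_multimodal_sequence(frames: List[object],
                              question_tokens: List[int]) -> List[int]:
    """Concatenate every frame's visual tokens with the text prompt.

    The key idea of long context transfer: the language backbone is extended
    on long *text* only, yet at inference it can accept this very long mixed
    sequence of visual and text tokens with no video-specific training.
    """
    visual_tokens: List[int] = []
    for frame in frames:
        visual_tokens.extend(encode_frame(frame))
    return visual_tokens + question_tokens

# 2000 densely sampled frames -> ~288K tokens under the assumed budget,
# all handed to the extended-context language model in one pass.
sequence = build_multimodal_sequence([object()] * 2000, [0] * 32)
print(f"{len(sequence):,} tokens")
```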

Why it matters?

This research is important because it significantly improves how AI models can understand and analyze long videos, which are increasingly common in various applications like streaming services and surveillance. By enhancing LMMs' capabilities, this work could lead to better tools for video analysis and interpretation, benefiting fields like entertainment, security, and education.

Abstract

Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
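The V-NIAH setup can be pictured as follows; this is a minimal sketch under assumed details (frame representation, sweep, scoring), not the benchmark's actual implementation:

```python
import random
from typing import List, Tuple

def build_vniah_trial(haystack_frames: List[str],
                      needle_frame: str) -> Tuple[List[str], int]:
    """Assemble one synthetic needle-in-a-haystack trial (illustrative only).

    A single 'needle' frame containing the answer to a probe question is
    spliced into an otherwise unrelated long video; the model must locate
    the relevant content among hundreds of thousands of visual tokens.
    """
    depth = random.randrange(len(haystack_frames) + 1)
    frames = haystack_frames[:depth] + [needle_frame] + haystack_frames[depth:]
    return frames, depth   # depth is the ground-truth needle position

# Usage sketch: sweep total video length and needle depth, ask the model the
# probe question for each trial, and record accuracy per (length, depth) cell.
frames, depth = build_vniah_trial([f"frame_{i}" for i in range(2000)], "needle")
print(len(frames), depth)
```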