VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

2024-06-13

Summary

This paper introduces VideoLLaMA 2, a family of video large language models (Video-LLMs) designed to improve how computers understand video and audio. It focuses on capturing both the visual and the sound side of videos to enhance tasks like answering questions about video content.

What's the problem?

Current video language models often struggle to effectively combine the visual and audio information in videos. This limitation makes it hard for them to understand the full context of what is happening in a video, which is essential for tasks like answering questions or generating captions. Many existing models also do not perform well when handling complex interactions between sound and visuals.

What's the solution?

VideoLLaMA 2 addresses these issues with a module called the Spatial-Temporal Convolution (STC) connector, which helps the model follow not only what appears in each frame but also how scenes move and change over time. It also adds an Audio Branch that is trained jointly with the video side, so the model learns from video and audio together. This joint training makes the model better at analyzing videos and answering questions about them, improving performance on multiple-choice questions, open-ended questions, and video captioning.
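To get a feel for what a connector like this does, here is a minimal sketch of a spatial-temporal convolution connector in PyTorch. The layer sizes, kernel shape, and dimensions are illustrative assumptions, not VideoLLaMA 2's actual configuration; the point is only to show how a 3D convolution can condense per-frame patch features into a shorter token sequence for the language model.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Illustrative sketch of a spatial-temporal convolution connector.

    Takes per-frame patch features from a vision encoder, uses a 3D
    convolution to downsample across time and space, then projects the
    result into the language model's embedding space. All sizes here are
    assumptions for illustration, not the paper's implementation.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, kernel=(2, 2, 2)):
        super().__init__()
        # The 3D conv mixes information across neighboring frames and patches,
        # reducing the number of video tokens fed to the LLM.
        self.stc = nn.Conv3d(vision_dim, vision_dim, kernel_size=kernel, stride=kernel)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, height, width, vision_dim) patch features
        x = frame_feats.permute(0, 4, 1, 2, 3)   # -> (B, C, T, H, W) for Conv3d
        x = self.stc(x)                          # downsample in time and space
        x = x.flatten(2).transpose(1, 2)         # -> (B, num_tokens, C)
        return self.proj(x)                      # project to the LLM embedding size


# Example: 16 frames of 24x24 patch features become a shorter token sequence.
feats = torch.randn(1, 16, 24, 24, 1024)
tokens = STCConnectorSketch()(feats)
print(tokens.shape)  # torch.Size([1, 1152, 4096])
```

The design choice this illustrates is that the connector compresses the video before the language model ever sees it, so temporal structure is captured cheaply instead of forcing the LLM to attend over every frame's full set of patches.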

Why it matters?

This research is important because it sets a new standard for how intelligent systems can analyze videos by integrating both visual and audio information. By improving multimodal understanding, VideoLLaMA 2 can enhance various applications, such as video analysis, content creation, and interactive media experiences. Furthermore, making these models publicly available encourages further research and development in this field.

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
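As a rough illustration of the joint audio-video idea described above (not the authors' code), features from an audio encoder can be projected into the same embedding space as the video tokens and concatenated before the language model. All dimensions and names below are assumptions for the sketch.

```python
import torch
import torch.nn as nn

# Hypothetical "Audio Branch" projection: audio-encoder features are mapped into
# the LLM token space and appended to the video tokens. Dimensions are assumed
# for illustration, not VideoLLaMA 2's real configuration.
audio_proj = nn.Linear(768, 4096)           # audio-encoder dim -> LLM embedding dim
audio_feats = torch.randn(1, 50, 768)       # e.g. 50 frames of audio features
video_tokens = torch.randn(1, 1152, 4096)   # video tokens from the STC connector

audio_tokens = audio_proj(audio_feats)
multimodal_input = torch.cat([video_tokens, audio_tokens], dim=1)
print(multimodal_input.shape)  # torch.Size([1, 1202, 4096]) tokens fed to the LLM
```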