
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

2025-01-23


Summary

This paper introduces VideoLLaMA3, a new AI model that is very good at understanding both images and videos. It is like teaching a computer to see and interpret visual information the way humans do, with some clever training and design tricks to make it work even better.

What's the problem?

Current AI models that work with images and videos often struggle to understand them as well as humans do. They might miss important details or have trouble making sense of complex scenes. It's especially hard to create models that are good at understanding both still images and moving videos, as these require different skills.

What's the solution?

The researchers created VideoLLaMA3 with a 'vision-centric' approach, meaning they focused on teaching the AI to really understand visual information first. They trained it in four stages using high-quality image and text data. First, they warmed up the vision parts of the model so basic visual features line up with language. Then they trained it on large amounts of images with descriptions (scene photos, documents, charts) to help it understand what it is seeing. After that, they fine-tuned it on specific tasks and introduced video data, and finally ran a video-focused fine-tuning stage to strengthen video understanding. They also made the model flexible in how it looks at images and videos: it can spend more tokens on detailed images and drop redundant parts of videos, so it focuses on what matters and ignores what does not.
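
To make the staged recipe concrete, here is a minimal sketch of how the four stages might differ in which parts of the model are trained and on what kind of data. This is not the authors' training code: the module names (vision_encoder, projector, llm), the dataset labels, and the choice of which modules stay trainable in the later stages are illustrative assumptions based on the description above.

```python
# Hypothetical sketch of VideoLLaMA3's four-stage, vision-centric training recipe.
# Module names, data labels, and the trainable sets are illustrative, not the
# authors' actual configuration.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    trainable: set   # modules that receive gradient updates in this stage
    data: list       # rough data mixture for this stage


STAGES = [
    # 1) Vision-centric alignment: warm up only the vision encoder and projector.
    Stage("vision_alignment", {"vision_encoder", "projector"},
          ["image-caption pairs"]),
    # 2) Vision-language pretraining: jointly tune encoder, projector, and LLM on
    #    large-scale image-text data (scene images, documents, charts) plus text.
    Stage("vision_language_pretraining", {"vision_encoder", "projector", "llm"},
          ["scene images", "documents", "charts", "text-only"]),
    # 3) Multi-task fine-tuning: image-text SFT for downstream tasks, plus
    #    video-text data to lay a foundation for video understanding.
    Stage("multi_task_sft", {"vision_encoder", "projector", "llm"},
          ["image-text SFT", "video-text"]),
    # 4) Video-centric fine-tuning: focus on improving video understanding.
    Stage("video_sft", {"vision_encoder", "projector", "llm"},
          ["video-text SFT"]),
]


def set_trainable(model_modules: dict, stage: Stage) -> None:
    """Freeze or unfreeze each (torch-style) module according to the stage."""
    for name, module in model_modules.items():
        for p in module.parameters():
            p.requires_grad = name in stage.trainable
```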

Why it matters?

This matters because better AI for understanding images and videos could have many real-world uses. It could help create smarter security cameras, improve medical imaging, make self-driving cars safer, or even help visually impaired people understand their surroundings better. By making an AI that's good at both images and videos, VideoLLaMA3 could be a versatile tool for many different applications, potentially leading to new innovations in fields that rely on visual information.

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) the video-centric fine-tuning stage, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
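
The video-side design described in the abstract (dropping vision tokens that barely change between frames so the video representation stays compact) can be illustrated with a short sketch. This is an approximation under stated assumptions, not the paper's exact pruning rule: the prune_video_tokens function, the per-position cosine-similarity comparison, and the 0.9 threshold are all illustrative.

```python
# Minimal sketch of similarity-based video token reduction: tokens that are nearly
# identical to the same position in the previous frame are dropped. The cosine
# similarity criterion and threshold are assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F


def prune_video_tokens(frame_tokens: torch.Tensor, threshold: float = 0.9):
    """
    frame_tokens: (T, N, D) tensor of vision tokens for T frames,
                  N tokens per frame, embedding dimension D.
    Returns a list of length T with the kept tokens per frame (variable count).
    """
    kept = [frame_tokens[0]]                              # keep the first frame in full
    for t in range(1, frame_tokens.shape[0]):
        prev, curr = frame_tokens[t - 1], frame_tokens[t]
        # Cosine similarity between each token and the same position in the previous frame.
        sim = F.cosine_similarity(curr, prev, dim=-1)     # shape (N,)
        keep_mask = sim < threshold                       # keep only tokens that changed
        kept.append(curr[keep_mask])
    return kept


# Example: 8 frames, 196 tokens per frame, 1024-dim embeddings.
pruned = prune_video_tokens(torch.randn(8, 196, 1024))
print([p.shape[0] for p in pruned])   # kept token counts per frame
```

Note that each pruned frame yields a variable number of tokens, which matches the framework's broader idea of adapting token counts to the input rather than fixing them.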