xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
2024-10-23

Summary
This paper introduces xGen-MM-Vid (also known as BLIP-3-Video), a multimodal language model that processes videos far more efficiently by representing an entire video with only 32 visual tokens instead of the thousands used by comparable models.
What's the problem?
Traditional video-understanding models often need a huge number of tokens (small pieces of data) to represent the visual information in every frame, which makes them slow and resource-intensive. For example, some models require over 4,600 visual tokens just to analyze a short clip, which makes longer videos hard to handle effectively.
What's the solution?
The researchers developed BLIP-3-Video, which adds a 'temporal encoder' on top of the usual visual tokenizer. Instead of passing every frame's tokens to the language model, the temporal encoder maps the token sequences from multiple frames into just 32 video tokens, using either learnable spatio-temporal pooling or a sequential model such as a Token Turing Machine. This lets the model keep the most important information across frames and maintain high accuracy on tasks like video question answering while being much smaller and faster than competing models.
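To make the idea concrete, here is a minimal PyTorch sketch (not the authors' released code) of one plausible form of such a temporal encoder: 32 learnable query tokens that cross-attend over all frame tokens, so the language model only ever sees 32 video tokens. All layer sizes, names, and token counts per frame are illustrative assumptions.

```python
# Illustrative sketch of attention-based token pooling: compress all per-frame
# visual tokens into a fixed budget of 32 video tokens. Not the paper's code.
import torch
import torch.nn as nn

class AttentionTokenPooler(nn.Module):
    def __init__(self, dim=1024, num_video_tokens=32, num_heads=8):
        super().__init__()
        # 32 learnable queries; each learns to summarize part of the video.
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, T * tokens_per_frame, dim), i.e. the visual
        # tokenizer's outputs for all frames flattened along time.
        b = frame_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, 32, dim)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)   # cross-attention
        return self.norm(pooled)                                # (B, 32, dim)

# Example: 8 frames x 128 tokens per frame -> 32 video tokens
x = torch.randn(2, 8 * 128, 1024)
video_tokens = AttentionTokenPooler()(x)
print(video_tokens.shape)  # torch.Size([2, 32, 1024])
```

Whatever the exact mechanism, the key design point is that the token budget handed to the language model is fixed at 32, independent of how many frames are sampled.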
Why it matters?
This advancement is significant because it allows for quicker processing of video data without sacrificing quality. It opens up new possibilities for applications in areas like entertainment, education, and autonomous vehicles, where efficient video analysis is crucial.
Abstract
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
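The abstract also mentions sequential temporal encoders such as Token Turing Machines. The sketch below is a simplified, hypothetical illustration of that idea (not the paper's implementation): frames are read one at a time while a fixed 32-token memory is updated, so the token count stays constant no matter how long the video is. All dimensions and module names are assumptions.

```python
# Simplified sketch of a Token-Turing-Machine-style sequential encoder:
# a fixed 32-token memory is read/written once per frame. Illustrative only.
import torch
import torch.nn as nn

class SequentialTokenMemory(nn.Module):
    def __init__(self, dim=1024, memory_tokens=32, num_heads=8):
        super().__init__()
        self.memory_init = nn.Parameter(torch.randn(memory_tokens, dim) * 0.02)
        self.read_write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames):
        # frames: (batch, T, tokens_per_frame, dim) from the visual tokenizer.
        b, t = frames.shape[:2]
        memory = self.memory_init.unsqueeze(0).expand(b, -1, -1)  # (B, 32, dim)
        for i in range(t):
            # Memory queries attend over [old memory; current frame tokens],
            # producing the updated 32-token memory for the next step.
            kv = torch.cat([memory, frames[:, i]], dim=1)
            updated, _ = self.read_write(memory, kv, kv)
            memory = self.norm(memory + updated)
        return memory  # (B, 32, dim) video tokens passed to the language model

# Example: 2 videos, 8 frames, 128 tokens per frame -> 32 video tokens each
out = SequentialTokenMemory()(torch.randn(2, 8, 128, 1024))
print(out.shape)  # torch.Size([2, 32, 1024])
```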