Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
2024-12-10

Summary
This paper introduces Divot, a new video tokenizer that uses a diffusion process to help large language models (LLMs) both understand and generate videos more effectively.
What's the problem?
Understanding and generating videos is complex because videos contain both spatial information (what is seen in each frame) and temporal information (how things change over time). Existing methods often struggle to capture both in a single representation, making it difficult for AI models to understand videos deeply or generate realistic ones.
What's the solution?
The authors introduce Divot, a video tokenizer that converts video clips into compact representations capturing both their spatial and temporal characteristics. Instead of a standard reconstruction objective, Divot is trained with a diffusion process: if a video diffusion model can remove noise from a clip when given the tokenizer's features as a condition, those features must contain the clip's essential information. The same diffusion model also acts as a de-tokenizer, turning representations back into video clips. Building on this tokenizer, the authors develop Divot-Vicuna, which pairs Divot with a pre-trained LLM so that it can describe videos in text and generate new video clips from text descriptions.
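To make the training idea concrete, here is a minimal, self-contained PyTorch sketch of the general recipe described above: a toy tokenizer produces features from a clip, and a toy denoiser tries to recover the added noise while conditioned on those features. All module names, shapes, and the single fixed noise level are illustrative assumptions, not the paper's actual architecture or noise schedule.

```python
import torch
import torch.nn as nn


class ToyVideoTokenizer(nn.Module):
    """Toy stand-in for a Divot-style tokenizer: maps a video clip
    of shape (B, T, C, H, W) to a sequence of continuous feature vectors."""
    def __init__(self, in_dim=3 * 32 * 32, feat_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, feat_dim)

    def forward(self, video):
        b, t = video.shape[:2]
        return self.proj(video.reshape(b, t, -1))  # (B, T, feat_dim)


class ToyConditionalDenoiser(nn.Module):
    """Toy denoiser that predicts the noise added to a clip, conditioned on
    the tokenizer features -- the "if it can de-noise, the features are
    informative" idea from the paper, in miniature."""
    def __init__(self, in_dim=3 * 32 * 32, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + feat_dim, 512), nn.GELU(),
            nn.Linear(512, in_dim),
        )

    def forward(self, noisy_video, features):
        b, t = noisy_video.shape[:2]
        x = torch.cat([noisy_video.reshape(b, t, -1), features], dim=-1)
        return self.net(x).reshape_as(noisy_video)


def diffusion_training_step(tokenizer, denoiser, video, optimizer):
    """One self-supervised step: add noise to the clip and ask the denoiser
    to recover it, conditioned on the tokenizer features."""
    features = tokenizer(video)
    noise = torch.randn_like(video)
    # A single illustrative noise level; a real diffusion model samples a
    # timestep per example and uses a full noise schedule.
    alpha = 0.7
    noisy = alpha ** 0.5 * video + (1 - alpha) ** 0.5 * noise
    pred_noise = denoiser(noisy, features)
    loss = nn.functional.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    tok, den = ToyVideoTokenizer(), ToyConditionalDenoiser()
    opt = torch.optim.AdamW(list(tok.parameters()) + list(den.parameters()), lr=1e-4)
    clip = torch.randn(2, 8, 3, 32, 32)  # (batch, frames, channels, H, W)
    print(diffusion_training_step(tok, den, clip, opt))
```

Because the denoising loss only succeeds when the conditioning features carry enough information about the clip, the tokenizer is pushed to encode both appearance and motion without any labels.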
Why it matters?
This research is important because it enhances the capabilities of AI in handling video content, which is increasingly used in many applications like entertainment, education, and social media. By improving how AI understands and generates videos, Divot can lead to more sophisticated tools for storytelling and content creation, making technology more interactive and engaging.
Abstract
In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, which can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna, which performs video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.
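As a rough illustration of the last piece mentioned in the abstract, the sketch below shows one plausible way to model continuous-valued features with a Gaussian Mixture Model head on top of LLM hidden states: the head predicts mixture weights, means, and variances, is trained with a negative log-likelihood loss, and can be sampled from at generation time. The class name, dimensions, and number of mixture components are assumptions chosen for illustration; this is not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMHead(nn.Module):
    """Minimal Gaussian Mixture Model head: from an LLM hidden state,
    predict mixture weights, means, and (log-)variances over a continuous
    feature vector, so the next video feature can be trained with a
    negative log-likelihood loss and sampled at inference time."""
    def __init__(self, hidden_dim=1024, feat_dim=256, n_components=4):
        super().__init__()
        self.n_components = n_components
        self.feat_dim = feat_dim
        self.to_logits = nn.Linear(hidden_dim, n_components)
        self.to_means = nn.Linear(hidden_dim, n_components * feat_dim)
        self.to_logvars = nn.Linear(hidden_dim, n_components * feat_dim)

    def forward(self, h):
        b = h.shape[0]
        logits = self.to_logits(h)                                    # (B, K)
        means = self.to_means(h).view(b, self.n_components, -1)       # (B, K, D)
        logvars = self.to_logvars(h).view(b, self.n_components, -1)   # (B, K, D)
        return logits, means, logvars

    def nll(self, h, target):
        """Negative log-likelihood of a target feature under the mixture."""
        logits, means, logvars = self(h)
        target = target.unsqueeze(1)                                   # (B, 1, D)
        # Per-component diagonal-Gaussian log density, summed over feature dims.
        log_prob = -0.5 * (logvars + (target - means) ** 2 / logvars.exp()
                           + torch.log(torch.tensor(2 * torch.pi)))
        log_prob = log_prob.sum(dim=-1)                                # (B, K)
        log_mix = F.log_softmax(logits, dim=-1)
        return -torch.logsumexp(log_mix + log_prob, dim=-1).mean()

    @torch.no_grad()
    def sample(self, h):
        """Pick a component from the mixture weights, then sample from it."""
        logits, means, logvars = self(h)
        comp = torch.distributions.Categorical(logits=logits).sample()  # (B,)
        idx = comp.view(-1, 1, 1).expand(-1, 1, self.feat_dim)
        mean = means.gather(1, idx).squeeze(1)
        std = (0.5 * logvars.gather(1, idx).squeeze(1)).exp()
        return mean + std * torch.randn_like(std)


if __name__ == "__main__":
    head = GMMHead()
    h = torch.randn(2, 1024)           # pretend LLM hidden states
    target = torch.randn(2, 256)       # pretend Divot-style video features
    print(head.nll(h, target).item())  # training loss
    print(head.sample(h).shape)        # torch.Size([2, 256])
```

A mixture head like this is one common way to let an autoregressive model output continuous vectors instead of discrete tokens, since a single Gaussian would force every prediction toward one average feature.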