Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
2024-11-25

Summary
This paper introduces CoordTok, a video tokenizer that efficiently breaks long videos down into a small number of tokens, making it cheaper to train models that analyze or generate video content.
What's the problem?
Training models to understand long videos is challenging because existing tokenizers are trained to reconstruct every frame of a clip at once, which is slow and memory-intensive. As a result, they are usually limited to short clips and cannot fully exploit how much consecutive frames overlap, so encoding long videos ends up requiring far more tokens, compute, and time than necessary.
What's the solution?
CoordTok takes a coordinate-based approach to tokenization. Rather than reconstructing every frame at once during training, it encodes a video into factorized triplane representations and reconstructs only the patches that correspond to randomly sampled (x, y, t) coordinates. Because each training step touches just a small fraction of the video, large tokenizer models can be trained directly on long clips, and the redundancy between nearby frames lets the whole video be compressed into far fewer tokens. For example, CoordTok encodes a 128-frame video at 128×128 resolution into 1280 tokens, while baseline tokenizers need 6144 or 8192 tokens to reach similar reconstruction quality. This compact representation in turn enables faster, more memory-efficient training of video generation models.
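A minimal sketch of this training idea, assuming hypothetical encoder and decoder modules and an 8×8 patch size (these details are placeholders, not taken from the paper): each step samples a handful of (x, y, t) coordinates, decodes only the patches at those coordinates, and computes the reconstruction loss on them alone.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a 128-frame clip at 128x128 resolution, 8x8 patches.
T, H, W, P = 128, 128, 128, 8


def training_step(video, encoder, decoder, num_coords=256):
    """One CoordTok-style step: reconstruct only patches at sampled (x, y, t).

    video: (3, T, H, W) tensor; encoder and decoder are hypothetical modules
    (the encoder returns latents, e.g. triplane features, and the decoder maps
    latents plus normalized coordinates to 8x8 RGB patches).
    """
    latents = encoder(video)

    # Sample random patch coordinates instead of reconstructing every frame.
    ts = torch.randint(0, T, (num_coords,))
    ys = torch.randint(0, H // P, (num_coords,)) * P
    xs = torch.randint(0, W // P, (num_coords,)) * P

    # Ground-truth patches at the sampled coordinates: (num_coords, 3, P, P).
    target = torch.stack(
        [video[:, t, y:y + P, x:x + P] for t, y, x in zip(ts, ys, xs)]
    )

    # Decode only those patches, conditioning on coordinates scaled to [0, 1].
    coords = torch.stack([xs / W, ys / H, ts / T], dim=-1).float()
    pred = decoder(latents, coords)  # (num_coords, 3, P, P)

    # The loss touches only the sampled patches, never the full clip.
    return F.mse_loss(pred, target)
```

The point of this setup is that the cost of one training step scales with the number of sampled patches rather than with the full length of the clip.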
Why it matters?
This research is important because it makes training AI models on long videos far cheaper and faster, which can improve performance on tasks like video analysis and generation. By shrinking the number of tokens a long video needs, CoordTok can help advance applications in fields such as entertainment, education, and broader AI research.
Abstract
Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x,y,t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
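As a rough illustration of the factorized triplane representation mentioned in the abstract, the sketch below reads a feature for a coordinate (x, y, t) by bilinearly sampling each of the three planes with two of the coordinates and summing the results; the plane sizes, the additive combination, and the example token split are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F


def query_triplane(planes, x, y, t):
    """Read a feature vector for normalized coords x, y, t in [-1, 1].

    planes: dict with 'xy' of shape (1, C, Hy, Wx), 'xt' of shape
    (1, C, Tt, Wx), and 'yt' of shape (1, C, Tt, Hy). The bilinear sampling
    and additive combination are illustrative assumptions.
    """
    def sample(plane, u, v):
        # grid_sample expects a grid of (x, y)-ordered pairs in [-1, 1].
        grid = torch.tensor([[[[u, v]]]], dtype=plane.dtype)
        return F.grid_sample(plane, grid, align_corners=True).view(-1)  # (C,)

    f_xy = sample(planes['xy'], x, y)  # spatial plane, indexed by (x, y)
    f_xt = sample(planes['xt'], x, t)  # space-time plane, indexed by (x, t)
    f_yt = sample(planes['yt'], y, t)  # space-time plane, indexed by (y, t)
    return f_xy + f_xt + f_yt  # combined feature fed to a patch decoder


# One hypothetical split of 1280 tokens across the three planes.
C = 8
planes = {
    'xy': torch.randn(1, C, 16, 16),  # 16 * 16 = 256 tokens
    'xt': torch.randn(1, C, 32, 16),  # 32 * 16 = 512 tokens
    'yt': torch.randn(1, C, 32, 16),  # 32 * 16 = 512 tokens
}
feature = query_triplane(planes, x=0.1, y=-0.3, t=0.7)  # shape (C,)
```

Under this hypothetical split, the three planes together hold 256 + 512 + 512 = 1280 feature vectors, matching the token count quoted in the abstract; the paper's actual plane resolutions may differ.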