
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen

2025-01-17


Summary

This paper studies how to improve the way AI systems compress and recreate images and videos. The researchers focus on a component called the 'visual tokenizer', which turns complex visual information into a simpler, compressed form that the rest of the AI system can work with more easily.
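To make that concrete, here is a minimal sketch of what a visual tokenizer does: squeeze an image down into a small grid of latent vectors, then reconstruct the pixels from that grid. The convolutional patch projection, patch size, and latent width below are illustrative stand-ins, not the paper's actual Transformer-based design.

```python
import torch
import torch.nn as nn

patch = 16        # each 16x16 block of pixels becomes one token (illustrative)
latent_dim = 16   # channels kept per latent token (illustrative)

# Encoder: fold pixels into patches and project them into the latent space.
encoder = nn.Conv2d(3, latent_dim, kernel_size=patch, stride=patch)
# Decoder: project latents back and unfold them into pixels.
decoder = nn.ConvTranspose2d(latent_dim, 3, kernel_size=patch, stride=patch)

images = torch.randn(2, 3, 256, 256)                    # dummy batch of 256p images
latents = encoder(images)                               # (2, 16, 16, 16): the compressed form
reconstruction = decoder(latents)                       # (2, 3, 256, 256): pixels again
loss = nn.functional.mse_loss(reconstruction, images)   # train by reconstructing well
```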

What's the problem?

While AI systems for generating images and videos have improved by making the main component (the generator) bigger, the component that compresses the images (the tokenizer) has rarely been scaled up. As a result, we don't really know how changes to the tokenizer affect how well the AI can reconstruct images or generate new ones.

What's the solution?

The researchers built a new tokenizer called ViTok, which replaces the usual convolutional design with a Vision Transformer. They trained it on image and video datasets far larger than those typically used (such as ImageNet-1K) and varied each part of the design to see how it affected performance. They found that making the encoder bigger gave little benefit, while making the decoder bigger helped a lot with reconstructing images but had mixed results for generating new ones. Using what they learned, they made ViTok competitive with state-of-the-art tokenizers while using 2-5x less compute.
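The knobs the study turns can be sketched as a toy PyTorch module: the encoder size, the decoder size, and the bottleneck width can each be scaled independently. The class name, layer counts, and widths here are hypothetical placeholders, not ViTok's published configuration.

```python
import torch
import torch.nn as nn

def vit_stack(width: int, depth: int) -> nn.TransformerEncoder:
    """A plain stack of Transformer blocks acting on a sequence of patch tokens."""
    layer = nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class ToyViTTokenizer(nn.Module):
    def __init__(self, patch=16, enc_width=256, enc_depth=4,
                 dec_width=512, dec_depth=8, bottleneck=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, enc_width, kernel_size=patch, stride=patch)
        self.encoder = vit_stack(enc_width, enc_depth)          # scale the encoder here
        self.to_latent = nn.Linear(enc_width, bottleneck)       # bottleneck width
        self.from_latent = nn.Linear(bottleneck, dec_width)
        self.decoder = vit_stack(dec_width, dec_depth)          # scale the decoder here
        self.unpatchify = nn.ConvTranspose2d(dec_width, 3, kernel_size=patch, stride=patch)

    def forward(self, images):
        tokens = self.patchify(images)                    # (B, C, H/patch, W/patch)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)        # (B, N, C) token sequence
        latents = self.to_latent(self.encoder(tokens))    # compressed representation
        decoded = self.decoder(self.from_latent(latents))
        decoded = decoded.transpose(1, 2).reshape(b, -1, h, w)
        return self.unpatchify(decoded)                   # back to pixels

recon = ToyViTTokenizer()(torch.randn(1, 3, 256, 256))    # (1, 3, 256, 256)
```

In these terms, the paper's finding is that growing dec_width/dec_depth pays off for reconstruction, while growing enc_width/enc_depth barely helps.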

Why it matters?

This matters because it could lead to AI that can create more realistic images and videos while using less computing power. This could make AI-generated content more accessible and improve things like special effects in movies, creating virtual worlds for games, or even helping designers and artists come up with new ideas. It's a step towards making AI visual systems more efficient and effective, which could have wide-ranging impacts in fields from entertainment to education.

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explored the effect of separately scaling the auto-encoders' encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
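For downstream generation, the tokenizer is paired with a Diffusion Transformer that works entirely in the latent space. The sketch below shows only that data flow; the encoder, decoder, and denoiser are trivial stand-ins, and the noising and one-pass 'sampling' are simplified for brevity.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 16, kernel_size=16, stride=16)            # stand-in tokenizer encoder
decoder = nn.ConvTranspose2d(16, 3, kernel_size=16, stride=16)   # stand-in tokenizer decoder
denoiser = nn.Conv2d(16, 16, kernel_size=3, padding=1)           # stand-in for the DiT

images = torch.randn(4, 3, 256, 256)

# 1) Compress pixels into latents with the (frozen) tokenizer encoder.
with torch.no_grad():
    latents = encoder(images)                     # (4, 16, 16, 16)

# 2) Train the generator in latent space: add noise, learn to predict it back.
noise = torch.randn_like(latents)
t = torch.rand(latents.shape[0], 1, 1, 1)         # random diffusion "time" per sample
noisy = (1 - t) * latents + t * noise             # simplified linear noising schedule
diffusion_loss = nn.functional.mse_loss(denoiser(noisy), noise)

# 3) At sampling time, generated latents are decoded back into pixels
#    (a real sampler iterates many denoising steps; one pass shown here).
samples = decoder(denoiser(torch.randn_like(latents)))   # (4, 3, 256, 256)
```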