Towards Scalable Pre-training of Visual Tokenizers for Generation

Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang

2025-12-16

Summary

This paper focuses on improving how computers 'understand' images before using them to generate new ones, specifically by examining how images are compressed into a simpler representation called a 'latent space'.

What's the problem?

Currently, when training computers to compress images, the focus is on making the reconstructed image look as close to the original as possible. However, this prioritizes low-level details like edges and colors and doesn't necessarily help the computer grasp the overall *meaning* of the image. As a result, spending more and more computing power on this type of training doesn't actually lead to better results when the computer tries to *generate* new, realistic images; the authors call this the 'pre-training scaling problem'.

What's the solution?

The researchers developed a new training framework called VTP that doesn't focus on reconstruction alone. Instead, it simultaneously trains the computer to understand the image's content, relate it to text descriptions, and reconstruct it. This combined approach forces the computer to learn a more meaningful and concise representation of the image, one that captures high-level concepts rather than just pixel-level accuracy.
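The combined objective can be sketched as a weighted sum of the three loss terms the summary describes. The function below is an illustrative sketch only, not the paper's actual implementation: the name `vtp_style_loss`, the loss weights, and the choice of L1 for reconstruction and cosine distance for the self-supervised term are all assumptions made for clarity.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Row-wise softmax cross-entropy with integer class labels.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def vtp_style_loss(image_embed, text_embed, student_feat, teacher_feat,
                   recon, target, w_clip=1.0, w_ssl=1.0, w_rec=1.0, tau=0.07):
    """Hypothetical joint objective: contrastive + self-supervised + reconstruction."""
    # 1) Image-text contrastive term (CLIP-style, symmetric over both directions).
    img = image_embed / np.linalg.norm(image_embed, axis=1, keepdims=True)
    txt = text_embed / np.linalg.norm(text_embed, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    labels = np.arange(logits.shape[0])  # matching image-text pairs lie on the diagonal
    contrastive = 0.5 * (softmax_cross_entropy(logits, labels) +
                         softmax_cross_entropy(logits.T, labels))
    # 2) Self-supervised term: cosine distance between tokenizer features
    #    and features from a frozen teacher (an assumed choice of SSL loss).
    cos = (student_feat * teacher_feat).sum(axis=1) / (
        np.linalg.norm(student_feat, axis=1) * np.linalg.norm(teacher_feat, axis=1))
    ssl = (1.0 - cos).mean()
    # 3) Reconstruction term: pixel-level L1 between decoded and original images.
    rec = np.abs(recon - target).mean()
    return w_clip * contrastive + w_ssl * ssl + w_rec * rec
```

The key design point the summary highlights is that all three terms are optimized *jointly*, so the latent space cannot collapse to pure pixel fidelity: the contrastive and self-supervised terms pull it toward semantically meaningful features.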

Why it matters?

This work is important because it shows that understanding is key to good image generation. By improving the way computers represent images internally, they can generate much higher quality images with the same amount of computing power, and the benefits of increased computing power actually translate into better results. This could lead to significant advancements in creating realistic images for various applications like art, design, and virtual reality.

Abstract

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generation performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) the joint framework exhibits much better scaling properties, with generative performance scaling effectively with the compute, parameters, and data allocated to pre-training the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPs in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while the conventional autoencoder stagnates very early, at 1/10 of the FLOPs. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.