Image Tokenizer Needs Post-Training

Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides

2025-09-18

Summary

This paper focuses on improving how AI creates images by tackling a problem with the way images are broken down into a code-like representation before being generated. It proposes a new way to train the tokenizer, the component that converts images to and from this code, so that it better withstands the imperfections of the generation process, leading to higher-quality images and faster training.

What's the problem?

Current image-generating AI models use a system where images are first converted into a series of 'tokens,' like words in a sentence, to make them easier to work with. However, the way these tokens are created prioritizes accurately *reconstructing* existing images, not necessarily *generating* new, realistic ones. This creates a mismatch: the AI is good at copying, but not as good at inventing. The existing tokenizers don't account for the errors that happen when the AI tries to create something new, leading to lower quality images and slower training.

What's the solution?

The researchers developed a two-part training process for the tokenizer. First, during 'main training,' they intentionally inject noise into the token sequence to simulate the errors that occur during image generation, which makes the tokenizer more robust to unexpected tokens. They also introduced a new evaluation metric, pFID, which measures tokenizer quality in a way that directly correlates with the quality of the generated images. Second, during 'post-training,' they fine-tuned the tokenizer's decoder against a specific pre-trained image generator, further shrinking the gap between how tokens look when generated versus when reconstructed. Together, these steps improve both the quality and the speed of image generation.
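The noise-injection idea from main training can be sketched in a few lines: randomly swap a fraction of the discrete token indices for random codebook entries, standing in for the unexpected tokens a generator might sample. This is a minimal illustration, not the authors' exact perturbation scheme; `perturb_tokens` and its parameters are hypothetical.

```python
import numpy as np

def perturb_tokens(token_ids, codebook_size, perturb_rate=0.1, rng=None):
    """Randomly replace a fraction of discrete token indices with random
    codebook entries, mimicking generator sampling errors.
    Hypothetical helper for illustration, not the paper's implementation."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pick which positions to corrupt, then draw random replacement indices.
    mask = rng.random(token_ids.shape) < perturb_rate
    random_ids = rng.integers(0, codebook_size, size=token_ids.shape)
    return np.where(mask, random_ids, token_ids)
```

During main training, the decoder would then be asked to reconstruct the original image from these partially corrupted tokens, so it learns to tolerate the kinds of mistakes a generator makes at sampling time.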

Why it matters?

This work is important because it addresses a fundamental bottleneck in image generation. By improving the tokenizer, the AI can create more realistic and detailed images more efficiently. This has implications for a wide range of applications, from art and design to scientific visualization and beyond. The new evaluation metric, pFID, also provides a better way to assess the performance of tokenizers and track progress in the field.
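Given the description above, a plausible reading of pFID is "FID computed on images decoded from deliberately perturbed tokens," so that a tokenizer scores well only if its decoder tolerates sampling-style errors. The sketch below is a guess at that pipeline under this assumption; `encode`, `decode`, and `fid_score` are hypothetical callables, not the paper's API.

```python
import numpy as np

def pfid(encode, decode, fid_score, images, codebook_size,
         perturb_rate=0.1, seed=0):
    """Sketch of a 'perturbed FID': encode to discrete tokens, corrupt a
    fraction of them (standing in for generator sampling errors), decode,
    and score the result against the originals with an FID-style metric.
    All callables are placeholders for illustration only."""
    rng = np.random.default_rng(seed)
    tokens = encode(images)
    mask = rng.random(tokens.shape) < perturb_rate
    noisy = np.where(mask, rng.integers(0, codebook_size, tokens.shape), tokens)
    return fid_score(decode(noisy), images)
```

With an identity tokenizer and a mean-absolute-difference stand-in for FID, the score is zero at `perturb_rate=0` and typically rises as more tokens are corrupted, which is the behavior a generation-aware tokenizer metric should reward.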

Abstract

Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space, and, from which, we propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, i.e., the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.