Latent Denoising Makes Good Visual Tokenizers
Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang
2025-07-22
Summary
This paper introduces the Latent Denoising Tokenizer (l-DeTok), a new way to help AI models understand and create images by training the tokenizer to reconstruct clean images from noisy, corrupted latent embeddings.
What's the problem?
The problem is that current tokenizers, which break images down into smaller pieces for AI models to process, are not robust to noisy or imperfect information, which makes it harder for downstream models to generate high-quality images.
What's the solution?
The authors train l-DeTok to repair corrupted latent embeddings directly: during training, they add noise to the embeddings and mask parts of them, then require the model to reconstruct the clean image from these imperfect inputs.
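The corruption step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the zero-vector stand-in for a learned mask token, and the hyperparameter values are all assumptions.

```python
import random

def corrupt_latents(latents, noise_std=0.5, mask_ratio=0.3, seed=0):
    """Corrupt latent token embeddings with noise and random masking.

    `latents` is a list of token embeddings (each a list of floats).
    Hypothetical helper illustrating the denoising objective; the real
    tokenizer operates on learned encoder outputs, not raw lists.
    """
    rng = random.Random(seed)
    corrupted = []
    for token in latents:
        if rng.random() < mask_ratio:
            # Masked tokens are zeroed out (a stand-in for a learned mask token).
            corrupted.append([0.0] * len(token))
        else:
            # Surviving tokens get additive Gaussian noise.
            corrupted.append([x + rng.gauss(0.0, noise_std) for x in token])
    return corrupted
```

The decoder would then be trained to reconstruct the original image from `corrupt_latents(encoder_output)`, so that embeddings useful after corruption are exactly the ones generative models can exploit.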
Why it matters?
This matters because a tokenizer trained to withstand corruption helps AI models generate clearer and more accurate images across different tasks, improving the quality and reliability of AI-based image generation.
Abstract
Latent Denoising Tokenizer (l-DeTok) improves generative modeling by aligning tokenizer embeddings with a denoising objective, outperforming standard tokenizers across various models.