
Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li, Kaiming He

2025-11-18


Summary

This paper challenges how today's diffusion models for image generation are trained. Instead of having the network predict the noise in an image, the researchers propose an approach that directly predicts what the clean, original image should look like.

What's the problem?

Current diffusion models don't actually 'denoise' images in the classical sense: they predict the noise *added* to an image (or some other noised quantity), not the clean image itself. The authors argue that predicting noise is fundamentally harder than predicting the image data, because natural images lie on a simpler, lower-dimensional structure (called a manifold), while noise is random and follows no such structure. Asking a network of limited capacity to predict noise for very high-dimensional, high-resolution images can overwhelm it and lead to poor results. The difference between the two training targets is sketched below.
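
To make the contrast concrete, here is a minimal sketch, in PyTorch-style Python, of the two regression targets side by side. The linear interpolation z_t = (1 - t) * x + t * noise, the name `model`, and the plain mean-squared errors are illustrative assumptions, not the paper's exact formulation.

    import torch

    def training_targets(model, x_clean, t):
        """Contrast the two regression targets for a denoising generative model.

        x_clean: clean images, shape (B, C, H, W)
        t:       per-sample noise level in [0, 1], shape (B, 1, 1, 1)
        model:   any network mapping (noisy image, t) -> tensor of the same shape
        Assumes the linear interpolation z_t = (1 - t) * x + t * noise;
        the paper's exact noise schedule may differ.
        """
        noise = torch.randn_like(x_clean)        # random, off the image manifold
        z_t = (1.0 - t) * x_clean + t * noise    # noisy input seen by the network

        pred = model(z_t, t)

        # Common practice: regress the added noise (noise-prediction).
        loss_noise = ((pred - noise) ** 2).mean()

        # Advocated here: regress the clean image itself (x-prediction),
        # whose target lies on the low-dimensional image manifold.
        loss_clean = ((pred - x_clean) ** 2).mean()

        # (In practice a model is trained with one target or the other.)
        return loss_noise, loss_clean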

What's the solution?

The researchers developed a model called 'Just image Transformers', or JiT. It is a plain Transformer that operates directly on raw pixels and predicts the clean image, without the extra machinery common in other diffusion models: no tokenizer that compresses images into a latent space, no pre-training on other data, and no auxiliary losses. The image is split into large pixel patches (16x16 or 32x32) that are fed straight to the Transformer, and surprisingly, this simple setup works very well at resolutions where predicting noise can fail. A rough sketch of the idea follows below.
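
The sketch below shows the "pixels in, clean pixels out" shape of the idea. The class name, layer sizes, and the use of torch.nn.TransformerEncoder are illustrative assumptions; the paper's actual JiT architecture (depth, width, conditioning) is not reproduced here.

    import torch
    import torch.nn as nn

    class JiTSketch(nn.Module):
        """Minimal 'pixels in, clean pixels out' Transformer (not the official JiT code)."""
        def __init__(self, img_size=256, patch=16, dim=768, depth=12, heads=12):
            super().__init__()
            self.patch = patch
            num_patches = (img_size // patch) ** 2
            patch_dim = 3 * patch * patch                 # raw RGB pixels, no tokenizer
            self.embed = nn.Linear(patch_dim, dim)        # linear patch embedding
            self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, patch_dim)         # predict clean pixel patches

        def patchify(self, x):
            B, C, H, W = x.shape
            p = self.patch
            x = x.reshape(B, C, H // p, p, W // p, p)
            x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * p * p)
            return x

        def forward(self, z_t):
            # Time/class conditioning is omitted here for brevity.
            tokens = self.embed(self.patchify(z_t)) + self.pos
            return self.head(self.blocks(tokens))         # clean-pixel prediction per patch

    # Usage: a noisy 256x256 batch goes in, clean-pixel patch predictions come out.
    # out = JiTSketch()(torch.randn(2, 3, 256, 256))      # out shape: (2, 256, 768)

The key design choice this illustrates is that the tokens are nothing but large blocks of raw pixels: there is no learned tokenizer in front of the Transformer and no decoder behind it, only a linear layer in and a linear layer out.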

Why it matters?

This research is important because it suggests we can build powerful image generation models with a simpler, more direct approach. By focusing on predicting the clean image itself, they were able to achieve competitive results with a relatively small and straightforward model, potentially making image generation more efficient and accessible. It also reinforces the idea that understanding the underlying structure of image data is key to building better AI.

Abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With networks operating on the manifold of natural data, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
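
For intuition on how a clean-image predictor can still drive iterative generation, here is a minimal Euler sampler sketch. It assumes a rectified-flow style interpolation z_t = (1 - t) * x + t * noise and a hypothetical model(z, t) that returns the predicted clean image; the paper's actual sampler, schedule, and conditioning may differ.

    import torch

    @torch.no_grad()
    def sample(model, shape, steps=50):
        """Euler sampler sketch: the network predicts the clean image x_hat, and the
        update converts it to a velocity under the (assumed) interpolation
        z_t = (1 - t) * x + t * noise, for which v = dz_t/dt = noise - x."""
        z = torch.randn(shape)                    # start from pure noise at t = 1
        ts = torch.linspace(1.0, 0.0, steps + 1)
        for i in range(steps):
            t, t_next = ts[i], ts[i + 1]
            x_hat = model(z, t)                   # direct clean-image prediction
            eps_hat = (z - (1 - t) * x_hat) / t   # implied noise estimate
            v_hat = eps_hat - x_hat               # implied velocity
            z = z + (t_next - t) * v_hat          # Euler step toward t = 0
        return z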