
Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu

2026-01-01


Summary

This paper focuses on improving the quality of images generated by diffusion models, a type of AI that creates images from noise. The authors introduce a new technique called Internal Guidance (IG) to help these models produce more realistic and detailed pictures.

What's the problem?

Diffusion models are very good at creating images that resemble their training data, but they struggle to generate high-quality images for things that are rare in that data. Existing fixes, such as 'classifier-free guidance,' can make images look over-simplified or distorted, while other approaches require extra training or additional sampling steps, slowing down image generation.
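For context, classifier-free guidance works by extrapolating the model's conditional noise prediction away from its unconditional one. A minimal sketch of that extrapolation, using NumPy arrays as stand-ins for the two predictions and an illustrative guidance scale `w` (not values from the paper):

```python
import numpy as np

def cfg_guidance(eps_uncond, eps_cond, w=4.0):
    """Classifier-free guidance: push the conditional prediction
    further away from the unconditional one by scale w.
    w = 1.0 recovers the plain conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy scalar "noise predictions" to show the extrapolation:
eps_u = np.array([0.0])  # unconditional prediction
eps_c = np.array([1.0])  # conditional prediction
guided = cfg_guidance(eps_u, eps_c, w=4.0)  # lands past eps_c
```

Large `w` pushes samples toward high-probability regions of the conditional distribution, which is exactly where the over-simplification the authors mention can come from.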

What's the solution?

The researchers developed a method called Internal Guidance (IG). During training, it adds an auxiliary supervision signal on an intermediate layer, so the model is taught not just at its final output but partway through. During sampling, it extrapolates the outputs of the intermediate and deep layers to build the final result. This approach is surprisingly simple but significantly improves both how quickly the model learns and the quality of the images it produces.
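The sampling-time step can be sketched in the same spirit as classifier-free guidance: treat the intermediate layer's output as a "weaker" prediction and push the deep layer's output away from it. The function below is an illustrative reconstruction, not the authors' implementation; the weight `w` is a hypothetical guidance scale:

```python
import numpy as np

def internal_guidance(x_intermediate, x_deep, w=1.5):
    """Extrapolate the deep layer's output away from the intermediate
    layer's output -- analogous to CFG, but using the model's own
    internal dynamics instead of a separate unconditional pass.
    w = 1.0 recovers the plain deep-layer output."""
    return x_deep + (w - 1.0) * (x_deep - x_intermediate)

# Toy stand-ins for the two layers' predictions:
x_mid = np.zeros(4)   # intermediate-layer output
x_out = np.ones(4)    # deep (final) layer output
guided = internal_guidance(x_mid, x_out, w=1.5)
```

Because both predictions come from a single forward pass of the same network, this avoids the degraded "bad model" copies and extra sampling steps that other self-guidance methods require.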

Why it matters?

This work is important because it makes diffusion models better at generating a wider variety of high-quality images, even for things they have seen little of during training. The results show a significant improvement in image quality, achieving state-of-the-art performance on standard image generation benchmarks: it is currently one of the best methods available for creating realistic images with AI.

Abstract

Diffusion models present a powerful ability to capture the entire (conditional) data distribution. However, lacking sufficient training and data to cover low-probability areas, the model is penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier-free guidance (CFG) can guide samples toward high-probability areas during the sampling stage. However, standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding a diffusion model with a degraded version of itself is limited by carefully designed degradation strategies, extra training, and additional sampling steps. In this paper, we propose a simple yet effective strategy, Internal Guidance (IG), which introduces auxiliary supervision on an intermediate layer during the training process and extrapolates the intermediate and deep layers' outputs to obtain generative results during the sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality across various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs, respectively. More impressively, LightningDiT-XL/1+IG achieves FID=1.34, outperforming all of these methods by a large margin. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.