Diffusion Transformers with Representation Autoencoders
Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
2025-10-14
Summary
This paper focuses on improving how images are prepared for use with Diffusion Transformers, a class of AI models that generate images. It is about finding a better way to translate images into a compact format the model can understand and work with.
What's the problem?
Currently, Diffusion Transformers rely on a relatively old and limited component called a VAE (Variational Autoencoder) to convert images into a simpler 'latent space' representation. This VAE has a few drawbacks: its underlying architecture is outdated, its low-dimensional latent space doesn't capture enough detail from the image, and it's trained only to recreate the original image, not to understand what the image *means*. These limitations ultimately hinder the quality of the images the model can generate.
What's the solution?
The researchers propose replacing the standard VAE with something called a Representation Autoencoder (RAE). RAEs use modern 'representation encoders' – like DINO, SigLIP, and MAE – which are better at understanding the content of an image, paired with trained decoders, creating a richer and more detailed latent space. However, these new latent spaces are much higher-dimensional, making it harder for the Diffusion Transformer to process them. The team analyzed why this happens and developed techniques to help the transformer work effectively in these high-dimensional spaces, achieving faster convergence and better results without needing extra representation-alignment losses during training.
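To make the capacity gap concrete, here is a small illustrative sketch. The specific numbers are typical values rather than figures quoted from the paper: a Stable-Diffusion-style VAE downsamples 8x with 4 latent channels, while a ViT-B representation encoder (like DINO or MAE) emits one 768-dimensional token per 16x16 patch.

```python
from math import prod

def vae_latent_shape(h, w, downsample=8, channels=4):
    # Classic VAE latent: spatially downsampled, very few channels.
    return (h // downsample, w // downsample, channels)

def rae_latent_shape(h, w, patch=16, dim=768):
    # Representation encoder latent: one high-dimensional token per patch
    # (patch=16, dim=768 are typical ViT-B values, used here as an assumption).
    return ((h // patch) * (w // patch), dim)

vae = vae_latent_shape(256, 256)   # (32, 32, 4)
rae = rae_latent_shape(256, 256)   # (256, 768)
print(prod(vae), prod(rae))        # 4096 vs 196608 latent values
```

The RAE latent here holds roughly 48x more values per image, which is exactly why the diffusion transformer needs the adaptations the paper develops to model such a space effectively.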
Why it matters?
This work matters because it significantly improves the quality of images generated by Diffusion Transformers. With RAEs, the model produces more realistic and detailed images, achieving state-of-the-art results on standard image generation benchmarks. The researchers argue that RAEs should become the default method for preparing images for these models.
Abstract
Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.