Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

2026-01-23

Summary

This paper investigates whether a new approach to building image generation models, called Representation Autoencoders (RAEs), can be used to create images from text descriptions, extending earlier work that applied RAEs to image generation on ImageNet. It compares RAEs against the more traditional approach based on Variational Autoencoders (VAEs).

What's the problem?

Creating detailed, realistic images from text is hard, especially when the model needs to understand complex ideas and generate diverse images. Existing methods built on VAEs often struggle to produce high-quality results and can become unstable when finetuned for a long time on curated data, overfitting and effectively 'forgetting' what they learned earlier. The question is whether RAEs, which have worked well for image generation on ImageNet, can overcome these challenges in the text-to-image setting.

What's the solution?

The researchers first scaled the RAE decoder to handle the diversity of text-to-image generation by training it on a large mix of web data, synthetic images, and images with text rendered directly on them. They then simplified the RAE design, finding that some of its more complex components stopped being helpful once the models got bigger. Finally, they compared RAE-based and VAE-based diffusion transformers across a range of model sizes, showing that the RAE models consistently performed better during both pretraining and finetuning on high-quality image datasets, and remained far more stable over long finetuning runs. A conceptual sketch of the RAE training setup follows below.
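To make the setup concrete, here is a minimal conceptual sketch of an RAE-style text-to-image training step: a frozen pretrained vision encoder (such as SigLIP-2) produces high-dimensional semantic latents, a diffusion transformer is trained in that latent space with a rectified-flow objective, and a separately trained decoder maps latents back to pixels. The module names, tensor shapes, and the particular flow-matching loss are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class RAETextToImage(nn.Module):
    """Conceptual RAE-style text-to-image trainer (illustrative sketch only)."""

    def __init__(self, frozen_encoder: nn.Module, decoder: nn.Module, dit: nn.Module):
        super().__init__()
        self.encoder = frozen_encoder  # pretrained semantic encoder (e.g. SigLIP-2), kept frozen
        self.decoder = decoder         # trained to map semantic latents back to pixels
        self.dit = dit                 # diffusion transformer operating in the latent space

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # High-dimensional semantic latents, e.g. [batch, num_tokens, channels],
        # rather than the low-channel spatial latents of a typical VAE.
        return self.encoder(images)

    def diffusion_loss(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        x = self.encode(images)                       # clean latents
        eps = torch.randn_like(x)                     # Gaussian noise
        t = torch.rand(x.shape[0], device=x.device).view(-1, 1, 1)
        xt = (1 - t) * x + t * eps                    # rectified flow: t = 1 is pure noise
        v_pred = self.dit(xt, t.flatten(), text_emb)  # predict the velocity eps - x
        return ((v_pred - (eps - x)) ** 2).mean()
```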

Why it matters?

This work shows that RAEs are a promising foundation for building powerful text-to-image models: they are simpler to work with, converge faster, and produce higher-quality images than VAE-based approaches. Because RAEs place generation in the same representation space used for visual understanding, they also open the door to AI systems that can not only generate images from text but also reason about the images they generate, potentially leading to more capable multimodal models.

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
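For readers curious about the "dimension-dependent noise scheduling" the abstract calls critical: a common way to implement such a shift in flow-matching diffusion models is to remap the training timesteps so that higher-dimensional latents are trained at proportionally noisier levels, in the style of Stable Diffusion 3's resolution-dependent shift. The sketch below shows that remapping; tying the shift factor to a latent-dimension ratio (and the particular base dimension) is an assumption for illustration, not necessarily the paper's exact parameterization.

```python
import math
import torch

def shift_timesteps(t: torch.Tensor, latent_dim: int, base_dim: int = 4096) -> torch.Tensor:
    """Remap uniform timesteps t in [0, 1] so that higher-dimensional latents
    are trained at effectively noisier levels, using the SD3-style shift
    t' = a*t / (1 + (a - 1)*t). The choice a = sqrt(latent_dim / base_dim)
    and the default base_dim are illustrative assumptions."""
    a = math.sqrt(latent_dim / base_dim)
    return a * t / (1 + (a - 1) * t)

# Example: RAE latents carry far more dimensions per image than typical VAE
# latents, so their timesteps get pushed toward the noisy end of the schedule.
t = torch.rand(8)
t_shifted = shift_timesteps(t, latent_dim=256 * 768)  # e.g. 256 tokens x 768 channels
```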