PixelDiT: Pixel Diffusion Transformers for Image Generation
Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
2025-12-03
Summary
This paper introduces a new way to build image-generating AI models, improving on a class called Diffusion Transformers (DiTs). The goal is to produce higher-quality images more efficiently.
What's the problem?
Current Diffusion Transformers work in two stages: an autoencoder first compresses the image into a latent representation, and the diffusion model then generates from that compressed version. The compression is lossy, so detail is thrown away and errors accumulate, making it hard to produce truly sharp, realistic images. And because the two stages are trained separately, they cannot be optimized jointly.
What's the solution?
The researchers created PixelDiT, a single-stage model that skips the compression step entirely and works directly on raw pixel data. It uses a dual-level transformer design: a patch-level transformer captures the overall semantics of the image, while a pixel-level transformer refines fine texture details. This lets it generate high-quality images directly in pixel space without losing information to an autoencoder.
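To make the dual-level idea concrete, here is a minimal toy sketch, not the authors' implementation: an image is split into patch tokens for a global (patch-level) mixing step, then each patch's pixels get a separate local refinement step. The `ToyPixelDiT` class, its layer names, and the random-projection stand-ins for attention are all hypothetical; only the patchify/unpatchify bookkeeping and the two-level structure reflect the paper's described design.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (H//p * W//p, p*p*C) patch tokens."""
    H, W, C = img.shape
    gh, gw = H // p, W // p
    x = img.reshape(gh, p, gw, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: reassemble patch tokens into an (H, W, C) image."""
    gh, gw = H // p, W // p
    x = tokens.reshape(gh, gw, p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

class ToyPixelDiT:
    """Toy dual-level denoiser (hypothetical): a patch-level step for global
    semantics, then a pixel-level step that refines individual pixels."""
    def __init__(self, p=4, C=3, d=32, seed=0):
        rng = np.random.default_rng(seed)
        self.p, self.C = p, C
        tok = p * p * C
        self.W_in = rng.normal(0, 0.02, (tok, d))   # patch embedding
        self.W_mix = rng.normal(0, 0.02, (d, d))    # stand-in for patch-level attention
        self.W_out = rng.normal(0, 0.02, (d, tok))  # project back to pixel values
        self.W_px = rng.normal(0, 0.02, (C, C))     # stand-in for pixel-level refinement

    def __call__(self, img):
        p, C = self.p, self.C
        H, W, _ = img.shape
        tokens = patchify(img, p)                               # (N, p*p*C)
        h = np.tanh(tokens @ self.W_in)                         # patch-level embed
        h = h + np.tanh(h.mean(0, keepdims=True) @ self.W_mix)  # global mixing
        coarse = h @ self.W_out                                 # coarse pixel prediction
        pix = coarse.reshape(-1, C)                             # one token per pixel
        pix = pix + np.tanh(pix @ self.W_px)                    # local texture refinement
        return unpatchify(pix.reshape(coarse.shape), H, W, C, p)

model = ToyPixelDiT(p=4)
noisy = np.random.default_rng(1).normal(size=(32, 32, 3))
out = model(noisy)
print(out.shape)  # (32, 32, 3)
```

Note that input and output are both raw pixel arrays of the same shape: there is no separate encoder or decoder, which is the single-stage property the paper emphasizes.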
Why it matters?
PixelDiT substantially outperforms other models that work directly with pixels, and its results approach those of the best models that *do* use compression. This matters because it opens the door to simpler, end-to-end, and potentially higher-quality image generation, especially at high resolutions and for creating images from text descriptions.
Abstract
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.