Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu
2025-10-15
Summary
This paper tackles the challenge of making image generation models that work directly with pixels as good as those that work with a compressed 'code' of the image. It introduces a new way to train these pixel-based models, specifically diffusion and consistency models, to achieve better quality and efficiency.
What's the problem?
Typically, image generation models that operate directly on pixels are harder to train and produce lower-quality images than models that first compress the image into a smaller representation, often called a 'latent space'. This leaves a noticeable gap in both generation quality and efficiency. Pixel-space models struggle to learn the complex patterns needed for realistic image creation.
What's the solution?
The researchers developed a two-stage training process. First, they trained a component called an 'encoder' to capture the important features of clean images and map them to points along a predictable path that leads from random noise to a recognizable image. Then, they combined this encoder with a randomly initialized 'decoder' and fine-tuned the entire system end-to-end, for both diffusion and consistency models. This allows the model to learn how to generate images directly from pixels, guided by the encoder's understanding of image structure.
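The two-stage recipe can be illustrated with a deliberately tiny numerical sketch. Everything below is an assumption for illustration only: the "images" are small random vectors, the encoder and decoder are single linear maps, and the deterministic trajectory is taken to be a simple linear interpolation between noise and data (a flow-matching-style path; the paper's actual networks, trajectory, and losses are not specified in this summary). The sketch shows the structure of the recipe, not the paper's method: stage 1 fits the encoder so clean-image features align with a point on the trajectory, and stage 2 attaches a randomly initialized decoder and updates both jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pixel data (hypothetical sizes, not from the paper).
D_IN, D_LAT, N = 8, 4, 64
x0 = rng.normal(size=(N, D_IN))    # "clean images"
eps = rng.normal(size=(N, D_IN))   # samples from the noise prior
t = 0.5                            # one point along the trajectory

# Assumed deterministic trajectory: linear interpolation noise -> data.
x_t = (1.0 - t) * x0 + t * eps

# Stage 1: pre-train a linear encoder E so that features of CLEAN images
# align with a fixed projection of the trajectory point x_t (the fixed
# projection P is an illustrative choice, not the paper's target).
P = rng.normal(size=(D_IN, D_LAT)) / np.sqrt(D_IN)
target = x_t @ P
E = rng.normal(size=(D_IN, D_LAT)) * 0.1

lr, stage1 = 0.05, []
for _ in range(200):
    err = x0 @ E - target
    stage1.append(float((err ** 2).mean()))
    E -= lr * (x0.T @ err) / N     # gradient step on the alignment loss

# Stage 2: attach a randomly initialized linear decoder and fine-tune
# encoder + decoder end-to-end to recover x0 from the noisy input x_t.
Dec = rng.normal(size=(D_LAT, D_IN)) * 0.1
stage2 = []
for _ in range(200):
    z = x_t @ E
    err = z @ Dec - x0
    stage2.append(float((err ** 2).mean()))
    Dec -= lr * (z.T @ err) / N              # decoder gradient step
    E -= lr * (x_t.T @ (err @ Dec.T)) / N    # encoder gradient step

print("stage 1 loss fell:", stage1[-1] < stage1[0])
print("stage 2 loss fell:", stage2[-1] < stage2[0])
```

Both loss curves decrease, mirroring the overall shape of the recipe: a semantics-alignment objective for the encoder first, then joint end-to-end fine-tuning with a fresh decoder.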
Why it matters?
This work is significant because it demonstrates that pixel-space models *can* match, and even surpass, latent-space models, without the need for complex pre-training with other models like VAEs. This is especially important for generating high-resolution images, as it opens the door to more efficient and high-quality image generation directly from pixel data, and represents the first successful training of a consistency model directly on high-resolution images.
Abstract
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.