DiP: Taming Diffusion Models in Pixel Space
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai
2025-12-01
Summary
This paper introduces DiP, a new way to create images using diffusion models that aims to be both high quality and fast.
What's the problem?
Existing image generation methods face a challenge: achieving both high image quality *and* fast generation. Latent Diffusion Models are fast, but they can lose some detail because they work with compressed image data. Other methods that work directly with the full image are very slow, especially when generating high-resolution pictures.
What's the solution?
DiP solves this by breaking down image creation into two steps. First, a 'Diffusion Transformer' builds the overall structure of the image using larger blocks. Then, a smaller, faster component called a 'Patch Detailer Head' adds in the fine details, using information from the surrounding areas. This two-step process is efficient like Latent Diffusion Models, but avoids the detail loss because it works directly with the image pixels without relying on compression.
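The two-stage idea above can be illustrated with a toy sketch: split the image into large patches, let a coarse "backbone" stand-in produce one value per patch (global structure), then let a lightweight "detailer" stand-in add per-pixel residuals from the patch context. The function names and both stages are hypothetical placeholders for illustration, not the paper's actual networks.

```python
import numpy as np

def split_patches(img, p):
    """Split an HxW image into non-overlapping p x p patches, flattened to (N, p*p)."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def merge_patches(patches, H, W, p):
    """Inverse of split_patches: reassemble (N, p*p) patches into an HxW image."""
    grid = patches.reshape(H // p, W // p, p, p).swapaxes(1, 2)
    return grid.reshape(H, W)

def dip_style_step(noisy_img, patch_size=16):
    """Hypothetical two-stage pass mirroring DiP's decoupling (stand-ins, not real models)."""
    patches = split_patches(noisy_img, patch_size)     # (N, p*p)
    # Stage 1: "backbone" stand-in -- one coarse value per large patch.
    coarse = patches.mean(axis=1, keepdims=True)
    context = np.broadcast_to(coarse, patches.shape)
    # Stage 2: "detailer" stand-in -- add damped per-pixel residuals from context.
    refined = context + 0.5 * (patches - context)
    H, W = noisy_img.shape
    return merge_patches(refined, H, W, patch_size)

img = np.random.default_rng(0).normal(size=(64, 64))
out = dip_style_step(img)
print(out.shape)  # (64, 64)
```

The point of the sketch is the division of labor: the expensive pass runs once per large patch, while the cheap per-pixel pass only refines locally, which is how the paper avoids both a VAE and full-resolution attention.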
Why it matters?
DiP is important because it significantly speeds up the generation of high-quality images. It's up to ten times faster than previous methods while increasing the model's size by only a fraction of a percent, and it produces images that are just as good as before, if not better, as measured by standard image quality metrics.
Abstract
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP attains up to 10× faster inference than previous methods while increasing the total number of parameters by only 0.3%, and achieves a 1.79 FID score on ImageNet 256×256.