DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
2025-11-25
Summary
This paper introduces a new way to create images directly from pixels, aiming for better quality and efficiency than previous methods.
What's the problem?
Existing methods that generate images directly in pixel space are slow because they try to handle both the broad shapes and the fine details of an image at the same time within one complex network. It's like trying to paint an entire landscape and every individual leaf with the same brush: it's not very efficient.
What's the solution?
The researchers developed a system called DeCo that separates the process. First, a core part of the system focuses on creating the overall structure and meaning of the image. Then, a separate, simpler part adds in the detailed textures and features. They also changed how the system learns so that training focuses on the visual frequencies people actually notice and down-weights details the eye barely perceives. This division of labor makes the process faster and more effective.
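To make the division of labor concrete, here is a minimal sketch (not the authors' implementation) of the decoupled design described above: a heavy "semantic" backbone models coarse, low-frequency structure from a patchified view of the noisy image, and a lightweight pixel decoder predicts the full-resolution output conditioned on the backbone's features. All class and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBackbone(nn.Module):
    """Stand-in for the DiT: models coarse, low-frequency structure."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, x_noisy):
        tokens = self.patchify(x_noisy)                    # (B, dim, H/16, W/16)
        b, d, h, w = tokens.shape
        tokens = self.blocks(tokens.flatten(2).transpose(1, 2))
        return tokens.transpose(1, 2).reshape(b, d, h, w)  # semantic feature map

class PixelDecoder(nn.Module):
    """Lightweight decoder: adds high-frequency detail guided by semantics."""
    def __init__(self, dim=256):
        super().__init__()
        self.inp = nn.Conv2d(3, 64, 3, padding=1)
        self.cond = nn.Conv2d(dim, 64, 1)
        self.out = nn.Sequential(nn.GELU(), nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x_noisy, semantics):
        # Upsample semantic guidance to full resolution and fuse with pixels.
        guidance = F.interpolate(self.cond(semantics), size=x_noisy.shape[-2:],
                                 mode="nearest")
        return self.out(self.inp(x_noisy) + guidance)      # predicted velocity

# Usage: the backbone handles global structure, the decoder fills in texture.
x = torch.randn(2, 3, 256, 256)                            # noisy pixels
backbone, decoder = SemanticBackbone(), PixelDecoder()
v_pred = decoder(x, backbone(x))                           # (2, 3, 256, 256)
```

Because the backbone only sees a heavily downsampled token grid while the per-pixel work is done by a few cheap convolutions, most of the compute goes to semantics rather than texture, which is the efficiency argument the paper makes.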
Why it matters?
This work is important because it makes direct pixel image generation a more viable option. DeCo achieves image quality that’s very close to more complex methods, but does so more quickly and efficiently, potentially opening the door to faster and better image creation tools.
Abstract
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition of decoupling the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
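The frequency-aware flow-matching loss mentioned in the abstract can be illustrated with a short sketch. The version below simply re-weights the flow-matching residual in the 2-D Fourier domain with a radial low-pass weight; the paper's actual frequency weighting may differ, so treat the weighting scheme, function name, and `alpha` parameter as assumptions for illustration only.

```python
import torch

def frequency_aware_fm_loss(v_pred, x0, x1, alpha=1.0):
    """Flow-matching loss whose residual is re-weighted per frequency.

    v_pred : predicted velocity field, shape (B, C, H, W)
    x0     : Gaussian noise sample (source of the probability path)
    x1     : clean image (target of the probability path)
    alpha  : how strongly high frequencies are down-weighted (illustrative)
    """
    v_target = x1 - x0                       # linear-interpolation path target
    residual = torch.fft.rfft2(v_pred - v_target, norm="ortho")

    # Radial frequency grid; small radius = low (visually salient) frequency.
    fy = torch.fft.fftfreq(v_pred.shape[-2], device=v_pred.device)
    fx = torch.fft.rfftfreq(v_pred.shape[-1], device=v_pred.device)
    radius = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    # Emphasize low frequencies, suppress insignificant high-frequency noise.
    weight = 1.0 / (1.0 + alpha * (radius / radius.max()) ** 2)

    return (weight * residual.abs() ** 2).mean()

# Usage with the hypothetical modules sketched earlier:
# loss = frequency_aware_fm_loss(decoder(x_t, backbone(x_t)), x0, x1)
```

The key design point, regardless of the exact weighting, is that errors in imperceptible high-frequency bands contribute less to the gradient, so the model spends its capacity where image quality is actually judged.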