Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
2024-06-14

Summary
This paper discusses improvements to diffusion models used for generating high-quality images. It introduces a new method called DiMR, which combines a multi-resolution network with a time-dependent normalization technique to enhance image detail and reduce distortions.
What's the problem?
Transformer-based diffusion models tokenize an image into patches, and the cost of self-attention grows quadratically with the number of tokens. Larger patches keep the token count, and thus the computation, manageable, but they discard fine-grained visual details and produce distorted images; smaller patches preserve detail but make attention prohibitively expensive. The challenge is to generate images that are both high-fidelity and computationally efficient.
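As a rough back-of-the-envelope illustration (not taken from the paper), the snippet below shows how patch size controls the token count, and therefore the quadratic self-attention cost, for a 256 x 256 image:

```python
# Token count and self-attention cost for a 256x256 image.
# Self-attention FLOPs scale with tokens**2, so halving the patch size
# quadruples the token count and raises attention cost roughly 16x.
image_size = 256
for patch_size in (2, 4, 8, 16):
    tokens = (image_size // patch_size) ** 2
    print(f"patch {patch_size:2d}: {tokens:6d} tokens, attention cost ~ {tokens ** 2:.1e}")
```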
What's the solution?
To solve this problem, the authors developed the DiMR framework, which processes images at multiple resolutions: it refines features starting at low resolution and progressively sharpens them at higher resolutions, capturing fine detail without paying the full attention cost at every scale. They also introduced Time-Dependent Layer Normalization (TD-LN), a parameter-efficient way to inject diffusion-timestep information into layer normalization. The results show that DiMR significantly outperforms previous diffusion models, setting new state-of-the-art FID scores on standard ImageNet benchmarks.
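This summary does not spell out TD-LN's exact equations, but a minimal PyTorch sketch of one plausible reading, where the layer-norm scale and shift are blended between two learned parameter sets by a lightweight function of the timestep, could look like this (the class name and blend function are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class TDLayerNorm(nn.Module):
    """Hypothetical sketch of time-dependent layer normalization:
    a LayerNorm whose scale and shift depend on the diffusion timestep."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Two learned parameter sets, blended by a scalar function of t (assumption).
        self.gamma0 = nn.Parameter(torch.ones(dim))
        self.gamma1 = nn.Parameter(torch.ones(dim))
        self.beta0 = nn.Parameter(torch.zeros(dim))
        self.beta1 = nn.Parameter(torch.zeros(dim))
        self.time_proj = nn.Linear(1, 1)  # maps t to a blend weight

    def forward(self, x, t):
        # x: (batch, tokens, dim); t: (batch,) timesteps scaled to [0, 1]
        w = torch.sigmoid(self.time_proj(t[:, None]))      # (batch, 1)
        gamma = (1 - w) * self.gamma0 + w * self.gamma1    # (batch, dim)
        beta = (1 - w) * self.beta0 + w * self.beta1
        return self.norm(x) * gamma[:, None, :] + beta[:, None, :]
```

For comparison, adaLN-style conditioning (as in DiT) regresses the scale and shift from a time embedding with an MLP in every block; a scheme along these lines replaces that machinery with only a few parameters per layer, which is the parameter efficiency the paper highlights.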
Why it matters?
This research is important because it enhances the ability of AI to create high-fidelity images, which has applications in fields like art, design, and virtual reality. By improving how these models generate images, DiMR can lead to better visual content creation tools that are more efficient and effective.
Abstract
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity due to the quadratic nature of self-attention operations concerning token length. While larger patch sizes enable attention computation efficiency, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page: https://qihao067.github.io/projects/DiMR
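To make the coarse-to-fine refinement concrete, here is a schematic sketch (hypothetical function and argument names, not the released code) of a multi-resolution denoising pass in which each branch refines the upsampled output of the coarser branch before it:

```python
import torch.nn.functional as F

def multi_resolution_denoise(x_t, t, branches, resolutions=(64, 128, 256)):
    """Coarse-to-fine schematic: each per-resolution branch refines features
    conditioned on the upsampled output of the coarser branch before it.
    `branches` is a list of networks, one per resolution (assumed API)."""
    feat = None
    for branch, res in zip(branches, resolutions):
        x_res = F.interpolate(x_t, size=(res, res), mode="bilinear")
        if feat is not None:
            # Lift the coarser stage's features to the current resolution.
            feat = F.interpolate(feat, size=(res, res), mode="bilinear")
        feat = branch(x_res, t, feat)  # refine detail at this resolution
    return feat  # highest-resolution output (e.g., predicted noise)
```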