
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Haosen Yang, Adrian Bulat, Isma Hadji, Hai X. Pham, Xiatian Zhu, Georgios Tzimiropoulos, Brais Martinez

2024-12-02

Summary

This paper introduces FAM Diffusion, a new method for generating high-resolution images using diffusion models while avoiding common problems like repetitive patterns and distortions.

What's the problem?

Diffusion models are great at creating high-quality images, but they usually only work well at the resolution they were trained on. When trying to generate images at different resolutions, they can produce weird patterns and structural issues. Retraining these models for higher resolutions is often too costly and time-consuming.

What's the solution?

FAM Diffusion addresses these problems with two new modules: Frequency Modulation (FM) and Attention Modulation (AM). The FM module works in the Fourier domain, using the low-frequency components of a reference generation to keep the image's global structure consistent, while the AM module ensures that finer textures and details stay coherent across the image. Both modules plug into existing latent diffusion models without any extra training, making the method efficient and easy to adopt.
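The core idea behind frequency modulation can be illustrated with a small sketch. The paper does not publish this exact code; the snippet below is a hypothetical NumPy illustration of the general technique of blending low frequencies (global structure) from a reference with high frequencies (detail) from the current latent, with an assumed ideal low-pass mask and a made-up `cutoff` parameter:

```python
import numpy as np

def frequency_modulate(latent, reference, cutoff=0.25):
    """Illustrative FM-style blend (not the paper's implementation):
    take low frequencies (global structure) from `reference` and
    high frequencies (fine detail) from `latent`.
    `cutoff` is a fraction of the normalized frequency radius."""
    # 2-D FFT of both arrays, shifted so the DC component is centered
    F_lat = np.fft.fftshift(np.fft.fft2(latent))
    F_ref = np.fft.fftshift(np.fft.fft2(reference))

    h, w = latent.shape
    yy, xx = np.mgrid[-(h // 2):(h + 1) // 2, -(w // 2):(w + 1) // 2]
    # Ideal low-pass mask: 1 inside the cutoff radius, 0 outside
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    low_pass = (radius <= cutoff).astype(float)

    # Low frequencies from the reference, high frequencies from the latent
    blended = low_pass * F_ref + (1.0 - low_pass) * F_lat
    return np.real(np.fft.ifft2(np.fft.ifftshift(blended)))
```

In a high-resolution pipeline, `reference` would come from a generation at the model's native training resolution (upsampled), so its low frequencies anchor the global layout while the latent being denoised at the target resolution supplies the fine detail.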

Why it matters?

This research is important because it enhances how we generate high-resolution images, making it easier to create visually appealing content without the typical issues that arise from changing resolutions. By improving image quality and maintaining diversity in generated images, FAM Diffusion can be useful in various fields like art, design, and virtual reality.

Abstract

Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM Diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.