Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang

2025-03-25


Summary

This paper is about getting AI image generators to produce ultra-high-resolution 4K images, with the kind of detail you'd see on a 4K TV.

What's the problem?

There was no public dataset of 4K images for training AI to make these ultra-high-resolution pictures, and there was no good way to judge how well the AI captured fine details.

What's the solution?

The researchers built Aesthetic-4K, a new set of high-quality 4K images with captions written by GPT-4o, along with new scores that measure the fine detail in AI-generated pictures. They also developed a wavelet-based fine-tuning method that trains the AI directly on 4K images.

Why does it matter?

This work matters because it helps AI create more realistic and detailed images, which could be useful in areas like movies, games, and even medical imaging.

Abstract

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include:

(1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images.

(2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images.

Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
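The abstract names a GLCM Score for fine detail but does not define it. As a rough illustration only, the sketch below builds a normalized gray-level co-occurrence matrix (GLCM) for a one-pixel offset and summarizes it with the contrast statistic, which rises as neighboring pixels differ more. The function names, the number of quantization levels, the offset, and the choice of contrast as the summary statistic are all assumptions; the paper's actual GLCM Score may use different statistics or settings. It is written in plain Python so it runs without extra dependencies.

```python
def glcm(img, levels=8, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix of a 2D list of 0..255 values.

    Entry (i, j) is the fraction of pixel pairs where the pixel falls in bin i
    and its neighbor at offset (dx, dy) falls in bin j. Hypothetical helper;
    only non-negative offsets are supported in this sketch.
    """
    if dx < 0 or dy < 0:
        raise ValueError("offsets must be non-negative")
    # Quantize 8-bit intensities into `levels` equal-width bins.
    q = [[v * levels // 256 for v in row] for row in img]
    h, w = len(q), len(q[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(h - dy):
        for x in range(w - dx):
            counts[q[y][x]][q[y + dy][x + dx]] += 1
            total += 1
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    """Sum of P(i, j) * (i - j)^2; larger when neighboring pixels differ more."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Usage: a flat patch has zero contrast; a checkerboard hits the 8-level maximum.
flat = [[128] * 16 for _ in range(16)]
checker = [[255 if (x + y) % 2 else 0 for x in range(16)] for y in range(16)]
print(glcm_contrast(glcm(flat)), glcm_contrast(glcm(checker)))  # prints: 0.0 49.0
```

Texture-heavy 4K content such as foliage or fabric would land between these two extremes; a detail metric built on the GLCM rewards images whose local intensity variation has not been smoothed away.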
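The Compression Ratio metric rests on a simple idea: fine detail adds information that a compressor cannot squeeze out, so detailed images compress less. A minimal sketch, assuming zlib (lossless, in Python's standard library) as a stand-in codec and raw 8-bit pixel bytes as input; the codec, quality settings, and exact ratio definition used in the paper are not given in the abstract, and `compression_ratio` is a hypothetical name.

```python
import os
import zlib


def compression_ratio(pixels: bytes, level: int = 9) -> float:
    """Uncompressed size divided by compressed size.

    A lower ratio means the data resisted compression, which suggests more
    fine detail. zlib is a lossless stand-in for whatever codec the paper uses.
    """
    return len(pixels) / len(zlib.compress(pixels, level))


# Usage: a flat 64x64 gray patch versus random bytes of the same size.
flat = bytes([128]) * (64 * 64)
noisy = os.urandom(64 * 64)
print(compression_ratio(flat) > compression_ratio(noisy))  # True
```

Random noise is the extreme case and compresses to roughly its own size; real photographs fall in between, which is what lets the ratio separate crisp 4K output from blurry upsampled output.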
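The abstract says only that the fine-tuning is "wavelet-based". One common way to bring wavelets into a diffusion training loss is to decompose the model's prediction and its training target into wavelet subbands and compare them band by band. The sketch below does this with one level of an orthonormal 2D Haar transform on a single 2D channel. The function names, the per-band `weights` parameter, and the choice of Haar are illustrative assumptions, not the paper's confirmed formulation.

```python
def haar_dwt2(x):
    """One level of the orthonormal 2D Haar transform of a 2D list.

    Height and width must be even. Returns four half-size subbands:
    (approximation, left-right detail, top-bottom detail, diagonal detail).
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = ([], [], [], [])
    for i in range(0, h, 2):
        rows = ([], [], [], [])
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]          # top-left, top-right
            c, d = x[i + 1][j], x[i + 1][j + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + c + d) / 2)  # low-frequency average
            rows[1].append((a - b + c - d) / 2)  # left-right change
            rows[2].append((a + b - c - d) / 2)  # top-bottom change
            rows[3].append((a - b - c + d) / 2)  # diagonal change
        for band, row in zip(bands, rows):
            band.append(row)
    return bands


def wavelet_loss(pred, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted squared error summed over Haar subbands, divided by pixel count.

    `pred` could be a model output and `target` its training target (for
    example noise or velocity) for one latent channel; `weights` is a
    hypothetical knob for emphasizing the detail bands.
    """
    n = len(pred) * len(pred[0])
    total = 0.0
    for wt, p_band, t_band in zip(weights, haar_dwt2(pred), haar_dwt2(target)):
        sse = sum((p - t) ** 2 for p_row, t_row in zip(p_band, t_band)
                  for p, t in zip(p_row, t_row))
        total += wt * sse
    return total / n


# Usage on a single 2x2 block.
pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
print(wavelet_loss(pred, target))                                # 7.5
print(wavelet_loss(pred, target, weights=(0.0, 1.0, 1.0, 1.0)))  # 1.25
```

Because the transform is orthonormal, equal weights give exactly the plain mean squared error (Parseval's theorem): 7.5 above is the ordinary MSE of the block. In this sketch, any extra emphasis on fine detail therefore comes from raising the weights on the three detail bands relative to the approximation band.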
