UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers
Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, Jun Zhu
2025-12-05
Summary
This paper introduces UltraImage, a new method for creating high-resolution images using diffusion transformers, which are a type of AI model. It focuses on improving the quality and detail of images generated at sizes larger than what the model was originally trained on.
What's the problem?
Current image generation models, while good at creating realistic images at certain sizes, often run into problems when asked to create images much larger than those they were trained on. Specifically, they tend to repeat patterns and lose overall image quality, becoming blurry or lacking fine details. This happens because the model's positional encoding no longer matches what it saw during training when scaled up, and its attention gets spread too thin across the larger image.
What's the solution?
The researchers tackled these issues in two main ways. First, they analyzed how the model represents position and found that repeating patterns were linked to a dominant frequency in that representation, whose period matches the training resolution. They corrected this frequency so that extrapolated positions stay within a single period, preventing repetition. Second, they developed a way to focus the model's attention more effectively, sharpening details in local areas while maintaining the overall structure of the image. This 'entropy-guided adaptive attention concentration' helps the model concentrate on what's important for a clear, high-quality image.
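The frequency-correction idea can be illustrated on standard rotary positional embeddings (RoPE). The paper's actual correction is recursive and applied per axis inside the diffusion transformer; the sketch below (function names and the single-step rescaling rule are our assumptions, not the authors' implementation) just shows the core move: find the frequency whose period roughly matches the training resolution and stretch that period so the extrapolated positions no longer wrap around.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard 1-D RoPE inverse frequencies, one per channel pair."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def correct_dominant_frequency(freqs, train_len, target_len):
    """Simplified single-step sketch of dominant frequency correction.

    The dominant frequency is taken to be the one whose period is
    closest to the training resolution; its period is stretched by
    target_len / train_len so positions up to target_len fall within
    a single period (no wrap-around, hence no content repetition).
    """
    periods = 2 * np.pi / freqs
    dominant = np.argmin(np.abs(periods - train_len))  # period ~ training res
    freqs = freqs.copy()
    freqs[dominant] *= train_len / target_len          # longer period, no wrap
    return freqs

# Example: extrapolating from a 1328-pixel training axis to 4096 pixels.
freqs = rope_frequencies(64)
fixed = correct_dominant_frequency(freqs, train_len=1328, target_len=4096)
```

Only the dominant frequency is touched; the higher frequencies, which encode fine-grained relative offsets, are left as-is.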
Why it matters?
UltraImage is important because it allows for the generation of much larger and more detailed images than previously possible without relying on tricks like starting with a low-resolution image and then upscaling it. This opens up possibilities for creating high-quality visuals for applications like art, design, and scientific visualization, pushing the boundaries of what AI image generation can achieve.
Abstract
Recent image diffusion transformers achieve high-fidelity generation at their training resolutions, but struggle to generate images beyond those scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at https://thu-ml.github.io/ultraimage.github.io/.
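The attention side of the method can be sketched as temperature scaling driven by per-head attention entropy. The paper's exact focus-factor schedule is not reproduced here; in this illustrative version (the function name, the linear entropy-to-focus mapping, and the factor range are our assumptions), heads with diffuse, high-entropy attention get a larger focus factor that sharpens the softmax, while heads that already attend decisively keep a factor near 1.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_guided_attention(q, k, v, min_focus=1.0, max_focus=2.0):
    """Illustrative entropy-guided attention concentration.

    q, k, v: (heads, seq, dim). Each head's mean attention entropy is
    normalized across heads and mapped linearly to a focus factor in
    [min_focus, max_focus]; the logits are then re-scaled by that
    factor, so diluted (high-entropy) heads are sharpened the most.
    """
    d = q.shape[-1]
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)            # (heads, n, n)
    probs = softmax(logits)
    ent = -(probs * np.log(probs + 1e-9)).sum(-1).mean(-1)  # entropy per head
    ent_norm = (ent - ent.min()) / (ent.max() - ent.min() + 1e-9)
    focus = min_focus + (max_focus - min_focus) * ent_norm  # diffuse -> sharper
    probs = softmax(logits * focus[:, None, None])
    return probs @ v
```

Sharpening only the diluted heads is what lets the method recover fine local detail without disturbing the low-entropy global heads that hold the image's overall structure together.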