DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal

2025-10-24

Summary

This paper introduces a new technique called Dynamic Position Extrapolation, or DyPE, which lets pre-trained image-generating AI models create much higher-resolution images without being retrained and without adding any extra time to generation.

What's the problem?

Creating incredibly detailed images with diffusion transformer models is really expensive in terms of computing power. This is because the way these models pay attention to different parts of the image gets dramatically slower as the image gets bigger. Specifically, the amount of calculation needed increases proportionally to the *square* of the number of pixels, making ultra-high resolution images impractical to generate.
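
To make that scaling concrete, here is a small back-of-the-envelope sketch (not taken from the paper); the 16x16-pixels-per-token patching is an illustrative assumption:

```python
# Back-of-the-envelope illustration of why self-attention cost explodes with
# resolution: the token count grows linearly with the pixel count, and the
# attention cost grows with the *square* of the token count.
# The 16x16-pixels-per-token figure is an assumption for illustration only.

PIXELS_PER_TOKEN = 16 * 16  # hypothetical effective patch size

def attention_cost(width: int, height: int) -> int:
    """Relative cost of one self-attention layer, proportional to tokens^2."""
    tokens = (width * height) // PIXELS_PER_TOKEN
    return tokens * tokens

base = attention_cost(1024, 1024)   # ~1 megapixel, a typical training size
ultra = attention_cost(4096, 4096)  # ~16 megapixels, the paper's target scale

print(f"4096x4096 attention costs ~{ultra / base:.0f}x more than 1024x1024")
# 16x the pixels -> 16x the tokens -> ~256x the attention cost
```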

What's the solution?

DyPE works by cleverly adjusting how the model understands the 'position' of different parts of the image during the image creation process. Think of it like fine-tuning the model's focus as it builds the image. It recognizes that the broad, basic shapes of an image appear early in the process, while the fine details take longer to emerge. By changing the positional encoding – essentially, the model’s internal map – at each step, DyPE ensures the model is focusing on the right level of detail at the right time, allowing it to generate images far beyond its original training resolution.
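
As a rough illustration of the idea (not DyPE's actual equations), here is a toy Python/NumPy sketch of a rotary positional encoding whose scale is annealed over the diffusion trajectory: early, noisy steps stretch positions toward low frequencies so global layout settles first, and later steps relax back toward the unscaled encoding so fine detail can resolve. The linear schedule, the `max_scale` parameter, and the function names are all illustrative assumptions:

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for a head of dimension `dim`."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def dynamic_rope_angles(positions: np.ndarray, dim: int,
                        t: float, max_scale: float) -> np.ndarray:
    """Position angles at diffusion time t in [0, 1] (1 = pure noise).

    Hypothetical schedule: positions are scaled down by up to `max_scale`
    early in sampling (pushing the encoding toward low frequencies), and the
    scale is annealed back to 1.0 as t -> 0, so the encoding's high-frequency
    components line up with the late, detail-resolving steps.
    """
    scale = 1.0 + (max_scale - 1.0) * t  # assumption: simple linear annealing
    freqs = rope_frequencies(dim)
    return np.outer(positions / scale, freqs)  # shape: (len(positions), dim // 2)

# Example: sampling at 4x the trained sequence length with 64-dim heads.
positions = np.arange(4096, dtype=np.float64)
early_angles = dynamic_rope_angles(positions, dim=64, t=0.9, max_scale=4.0)
late_angles = dynamic_rope_angles(positions, dim=64, t=0.1, max_scale=4.0)
```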

Why it matters?

This is a big deal because it unlocks the potential for creating stunningly detailed images without requiring massive amounts of computing resources. It means researchers and artists can generate images of around 16 million pixels (roughly 4096x4096) with models like FLUX, achieving state-of-the-art quality, and it does so efficiently, without slowing down the image generation process. This makes ultra-high-resolution image generation more accessible and practical.

Abstract

Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.