DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

2026-02-20

Summary

This paper introduces a way to greatly speed up image and video generation with Diffusion Transformers, a powerful but computationally expensive technique.

What's the problem?

Diffusion Transformers are really good at creating images and videos, but they take a lot of computing power and time. This is because they always break the image into same-sized pieces at every step of the process, even when some parts of the image are simple and don't need as much detail as others. It's like using a magnifying glass on everything, even things you can already see clearly.

What's the solution?

The researchers came up with a method called 'dynamic tokenization'. Instead of using fixed-size pieces, the system changes the size of the pieces it uses depending on how complex the image is and how far along the generation process is. Early on, when the overall structure is being formed, it uses larger pieces. Later, when details are being added, it uses smaller pieces. This allows the system to focus its efforts where they're needed most, making it more efficient.
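The idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the function names, the 50/50 coarse-to-fine split, and the patch sizes 4 and 2 are all assumptions chosen just to show how a timestep-dependent patch schedule reduces token count early on.

```python
# Hypothetical sketch of a timestep-dependent patch-size schedule.
# All names and thresholds here are illustrative assumptions,
# not taken from the DDiT paper.

def patch_size_for_step(step: int, total_steps: int,
                        coarse: int = 4, fine: int = 2) -> int:
    """Return the patch size to use at a given denoising step.

    Early steps (global structure) use the coarse size; later steps
    (local detail) switch to the fine size.
    """
    # Fraction of the denoising trajectory completed so far.
    progress = step / total_steps
    return coarse if progress < 0.5 else fine

def tokens_per_frame(height: int, width: int, patch: int) -> int:
    """Number of patch tokens for one latent frame."""
    return (height // patch) * (width // patch)

# Example: a 64x64 latent with a 50-step sampler. Coarse patches
# quarter the side length of the token grid, so the early steps
# process far fewer tokens than the late, detail-refining steps.
schedule = [patch_size_for_step(s, 50) for s in range(50)]
early_tokens = tokens_per_frame(64, 64, schedule[0])    # 16 * 16 = 256
late_tokens = tokens_per_frame(64, 64, schedule[-1])    # 32 * 32 = 1024
```

Because Transformer attention cost grows roughly quadratically with the number of tokens, halving the token grid's side length early in sampling cuts the cost of those steps substantially.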

Why it matters?

This new approach significantly speeds up the image and video generation process – up to 3.5 times faster on some models – without sacrificing the quality of the generated images or how well they match the original instructions. This means we can create high-quality images and videos more quickly and with less computing power, making this technology more accessible and practical.

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to 3.52× and 3.2× speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.
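To make the tokenization step concrete, here is a minimal NumPy sketch of patchifying a latent with a variable patch size, in the style of the standard DiT patch embedding. This is a hypothetical illustration under assumed shapes (a 16-channel 64×64 latent, patch sizes 4 and 2), not the paper's implementation.

```python
import numpy as np

def patchify(latent: np.ndarray, patch: int) -> np.ndarray:
    """Split a (C, H, W) latent into an (N, C*patch*patch) token sequence.

    Assumes H and W are divisible by `patch`. This mirrors the usual
    DiT patch embedding, except `patch` is now a free parameter that
    can be chosen per denoising step.
    """
    c, h, w = latent.shape
    tokens = (latent
              .reshape(c, h // patch, patch, w // patch, patch)
              .transpose(1, 3, 0, 2, 4)          # group patches together
              .reshape((h // patch) * (w // patch), c * patch * patch))
    return tokens

# Illustrative shapes only: a 16-channel, 64x64 latent.
latent = np.zeros((16, 64, 64), dtype=np.float32)
coarse = patchify(latent, 4)   # 256 tokens, each of dimension 256
fine = patchify(latent, 2)     # 1024 tokens, each of dimension 64
```

With self-attention scaling roughly as the square of the sequence length, the coarse tokenization (256 tokens) versus the fine one (1024 tokens) implies about a 16× difference in attention cost, which is where the reported speedups come from.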