Scaling Diffusion Transformers Efficiently via μP
Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
2025-05-23
Summary
This paper extends Maximal Update Parametrization (μP), a technique originally developed for standard Transformers, to diffusion Transformers, the AI models behind many image-generation systems. μP prescribes how a model's settings should change as the model grows, which makes large models more efficient to train and far easier to tune.
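Concretely, μP is a set of rules for how initialization and learning rates should scale as a network gets wider. The sketch below illustrates one of those rules for the Adam optimizer (learning rates for weight matrices shrink in proportion to 1/width). It is a simplified illustration, not the paper's exact recipe: the helper `mup_param_groups` is hypothetical, and full μP also treats input/embedding and output layers specially.

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_width: int, width: int, lr: float):
    """Hypothetical helper: muP-style Adam parameter groups.

    Core muP rule for Adam: learning rates of weight matrices shrink as
    1/width relative to a tuned narrow base model, while vector parameters
    (biases, norm gains) keep a width-independent learning rate. Full muP
    also handles input/embedding and readout layers specially; this sketch
    lumps all matrices together for brevity.
    """
    matrices = [p for p in model.parameters() if p.ndim >= 2]
    vectors = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrices, "lr": lr * base_width / width},  # lr shrinks as 1/width
        {"params": vectors, "lr": lr},                        # width-independent
    ]

# E.g., reuse a learning rate tuned at width 256 on a width-1024 model:
# optimizer = torch.optim.Adam(mup_param_groups(model, 256, 1024, tuned_lr))
```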
What's the problem?
When building and training powerful AI models like diffusion Transformers, a lot of time and compute goes into finding the right settings, called hyperparameters (things like the learning rate). These settings usually have to be re-tuned from scratch for every new model size or task, which is slow and expensive.
What's the solution?
The researchers showed that, with μP, the best hyperparameters found on a small, cheap model transfer directly to much larger ones. Each new model size therefore no longer needs its own expensive tuning sweep, saving both time and computing resources.
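As a rough illustration of that workflow, the sketch below uses Microsoft's open-source `mup` package (`pip install mup`), which implements μP for PyTorch models. `TinyNet`, the widths, and the learning rate are placeholders for this example, not the paper's actual diffusion Transformer or settings.

```python
import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes

class TinyNet(nn.Module):
    """Toy stand-in for a diffusion Transformer backbone."""
    def __init__(self, width: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(64, width), nn.GELU(),
            nn.Linear(width, width), nn.GELU(),
        )
        self.head = MuReadout(width, 64)  # muP-aware output layer

    def forward(self, x):
        return self.head(self.body(x))

# Register base shapes so mup knows which dimensions grow with width.
model = TinyNet(width=2048)  # the large target model
set_base_shapes(model, TinyNet(width=128), delta=TinyNet(width=256))

# A learning rate swept cheaply on a narrow proxy transfers to the wide model.
tuned_lr = 3e-4  # placeholder value found on the small model
optimizer = MuAdam(model.parameters(), lr=tuned_lr)
```

(The `mup` package also provides μP-aware initialization helpers; they are omitted here for brevity.)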
Why it matters?
This matters because hyperparameter tuning is one of the big hidden costs of building large models. Making it transferable lets researchers and companies develop and improve advanced models faster and more cheaply, which helps bring new AI capabilities to everyone sooner.
Abstract
Maximal Update Parametrization (μP) is extended to diffusion Transformers; experiments demonstrate that hyperparameters transfer efficiently from small to large models, reducing tuning costs across various models and tasks.