SPRINT: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
2025-10-28
Summary
This paper introduces SPRINT, a technique that makes Diffusion Transformers, the powerful models behind modern image generation, much faster and cheaper to train.
What's the problem?
Diffusion Transformers are powerful, but they require a lot of computing power and time to train, especially when dealing with long sequences of data like high-resolution images. Simply removing some of the data during training to speed things up usually makes the final results worse, and existing solutions to this problem are either complicated or don't work well when you remove a lot of data.
What's the solution?
SPRINT divides the work across the layers of the Diffusion Transformer. The early layers process all tokens to capture fine detail, while the deeper layers operate on only a small, selected subset of tokens, which cuts most of the computation. The outputs of the two paths are then fused through residual connections. Training happens in two stages: a long pre-training phase where most tokens are dropped for efficiency, followed by a short fine-tuning phase on all tokens to close the gap with how the model is used at inference. The authors also develop a method called Path-Drop Guidance to speed up image generation.
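The shallow-dense / deep-sparse split can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: `block` is a hypothetical stand-in for a full DiT transformer block, the token subset is chosen uniformly at random, and the 25% keep ratio mirrors the paper's reported 75% drop.

```python
import numpy as np

def block(x, rng):
    # Hypothetical stand-in for a DiT transformer block:
    # a random linear map keeps the sketch self-contained.
    d = x.shape[-1]
    w = rng.standard_normal((d, d)) / np.sqrt(d)
    return x @ w

def sprint_forward(tokens, keep_ratio=0.25, rng=None):
    """Sketch of sparse-dense residual fusion over (N, D) tokens."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1) Shallow path: early layers see ALL tokens (dense).
    dense = block(tokens, rng)
    # 2) Deep path: keep only a sparse subset of tokens (e.g. 25%).
    n_keep = max(1, int(keep_ratio * len(tokens)))
    keep = rng.choice(len(tokens), size=n_keep, replace=False)
    sparse = block(dense[keep], rng)
    # 3) Fuse: scatter the deep-path outputs back into place,
    #    with the dense activations acting as a residual for
    #    every token (dropped tokens pass through unchanged).
    out = dense.copy()
    out[keep] = out[keep] + sparse
    return out

x = np.random.default_rng(1).standard_normal((16, 8))
y = sprint_forward(x, keep_ratio=0.25)
print(y.shape)  # (16, 8): all tokens come out, though only 4 went deep
```

The key point the sketch captures is that the output always covers every token: dropped tokens are carried by the dense shallow path, so the sparse deep path only has to refine a subset.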
Why it matters?
SPRINT is important because it allows researchers and developers to train these powerful Diffusion Transformers much more efficiently, saving time and money. It achieves significant speedups without sacrificing the quality of the generated images, making it a practical solution for large-scale image generation tasks and a step towards making this technology more accessible.
Abstract
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256×256, SPRINT achieves 9.8× training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.