DiTFastAttn: Attention Compression for Diffusion Transformer Models
Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
2024-06-14

Summary
This paper presents DiTFastAttn, a post-training compression method that makes diffusion transformer (DiT) models more efficient at generating images and videos by reducing the cost of their self-attention computation.
What's the problem?
Diffusion transformers are powerful tools for creating images and videos, but they are slow and compute-hungry at inference time. The main culprit is the self-attention mechanism, which lets every token attend to every other token: its cost grows quadratically with the number of tokens, so generating high-resolution images or long videos quickly becomes expensive.
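To make the scaling concrete, here is a small back-of-the-envelope sketch. It is not from the paper; the head dimension, head count, and token counts are illustrative, and it only counts the two big attention matmuls.

```python
# Rough FLOP count for one self-attention layer: the Q @ K^T matmul and the
# (attention weights) @ V matmul each cost about 2 * N^2 * d operations per
# head, so the total grows quadratically in the token count N.
def attention_flops(num_tokens: int, head_dim: int = 64, num_heads: int = 16) -> int:
    per_head = 2 * num_tokens**2 * head_dim      # Q @ K^T
    per_head += 2 * num_tokens**2 * head_dim     # softmax(QK^T) @ V
    return per_head * num_heads

# Doubling the image resolution quadruples the token count, so attention
# FLOPs grow roughly 16x between consecutive entries below.
for tokens in (1024, 4096, 16384):               # e.g. 32x32, 64x64, 128x128 latent grids
    print(f"{tokens:6d} tokens -> {attention_flops(tokens) / 1e9:8.1f} GFLOPs")
```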
What's the solution?
To solve this problem, the authors developed DiTFastAttn, which compresses the attention computation in three ways. First, Window Attention with Residual Caching exploits the fact that many attention heads mostly capture local information: those heads switch to cheap windowed attention, and a cached residual preserves their long-range contribution. Second, Temporal Similarity Reduction exploits the high similarity between attention outputs at neighboring denoising steps, reusing results instead of recomputing them. Third, Conditional Redundancy Elimination skips redundant computation during conditional generation, where the conditional and unconditional passes produce very similar attention outputs. Together, these techniques cut the attention workload and speed up generation.
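Below is a minimal PyTorch sketch of the first idea, Window Attention with Residual Caching. The class and function names are hypothetical, not the authors' released code, and the dense window mask is for illustration only; an actual deployment would use a sparse windowed-attention kernel so that the windowed step is genuinely cheaper than full attention.

```python
import torch
import torch.nn.functional as F

def window_mask(n_tokens: int, window: int, device=None) -> torch.Tensor:
    """Boolean mask letting each token attend only to tokens within `window` positions."""
    idx = torch.arange(n_tokens, device=device)
    return (idx[None, :] - idx[:, None]).abs() <= window  # True = may attend

class WindowAttentionWithResidualCache:
    """Illustrative sketch: full attention at some steps, windowed attention
    plus a cached long-range residual at the remaining steps."""

    def __init__(self, window: int = 128):
        self.window = window
        self.residual = None  # cached (full - windowed) output, i.e. the long-range part

    def full_step(self, q, k, v):
        # Run both full and windowed attention once and cache their difference.
        mask = window_mask(q.shape[-2], self.window, q.device)
        out_full = F.scaled_dot_product_attention(q, k, v)
        out_window = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        self.residual = out_full - out_window
        return out_full

    def cached_step(self, q, k, v):
        # Later steps pay only for windowed attention and reuse the residual.
        mask = window_mask(q.shape[-2], self.window, q.device)
        out_window = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return out_window + self.residual
```

The saving comes from running `full_step` only occasionally and `cached_step` the rest of the time; which layers and steps get which treatment is a separate decision not shown in this sketch.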
Why it matters?
This research is significant because it makes diffusion transformer models faster and more efficient without sacrificing quality. By improving how these models handle attention computations, DiTFastAttn can help make advanced image and video generation more accessible for various applications, such as art creation, video games, and virtual reality.
Abstract
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention's quadratic complexity. We propose DiTFastAttn, a novel post-training compression method to alleviate DiT's computational bottleneck. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between neighboring steps' attention outputs; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. To tackle these redundancies, we propose three techniques: (1) Window Attention with Residual Caching to reduce spatial redundancy; (2) Temporal Similarity Reduction to exploit the similarity between steps; (3) Conditional Redundancy Elimination to skip redundant computations during conditional generation. To demonstrate the effectiveness of DiTFastAttn, we apply it to DiT and PixArt-Sigma for image generation tasks and to OpenSora for video generation tasks. Evaluation results show that for image generation, our method reduces up to 88% of the FLOPs and achieves up to 1.6x speedup for high-resolution generation.
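To illustrate the other two techniques described above, here is a hedged Python sketch of how step-level reuse could look inside a classifier-free-guidance sampling loop. The class, method names, and batch layout (conditional half first) are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

class SharedAttention:
    """Illustrative wrapper showing temporal and conditional attention reuse."""

    def __init__(self):
        self.cached_output = None  # attention output from the previous timestep

    def forward(self, q, k, v, *, reuse_previous_step=False, share_cfg_branches=False):
        # q, k, v: (batch, heads, tokens, dim); with classifier-free guidance the
        # batch is assumed to stack [conditional ; unconditional] along dim 0.
        if reuse_previous_step and self.cached_output is not None:
            # Temporal Similarity Reduction: neighboring steps' attention outputs
            # are highly similar, so reuse the cached output outright.
            return self.cached_output

        if share_cfg_branches:
            # Conditional Redundancy Elimination: run attention only on the
            # conditional half and copy its output to the unconditional half.
            half = q.shape[0] // 2
            out_cond = F.scaled_dot_product_attention(q[:half], k[:half], v[:half])
            out = torch.cat([out_cond, out_cond], dim=0)
        else:
            out = F.scaled_dot_product_attention(q, k, v)

        self.cached_output = out
        return out
```

In this sketch the sampler decides, per layer and per step, whether to pass `reuse_previous_step` or `share_cfg_branches`; both flags trade a small approximation error for skipping an attention computation.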