Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers

Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan

2025-12-19

Summary

This paper introduces a new way to make Diffusion Transformers, which are really good at creating images, run much faster and at higher resolutions. It focuses on improving how these models pay 'attention' to different parts of an image when generating it.

What's the problem?

Diffusion Transformers are powerful, but they become incredibly slow when dealing with long sequences of information, like the pixels in a high-resolution image. Existing methods to speed them up by focusing on only the most important parts still struggle because figuring out which parts are important takes a lot of time, and they need to look at more and more parts as the image gets bigger to maintain quality. Essentially, they're trying to simplify a complex image, but the simplification process itself becomes a bottleneck.
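To see why this gets slow so quickly, here is a tiny illustrative calculation (not from the paper): standard self-attention compares every token against every other token, so the score matrix has n x n entries, and doubling the image's side length (4x the tokens) multiplies the cost by 16x.

```python
# Illustrative only: count attention score entries for a pixel-space
# model with one token per pixel (an assumption for this sketch).
def attention_score_entries(height: int, width: int) -> int:
    n = height * width          # number of tokens
    return n * n                # dense attention compares all pairs

for side in (64, 128, 256):
    n = side * side
    print(f"{side}x{side}: {n} tokens, {attention_score_entries(side, side):,} score entries")
```

At 256x256 pixels this is already over four billion score entries per attention layer, which is the quadratic wall the paper targets.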

What's the solution?

The researchers developed a technique called Log-linear Sparse Attention (LLSA). Think of it like organizing information in a pyramid. LLSA first looks at the image in a very broad way, identifying key areas. Then, it zooms in on those areas, looking at more detail, and repeats this process. This hierarchical approach drastically reduces the amount of computation needed because it doesn't have to compare every single pixel to every other pixel. They also created a way to efficiently implement this on graphics cards, making it even faster.
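The coarse-to-fine idea can be sketched in a few lines. This is a hypothetical two-level toy version, loosely following the paper's description of hierarchical Top-K selection; the block sizes, K values, and mean-pooling compression here are illustrative assumptions, not the authors' exact design.

```python
# Toy two-level hierarchical Top-K selection (illustrative sketch).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    # Compress a group of key vectors into one coarse representative.
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def topk(scores, k):
    # Indices of the k highest scores.
    return sorted(range(len(scores)), key=scores.__getitem__)[-k:]

def hierarchical_topk(q, keys, block=4, k_coarse=2, k_fine=2):
    """Return the key indices a query attends to; len(keys) must be
    divisible by block**2. Selection is refined level by level instead
    of scoring every key against the query."""
    b2 = block * block
    # Level 1: score coarse blocks (mean-pooled groups of block**2 keys).
    coarse = [mean_pool(keys[i:i + b2]) for i in range(0, len(keys), b2)]
    selected = []
    for c in topk([dot(q, v) for v in coarse], k_coarse):
        # Level 2: only inside the winning coarse blocks, score fine blocks.
        fine = [mean_pool(keys[c * b2 + f * block: c * b2 + (f + 1) * block])
                for f in range(block)]
        for f in topk([dot(q, v) for v in fine], k_fine):
            start = c * b2 + f * block
            selected.extend(range(start, start + block))
    return sorted(selected)

# Example: with 64 keys, only 16 are ever attended to at full detail.
keys = [[float(i)] for i in range(64)]
print(hierarchical_topk([1.0], keys))
```

The point of the hierarchy is that each level only scores the children of blocks that survived the previous level, so the selection cost grows log-linearly with sequence length rather than quadratically.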

Why it matters?

This work is important because it allows for the creation of much higher-resolution images using Diffusion Transformers without requiring massive amounts of computing power. It makes these powerful image generation models more practical and accessible, opening the door to even more realistic and detailed image creation. It's a significant step towards efficiently handling long sequences of data in these types of models, which has implications beyond just images.

Abstract

Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA