LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

2025-11-17

Summary

This paper focuses on making video generation using diffusion transformers faster without sacrificing the quality of the videos produced.

What's the problem?

Creating high-quality videos with diffusion transformers is computationally expensive because of how they handle 'attention'. Specifically, the cost of attention grows with the square of the number of video tokens, so longer or higher-resolution videos quickly become prohibitively slow to generate. Previous attempts to speed this up fell into two camps: dynamic methods, which re-estimate which parts of the video need attention at every denoising step and pay a heavy profiling overhead for it, and static methods, which fix a single sparsity pattern in advance that is often suboptimal as the denoising process evolves.
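To see why 'quadratic' matters, here is a back-of-the-envelope cost model. The function below is an illustrative simplification (not the paper's analysis): it counts the multiply-adds in the two matrix products of standard attention, both of which scale with the square of the sequence length.

```python
def attention_flops(seq_len: int, head_dim: int) -> int:
    """Approximate multiply-adds for one attention head:
    Q @ K^T costs seq_len^2 * head_dim, and attn @ V costs the same."""
    return 2 * seq_len**2 * head_dim

# Doubling the number of video tokens quadruples the attention cost.
print(attention_flops(2048, 64) / attention_flops(1024, 64))  # 4.0
```

This is why adding frames or resolution to a video blows up latency so quickly, and why skipping even a fraction of the attention computation yields substantial end-to-end speedups.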

What's the solution?

The researchers noticed that during video generation, the parts of the video that don't need much attention at one stage usually don't need much attention in the following stages either. They developed a method called LiteAttention that remembers which parts of the video are less important and skips processing them repeatedly. This 'skip' decision is made early on and then carried forward, avoiding the need to constantly re-check. It’s built on top of existing fast attention techniques like FlashAttention to maximize speed.
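The core idea can be sketched in a few lines. The toy code below is an illustrative assumption, not the paper's actual FlashAttention-based kernel: it profiles tile-level attention once at an early denoising step, marks low-scoring tiles as non-essential, and then reuses that skip mask at every later step instead of re-profiling. The function names, tile size, and thresholding rule are all hypothetical.

```python
import numpy as np

def tile_importance(q: np.ndarray, k: np.ndarray, tile: int) -> np.ndarray:
    """Max attention logit per (query-tile, key-tile) block.

    Illustrative stand-in for a real tile-profiling heuristic."""
    nq, nk = q.shape[0] // tile, k.shape[0] // tile
    logits = q @ k.T  # (seq, seq) attention logits
    return logits.reshape(nq, tile, nk, tile).max(axis=(1, 3))

def lite_attention_schedule(qs, ks, tile=4, thresh=0.0):
    """Toy sketch of temporal skip propagation: profile tiles at the
    first denoising step only, then carry the skip mask forward."""
    skip = None
    kept_per_step = []
    for q, k in zip(qs, ks):  # one (q, k) pair per denoising step
        if skip is None:
            # Profile once; later steps inherit this decision,
            # exploiting temporal coherence of the sparsity pattern.
            skip = tile_importance(q, k, tile) < thresh
        kept_per_step.append(int((~skip).sum()))  # tiles actually computed
    return skip, kept_per_step
```

In the real method the skip decisions feed a fused kernel built on FlashAttention, so skipped tiles cost essentially nothing; the sketch only shows the scheduling logic, where the one-time profiling replaces the per-step re-estimation that made earlier dynamic methods slow.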

Why it matters?

This work is important because it significantly speeds up video generation using diffusion transformers, making it more practical for real-world applications. It achieves this speedup without any loss in video quality, offering a good balance between efficiency and performance. This could lead to faster creation of videos for entertainment, special effects, and other fields.

Abstract

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step t typically remain so at step t+δ. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.