Fast Video Generation with Sliding Tile Attention

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang

2025-02-10

Summary

This paper talks about making AI video generation much faster by replacing the slow "look at everything" attention step with Sliding Tile Attention (STA), which only looks at nearby parts of the video in space and time, cutting generation time sharply without hurting quality.

What's the problem?

State-of-the-art video generators (Diffusion Transformers with 3D full attention) compare every piece of the video with every other piece across space and time, so the cost explodes as clips get longer and sharper. Generating just a 5-second 720P video takes about 945 seconds, and roughly 800 of those seconds are spent on attention alone, which makes high-quality video generation painfully slow and expensive.
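
To get a feel for why attention dominates, here is a rough back-of-the-envelope cost sketch. The latent grid size, head dimension, and head count below are illustrative assumptions, not HunyuanVideo's exact configuration; the point is that full attention cost grows with the square of the token count, while a local window grows roughly in proportion to the window size.

```python
# Rough cost model for 3D full attention (illustrative numbers only,
# not the exact HunyuanVideo configuration).

def attention_flops(num_tokens: int, head_dim: int, num_heads: int) -> float:
    """FLOPs for one attention layer: QK^T and AV each cost ~2*L^2*d per head."""
    return 4.0 * num_tokens**2 * head_dim * num_heads

# Assumed latent video grid for a 5-second 720P clip: frames x height x width.
T, H, W = 30, 45, 80              # hypothetical latent dimensions
L = T * H * W                     # sequence length = number of video tokens

full = attention_flops(L, head_dim=128, num_heads=24)

# If each query only attends to a local 3D window covering ~10% of the tokens,
# the cost drops roughly in proportion to that fraction.
local = full * 0.10

print(f"tokens: {L:,}")
print(f"full attention : {full / 1e12:.0f} TFLOPs per layer")
print(f"local attention: {local / 1e12:.0f} TFLOPs per layer")
```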

What's the solution?

The researchers noticed that in pretrained video diffusion models, attention scores mostly concentrate inside small, localized 3D windows, so comparing every token with every other token is largely wasted work. Their method, Sliding Tile Attention (STA), slides a local spatial-temporal window over the video and attends only inside it. Unlike traditional token-wise sliding window attention (SWA), STA moves tile-by-tile with a hardware-aware design, which keeps the GPU working on dense, efficient blocks (reaching 58.79% MFU) while preserving the model's expressiveness. A simplified sketch of the tile-level idea is shown below.
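
Here is a minimal sketch of the tile-level masking idea, assuming tokens are stored tile-contiguously and using plain PyTorch with an explicit mask. The tile sizes, window radii, and toy dimensions are made up for illustration; the actual method runs as a fused, hardware-aware kernel rather than materializing a mask like this.

```python
import torch
import torch.nn.functional as F

def sliding_tile_mask(grid, tile, window):
    """Token-level attention mask where each query tile attends only to key
    tiles inside a local 3D window of tiles.

    grid   -- token grid (T, H, W); assumes tokens are stored tile-contiguously
    tile   -- tile size per axis (tt, th, tw); must evenly divide grid
    window -- window radius per axis, measured in tiles
    """
    n_tiles = [g // t for g, t in zip(grid, tile)]
    # 3D coordinates of every tile, flattened in (T, H, W) order.
    coords = torch.stack(torch.meshgrid(
        torch.arange(n_tiles[0]), torch.arange(n_tiles[1]), torch.arange(n_tiles[2]),
        indexing="ij"), dim=-1).reshape(-1, 3)                     # (num_tiles, 3)
    # Tile i attends to tile j iff they are within the window radius on all axes.
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    tile_mask = (diff <= torch.tensor(window)).all(dim=-1)         # (num_tiles, num_tiles)
    # Expand to token level: every token in a tile shares that tile's mask row.
    tokens_per_tile = tile[0] * tile[1] * tile[2]
    return (tile_mask.repeat_interleave(tokens_per_tile, dim=0)
                     .repeat_interleave(tokens_per_tile, dim=1))

# Toy usage: an 8x16x16 token grid, 4x4x4 tiles, +/-1 tile window (assumed sizes).
grid, tile, window = (8, 16, 16), (4, 4, 4), (1, 1, 1)
L = grid[0] * grid[1] * grid[2]
q = k = v = torch.randn(1, 4, L, 64)            # (batch, heads, tokens, head_dim)
mask = sliding_tile_mask(grid, tile, window)    # True = allowed to attend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape, f"mask density: {mask.float().mean():.1%}")
```

Roughly speaking, sliding over whole tiles rather than individual tokens is what makes the design "hardware-aware": the sparsity has block structure, so a kernel can keep or skip entire dense blocks instead of handling the ragged per-token windows that SWA produces.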

Why it matters?

This matters because attention is the main bottleneck in today's best video generators. On HunyuanVideo, STA cuts end-to-end generation time from 945 seconds to 685 seconds with no quality loss and no retraining, and to 268 seconds with light finetuning at only a 0.09% drop on VBench. Faster, cheaper generation makes high-quality video diffusion models far more practical to use and build on.

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.
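
One way to read the latency numbers above is as an Amdahl's-law calculation: only the attention part gets faster, so end-to-end gains are capped by the roughly 145 seconds of non-attention work. The small sketch below uses only the figures quoted in the abstract and assumes everything outside attention stays fixed.

```python
# Translating attention-only speedups into end-to-end latency, using the
# figures quoted in the abstract: 945 s total, ~800 s of it in attention.
total_s, attn_s = 945.0, 800.0
other_s = total_s - attn_s                 # ~145 s of non-attention work

def end_to_end(attn_speedup: float) -> float:
    """End-to-end latency if only the attention portion is accelerated."""
    return other_s + attn_s / attn_speedup

# Under this simple model, the training-free result (685 s) corresponds to
# roughly a 1.5x attention speedup, and the finetuned 268 s to roughly 6.5x.
for speedup in (1.5, 3.0, 6.5):
    print(f"{speedup:>4.1f}x faster attention -> ~{end_to_end(speedup):.0f} s end to end")
```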