UltraGen: High-Resolution Video Generation with Hierarchical Attention
Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi
2025-10-22
Summary
This paper introduces UltraGen, a framework that lets AI video models generate high-resolution video (1080p up to 4K) natively, without a separate upscaling step.
What's the problem?
Currently, AI video generation produces impressive results, but most models are limited to lower resolutions like 720p. That's because the attention mechanism these models use to relate different parts of a video has a computational cost that grows quadratically with the frame's width and height, so the cost explodes as videos get larger and more detailed. Generating 1080p, 2K, or 4K video directly with existing methods is simply too slow and expensive.
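To make the quadratic blow-up concrete, here is a back-of-envelope calculation. The patch size of 16×16 pixels is a hypothetical assumption (the summary doesn't state how the model tokenizes frames), but the scaling argument holds for any fixed patch size:

```python
# Rough cost model: self-attention compares every token with every other
# token, so its cost scales with tokens**2.
def attention_cost(width, height, patch=16):
    # patch=16 is an assumed (hypothetical) tokenization granularity
    tokens = (width // patch) * (height // patch)
    return tokens * tokens

cost_720p = attention_cost(1280, 720)   # 3,600 tokens per frame
cost_4k = attention_cost(3840, 2160)    # 32,400 tokens per frame
print(cost_4k / cost_720p)  # 81.0 — 9x the tokens, 81x the attention cost
```

Going from 720p to 4K multiplies the pixel count (and token count) by 9, but the attention cost by 81, which is why naive scaling is impractical.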
What's the solution?
UltraGen solves this with a smarter way of paying attention to a video's details. It splits the attention process into two branches: a local branch that focuses on small regions for sharp, high-fidelity detail, and a global branch that looks at the whole video to keep everything semantically consistent. The global branch works on a spatially compressed version of the video to stay efficient, and a hierarchical cross-window mechanism lets the local regions share information with each other, all while cutting the overall compute required.
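The summary gives no equations, so the following NumPy sketch only illustrates the general idea of a global-local attention decomposition: full attention inside small windows, plus attention from every token to a spatially pooled (compressed) set of global tokens. The window size, pooling factor, single-head attention, and additive fusion of the two branches are all assumptions for illustration, not the paper's actual design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dual_branch_attention(x, grid=8, window=4, pool=2):
    """x: (grid*grid, dim) tokens laid out on a grid x grid frame.
    Illustrative only: window/pool sizes are hypothetical choices."""
    n, d = x.shape
    xg = x.reshape(grid, grid, d)

    # Local branch: full attention restricted to each window,
    # so cost scales with window_size**2, not frame_size**2.
    local = np.empty_like(xg)
    for i in range(0, grid, window):
        for j in range(0, grid, window):
            w = xg[i:i+window, j:j+window].reshape(-1, d)
            local[i:i+window, j:j+window] = \
                attention(w, w, w).reshape(window, window, d)

    # Global branch: every token attends to a spatially pooled
    # (compressed) summary of the whole frame for overall consistency.
    pooled = (xg.reshape(grid // pool, pool, grid // pool, pool, d)
                .mean(axis=(1, 3))
                .reshape(-1, d))
    global_out = attention(x, pooled, pooled)

    # Fuse the two branches (simple sum here, as an assumption).
    return local.reshape(n, d) + global_out

tokens = np.random.default_rng(0).normal(size=(64, 16))
out = dual_branch_attention(tokens)
print(out.shape)  # (64, 16)
```

The point of the decomposition is that neither branch ever computes full attention over all tokens: the local branch is quadratic only in the window size, and the global branch is quadratic only in the much smaller pooled token count.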
Why it matters?
This work is important because it allows AI to create high-resolution videos for the first time without relying on tricks like first making a low-resolution video and then trying to upscale it. This opens the door for more realistic and detailed videos in things like movies, games, and virtual reality, and makes the whole process much more practical.
Abstract
Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion-transformer-based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.