UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
2025-11-26
Summary
This paper investigates 'video length extrapolation,' a problem where video generation models struggle to create realistic videos that are longer than the videos they were trained on.
What's the problem?
Current video generation models, specifically those built on 'diffusion transformers,' have trouble generating videos longer than the clips they were trained on. The failure shows up in two main ways: the video starts repeating content periodically, or its overall quality noticeably degrades as it gets longer. Previous attempts addressed only the repetition issue, overlooked the quality degradation, and still allowed only modest increases in length.
What's the solution?
The researchers discovered that both the repetition and the quality issues stem from a single cause: the model's 'attention' gets spread too thin when it processes longer videos. Attention is how the model decides which parts of the video to focus on when producing a coherent output. When the video is longer than anything seen during training, the extra tokens dilute the learned attention patterns, producing both failure modes. To fix this, they developed UltraViCo, a method that simply suppresses the model's attention to the parts of the video beyond the training length, so those extra parts have much less influence. The method requires no retraining and can be plugged into existing models, as sketched below.
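To make the idea concrete, here is a minimal sketch of attention suppression with a constant decay factor, written in PyTorch. This is not the authors' implementation: the function name, the particular `decay_factor` value, and the assumption that out-of-window keys simply occupy positions at or beyond `train_len` are illustrative choices.

```python
import torch

def suppress_extrapolated_attention(attn_logits, train_len, decay_factor=0.5):
    """Sketch: damp attention to key tokens beyond the training window.

    attn_logits: pre-softmax attention scores of shape (..., q_len, k_len).
    train_len:   number of key tokens covered by the training window.
    decay_factor: constant in (0, 1]; adding log(decay_factor) to the logits
        of out-of-window keys multiplies their post-softmax weight by that
        constant (before renormalization across all keys).
    """
    scores = attn_logits.clone()
    # Keys at positions >= train_len are assumed to lie outside the training window.
    scores[..., train_len:] += torch.log(
        torch.tensor(decay_factor, device=scores.device)
    )
    return torch.softmax(scores, dim=-1)

# Usage: drop in where an attention layer would normally apply a plain softmax.
logits = torch.randn(8, 1024, 1024)  # (heads, queries, keys)
attn = suppress_extrapolated_attention(logits, train_len=768, decay_factor=0.5)
```

Because the decay is applied before the softmax, in-window tokens automatically reclaim the attention mass taken from out-of-window tokens, which matches the intuition of keeping the learned attention pattern concentrated rather than masking the extra tokens out entirely.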
Why it matters?
This research matters because it significantly improves the ability of video generation models to produce longer, higher-quality videos. UltraViCo extends generation far beyond what previous methods allowed, pushing the practical limit from about 2x to 4x the training length, while improving visual quality and reducing distracting repetition. It also works with other video creation and editing tasks, such as controllable video synthesis, making it a versatile addition to AI video generation.
Abstract
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This causes quality degradation, and repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by the harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, our method outperforms a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.