Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
2025-03-18
Summary
This paper presents a theoretical analysis of Auto-Regressive Video Diffusion Models (ARVDMs), which are used to generate realistic long videos, and uses the resulting insights to improve the performance of existing models.
What's the problem?
While ARVDMs are good at generating videos, little is understood about why they work and what limits their performance. Two key problems are error accumulation (small generation errors compound as the video grows longer) and the memory bottleneck (the model can condition on only a limited amount of past information).
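Error accumulation can be illustrated with a toy autoregressive rollout, not the paper's actual model: each "frame" below is a scalar that should stay at zero, but every step injects a small sampling error, and because each frame conditions on the previous (already erroneous) one, the deviation drifts as the rollout gets longer.

```python
import random

def rollout(num_frames, step_error=0.01, seed=0):
    """Toy autoregressive rollout (illustrative only).

    Each generated 'frame' is a scalar that should remain at 0.0, but
    every step adds a small Gaussian perturbation. Since each frame is
    generated from the previous, already-perturbed frame, the per-step
    errors accumulate rather than average out.
    """
    rng = random.Random(seed)
    frame, errors = 0.0, []
    for _ in range(num_frames):
        # Condition on the previous frame and inject this step's error.
        frame = frame + rng.gauss(0.0, step_error)
        errors.append(abs(frame))  # deviation from the ideal frame
    return errors

errors = rollout(200)
```

With a fixed seed the drift scales linearly with the per-step error, so halving `step_error` halves the final deviation; in a real ARVDM the analogous per-frame sampling error compounds in the same autoregressive way.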
What's the solution?
The researchers developed Meta-ARVDM, a unified framework that subsumes most existing ARVDMs, and used it to analyze the gap between generated and true videos. This analysis isolates the error accumulation and memory bottleneck problems, and an information-theoretic impossibility result shows the memory bottleneck cannot be fully avoided. To mitigate it, they designed network structures that explicitly condition on more past frames and compress those frames so that the extra memory costs little at inference time.
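The conditioning choices above can be sketched schematically. The snippet below is a hypothetical illustration, not the paper's architecture: `window` controls how many past frames the generator sees, and an optional `compress` function maps that window to a smaller summary (here a simple mean-pool) to reduce inference cost.

```python
def generate(num_frames, window=1, compress=None):
    """Hypothetical sketch of autoregressive conditioning choices.

    - window=1 mimics plain frame-to-frame ARVDM conditioning;
    - a larger window feeds more past frames into generation,
      mitigating the memory bottleneck;
    - `compress` optionally replaces the window with a compact
      summary, trading some fidelity for efficiency.

    The next-frame rule is a toy stand-in for a denoising network.
    """
    frames = [0.0]  # fixed initial frame
    for t in range(1, num_frames):
        context = frames[-window:]          # visible past frames
        if compress is not None:
            context = compress(context)     # compressed memory
        # Toy update: next frame moves up from the context average.
        frames.append(sum(context) / len(context) + 1.0)
    return frames

mean_pool = lambda ctx: [sum(ctx) / len(ctx)]  # window -> one summary value
short_memory = generate(5, window=1)
compressed_memory = generate(5, window=4, compress=mean_pool)
```

Under window=1 the toy rollout climbs one unit per frame, while the longer compressed context produces a different, smoother trajectory; the paper's experiments explore exactly this trade-off between memory and inference efficiency.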
Why does it matter?
This work matters because it provides a better understanding of how ARVDMs work, which can lead to improved video generation models that can create longer and more realistic videos.
Abstract
A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDM -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures to explicitly use more past frames. We also achieve a significantly improved trade-off between the mitigation of the memory bottleneck and the inference efficiency by compressing the frames. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto frontier between the error accumulation and memory bottleneck across different methods.