End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin
2025-12-18
Summary
This paper focuses on improving how AI creates realistic videos, specifically by tackling a problem called 'exposure bias': during training the model only ever sees real video frames, but when generating it must build on its own, imperfect outputs, a situation it was never trained to handle.
What's the problem?
When AI models generate videos step-by-step (like predicting the next frame), they often struggle with longer videos because small errors early on can snowball into unrealistic results. This is worsened by 'exposure bias': the model only sees perfect, real videos during training, so it never learns to recover from its own mistakes when generating new videos. Previous attempts to fix this usually require a second, already-trained AI (a 'teacher') to guide the model, or a discriminator that checks the video as it's being made, both of which add cost and complexity.
What's the solution?
The researchers developed a new method called 'Resampling Forcing' that lets the AI learn to create videos without needing a second AI or real-time checking. The core idea is to deliberately corrupt the history frames *during* training, using the model's own outputs, so the AI learns to recover from the kinds of mistakes it actually makes. They also introduced 'history routing', a way for the AI to efficiently retrieve only the most relevant parts of the video history when generating each new frame, which helps with longer, more complex videos.
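The corruption step described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the model's own denoising step, and the function names and noise mixing are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy_frames, noise_level):
    """Hypothetical stand-in for the model's one-step denoiser.
    Here it just shrinks the signal; the real model would predict the noise."""
    return noisy_frames * (1.0 - 0.5 * noise_level)

def self_resample(history, noise_level, denoiser=toy_denoiser):
    """Simulate inference-time errors on clean history frames.

    Partially re-noise the ground-truth history, then let the model
    itself denoise it. The result carries the model's own error
    pattern, so training can condition on imperfect histories instead
    of perfect ones."""
    noise = rng.standard_normal(history.shape)
    noisy = (1.0 - noise_level) * history + noise_level * noise
    return denoiser(noisy, noise_level)

# Usage: 4 history frames of shape 8x8, mildly corrupted.
history = rng.standard_normal((4, 8, 8))
degraded = self_resample(history, noise_level=0.3)
```

The key design point is that the degradation comes from the model itself rather than from hand-crafted noise, so the training-time conditioning distribution matches what the model will actually see at generation time.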
Why it matters?
This research is important because it provides a more efficient and effective way to train AI models to generate high-quality, realistic videos. By eliminating the need for extra AI helpers or constant checking, it makes it easier to create longer, more consistent videos, bringing us closer to AI that can truly simulate and create visual worlds.
Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
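The 'history routing' idea in the abstract, dynamically retrieving the top-k most relevant history frames for each query, can be sketched as a parameter-free similarity lookup. This is a simplified illustration under assumed conventions: frames are represented by pooled feature vectors, and dot-product similarity stands in for whatever relevance measure the paper uses.

```python
import numpy as np

def route_history(query_feat, history_feats, k):
    """Parameter-free top-k retrieval over history frames.

    Scores each history frame by dot-product similarity with the
    query frame's feature vector and keeps the k highest-scoring
    frames. No learned parameters are involved; routing depends only
    on the features themselves."""
    scores = history_feats @ query_feat            # one score per history frame
    top_k = np.argsort(scores)[::-1][:k]           # indices of the k best matches
    return np.sort(top_k), scores

# Usage: 5 history frame features of dimension 4; the query matches frame 3.
history_feats = np.eye(5, 4)                       # toy one-hot frame features
query = np.array([0.0, 0.0, 0.0, 1.0])
idx, scores = route_history(query, history_feats, k=1)
```

Because the mechanism is parameter-free, it adds no training cost; only the selected k frames need to participate in attention, which is what makes long-horizon generation efficient.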