VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
2026-01-12
Summary
This paper introduces VideoAR, a new way to create videos using a technique called autoregressive modeling. It's designed to be faster and more efficient than current methods, while still producing high-quality results.
What's the problem?
Currently, the best video generation models rely on diffusion or flow-matching. They're very good at making videos look realistic, but they demand a lot of computing power and are hard to scale up to longer or more complex videos. In short, they're slow and expensive to run.
What's the solution?
VideoAR tackles this by predicting each frame of a video based on the frames that came before it, but in a smarter way. Within each frame, it predicts the image at multiple scales, from coarse to fine, and it uses a special 3D 'tokenizer' to encode how things move and change over time. The authors also add techniques called 'Multi-scale Temporal RoPE', 'Cross-Frame Error Correction', and 'Random Frame Mask' to keep long videos consistent, so they don't drift, blur, or distort over time. The model is trained in stages, starting with short, low-resolution videos and gradually increasing resolution and duration.
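To make the two prediction loops concrete, here is a toy sketch of the overall generation order described above: frames are produced one after another (causal, next-frame), and within each frame tokens are filled in coarse-to-fine across scales (VAR-style next-scale prediction). All names here (`SCALES`, `predict_scale`) and the dummy token logic are illustrative assumptions, not the paper's actual model.

```python
# Toy sketch of VideoAR-style generation order (illustrative, not the real model):
# outer loop = causal next-frame prediction, inner loop = next-scale prediction.

SCALES = [1, 2, 4]  # token-map side lengths per frame, coarse to fine (assumed)

def predict_scale(history, partial_frame, side):
    """Stand-in for the transformer: returns a side x side token map.
    Here we just derive deterministic dummy tokens from the context size."""
    seed = len(history) + sum(len(m) for m in partial_frame)
    return [[(seed + r + c) % 256 for c in range(side)] for r in range(side)]

def generate(num_frames):
    video = []                   # list of frames; each frame is a list of scale maps
    for _ in range(num_frames):  # causal: each frame conditions on all prior frames
        frame = []
        for side in SCALES:      # intra-frame: each scale conditions on coarser ones
            frame.append(predict_scale(video, frame, side))
        video.append(frame)
    return video

clip = generate(3)
assert len(clip) == 3
assert [len(m) for m in clip[0]] == [1, 2, 4]  # coarse-to-fine maps per frame
```

The key point of the structure is that the expensive iterative refinement of diffusion is replaced by a fixed, small number of prediction steps per frame (one per scale), which is where the speedup comes from.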
Why it matters?
VideoAR is important because it shows that autoregressive models can compete with the best diffusion-based models on video quality while using over ten times fewer inference steps. That speed advantage makes it a more practical option for many applications and opens the door for future research in efficient video generation.
Abstract
Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
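The abstract's "Multi-scale Temporal RoPE" is not specified in this summary, but its likely ingredient, rotary position embedding (RoPE) applied along the frame (time) axis, can be sketched briefly: each query/key vector is rotated, pair of dimensions by pair, through angles proportional to its frame index, so attention scores depend on relative temporal distance. This is a minimal sketch of standard RoPE under that assumption; the paper's multi-scale variant may differ.

```python
import math

def temporal_rope(vec, frame_idx, base=10000.0):
    """Rotate consecutive dimension pairs of `vec` by frame-dependent angles.
    Standard RoPE applied to the time axis; a guess at one ingredient of
    the paper's Multi-scale Temporal RoPE, not its exact formulation."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = frame_idx / (base ** (i / d))  # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

q = [1.0, 0.0, 1.0, 0.0]
assert temporal_rope(q, 0) == q  # frame 0: zero rotation, vector unchanged
```

Because rotations preserve inner-product structure, a query at frame t and a key at frame t' interact through their offset t - t', which is the property that lets a causal next-frame model generalize across positions in long videos.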