Taming Teacher Forcing for Masked Autoregressive Video Generation
Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
2025-01-22

Summary
This paper introduces MAGI, a new way to make AI generate videos. It's like teaching a computer to draw a comic strip frame by frame, but with moving pictures instead of still ones.
What's the problem?
Current methods for making AI generate videos have trouble creating long, smooth sequences that look natural. It's like trying to make a flip book animation where each page doesn't quite match up with the ones before and after it. This makes the videos look choppy or unrealistic, especially when trying to make longer videos.
What's the solution?
The researchers came up with a clever trick called Complete Teacher Forcing (CTF). It's like showing the AI a few frames of a video and asking it to predict what comes next, but always giving it clear, complete pictures to work from instead of the partially hidden frames that earlier methods used. They also added special training techniques to help the AI avoid building up mistakes over time. The resulting method, MAGI, can create videos that are over 100 frames long, even though it only learned from 16-frame clips.
Why does it matter?
This matters because it could make AI-generated videos look much more realistic and natural. Imagine being able to type in a description and have a computer create a whole movie scene for you. This technology could be used in video games, special effects for movies, or even in education to create visual explanations of complex topics. It's a big step towards making AI that can create longer, more coherent visual stories without needing humans to draw every frame.
Abstract
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
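The core difference between CTF and MTF is what the model sees as context when predicting a frame's masked tokens. A minimal toy sketch of that conditioning difference is below; the `mask_tokens` helper, the frame/token shapes, and the mask ratio are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(frame, mask_ratio, rng):
    """Hide a random fraction of a frame's tokens (zeroed out),
    returning the masked frame and the boolean mask of hidden positions."""
    mask = rng.random(frame.shape[0]) < mask_ratio
    masked = frame.copy()
    masked[mask] = 0.0
    return masked, mask

# Toy "video": T frames, each a vector of N tokens (stand-ins for patch embeddings).
T, N = 4, 8
video = rng.standard_normal((T, N))
t = T - 1  # frame whose masked tokens the model must reconstruct

target_masked, target_mask = mask_tokens(video[t], 0.5, rng)

# Masked Teacher Forcing (MTF): past frames in the context are themselves masked.
mtf_context = np.stack([mask_tokens(video[s], 0.5, rng)[0] for s in range(t)])

# Complete Teacher Forcing (CTF): past frames are fully observed, matching
# inference, where previously generated frames are complete.
ctf_context = video[:t].copy()
```

Under CTF the training context matches what the model sees at generation time (complete earlier frames), which is the mismatch the paper identifies as the source of MTF's weaker video-prediction quality.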